BUG REPORT
kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Environment:
kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15: 31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Scaleway baremetal C2S
Ubuntu Xenial (16.04 LTS) (GNU/Linux 4.4.122-mainline-rev1 x86_64)
uname -a:
Linux amd64-master-1 4.4.122-mainline-rev1 #1 SMP Sun Mar 18 10:44:19 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
When trying to upgrade from 1.9.6 to 1.10.0, I get this error:
kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests411909119/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [arm-master-1] and IPs [10.1.244.57]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests180476754/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state
What I expected: a successful upgrade.
How to reproduce: install the 1.9.6 packages and init a 1.9.6 cluster:
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | tee /etc/apt/sources.list.d/kubernetes.list
apt-get update -qq
apt-get install -qy kubectl=1.9.6-00
apt-get install -qy kubelet=1.9.6-00
apt-get install -qy kubeadm=1.9.6-00
Edit kubeadm-config and change featureGates from a string to a map, as reported in https://github.com/kubernetes/kubernetes/issues/61764.
kubectl -n kube-system edit cm kubeadm-config
....
featureGates: {}
....
Download kubeadm 1.10.0 and run kubeadm upgrade plan and kubeadm upgrade apply v1.10.0.
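You can check the featureGates value before running kubeadm 1.10.0 without opening an editor. The sketch below makes two assumptions. First, the stored configuration lives under the MasterConfiguration data key of the kubeadm-config ConfigMap. Second, the problematic form is a scalar such as featureGates: "". The helper name check_feature_gates is hypothetical:

```bash
# Hypothetical helper: reads a kubeadm MasterConfiguration (YAML) on stdin and
# prints "map" if featureGates is written as a map ({} or an indented block),
# "string" if it is a scalar such as "", or "missing" if the key is absent.
check_feature_gates() {
  local line value
  line=$(grep -E '^[[:space:]]*featureGates:' | head -n 1)
  if [ -z "$line" ]; then
    echo missing
    return
  fi
  # Keep only what follows "featureGates:" and drop all whitespace.
  value=$(printf '%s' "${line#*featureGates:}" | tr -d '[:space:]')
  case "$value" in
    '{'* | '') echo map ;;    # inline map, or a block map on the next lines
    *) echo string ;;
  esac
}

# Usage against a live cluster (assumes the MasterConfiguration data key):
#   kubectl -n kube-system get cm kubeadm-config \
#     -o jsonpath='{.data.MasterConfiguration}' | check_feature_gates
```

The helper only inspects the first featureGates line, so it will not catch every malformed configuration.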
Working on reproducing this bug locally.
After retrying 10 times, it finally worked.
Here is my etcd manifest diff:
```
root@vagrant:~# diff /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml
16,17c16,17
<     - --listen-client-urls=https://127.0.0.1:2379
<     - --advertise-client-urls=https://127.0.0.1:2379
---
>     - --listen-client-urls=http://127.0.0.1:2379
>     - --advertise-client-urls=http://127.0.0.1:2379
19,27c19
<     - --key-file=/etc/kubernetes/pki/etcd/server.key
<     - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
<     - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
<     - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
<     - --client-cert-auth=true
<     - --peer-client-cert-auth=true
<     - --cert-file=/etc/kubernetes/pki/etcd/server.crt
<     - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
<     image: gcr.io/google_containers/etcd-amd64:3.1.12
---
>     image: gcr.io/google_containers/etcd-amd64:3.1.11
29,35d20
<       exec:
<         command:
<         - /bin/sh
<         - -ec
<         - ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
<           --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
<           get foo
36a22,26
>       httpGet:
>         host: 127.0.0.1
>         path: /health
>         port: 2379
>         scheme: HTTP
43,45c33
<       name: etcd-data
<     - mountPath: /etc/kubernetes/pki/etcd
<       name: etcd-certs
---
>       name: etcd
51,55c39
<     name: etcd-data
<   - hostPath:
<       path: /etc/kubernetes/pki/etcd
<       type: DirectoryOrCreate
<     name: etcd-certs
---
>     name: etcd
root@vagrant:~# ls /etc/kubernetes/pki/etcd
ca.crt  ca.key  healthcheck-client.crt  healthcheck-client.key  peer.crt  peer.key  server.crt  server.key
```
1.9.6 cluster on Ubuntu 17.10 Vagrant:
root@vagrant:/vagrant# 1.10_kubernetes/server/bin/kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests262738652/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [vagrant] and IPs [10.0.2.15]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug=""]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: net/http: TLS handshake timeout]
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""]
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state
This is my reproduction environment: https://github.com/stealthybox/vagrant-kubeadm-testing
Change these lines to 1.9.6-00 for the bootstrap: https://github.com/stealthybox/vagrant-kubeadm-testing/blob/9d4493e990c9bd742107b317641267c3ef3640cd/Vagrantfile#L18-L20
Then download the 1.10 server binaries into the repo; they will be available on the guest in /vagrant:
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#server-binaries
etcd-related kubelet logs:
root@vagrant:~# journalctl -xefu kubelet | grep -i etcd
Mar 28 16:32:07 vagrant kubelet[14676]: W0328 16:32:07.808776 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:32:07 vagrant kubelet[14676]: I0328 16:32:07.880412 14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:34:27 vagrant kubelet[14676]: W0328 16:34:27.472534 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:57:33 vagrant kubelet[14676]: W0328 16:57:33.683648 14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725564 14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725637 14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:35 vagrant kubelet[14676]: E0328 16:57:35.484901 14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)", container etcd: selfLink was empty, can't make reference
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889458 14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889595 14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd" (OuterVolumeSpecName: "etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320"). InnerVolumeSpecName "etcd". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.989892 14676 reconciler.go:297] Volume detached for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") on node "vagrant" DevicePath ""
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.688878 14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Timeout: request did not complete within allowed duration
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.841447 14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfbc5", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-certs\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:58:33 vagrant kubelet[14676]: E0328 16:58:33.844276 14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfb82", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-data\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.692450 14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.848007 14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff641f915f", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"7278f85057e8bf5cb81c9f96d3b25320", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.containers{etcd}"}, Reason:"Killing", Message:"Killing container with id docker://etcd:Need to kill Pod", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.472661 14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.473138 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473190 14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473658 14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: W0328 16:59:15.481336 14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.483705 14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.497391 14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:00:34 vagrant kubelet[14676]: W0328 17:00:34.475851 14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.720076 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.720107 14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""; some request body already written
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.725335 14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: I0328 17:01:07.728709 14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.734475 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.740642 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:09 vagrant kubelet[14676]: E0328 17:01:09.484412 14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)", container etcd: selfLink was empty, can't make reference
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.848794 14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849282 14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849571 14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data" (OuterVolumeSpecName: "etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-data". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849503 14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs" (OuterVolumeSpecName: "etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-certs". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949925 14676 reconciler.go:297] Volume detached for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") on node "vagrant" DevicePath ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949975 14676 reconciler.go:297] Volume detached for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") on node "vagrant" DevicePath ""
The current workaround is to keep retrying the upgrade; at some point it will succeed.
@stealthybox Do you happen to have the docker logs for the etcd container? Also, grep -i etcd may be masking part of the kubelet output, for example error messages that don't include the container name but are still relevant.
I just hit another weird edge case related to this bug. kubeadm upgrade marked the etcd upgrade as complete before the new etcd image was pulled and the new static pod was deployed. This causes the upgrade to time out at a later step and the rollback to fail. It also leaves the cluster in a broken state: the original etcd static pod manifest has to be restored to recover the cluster.
Oh yes, I'm stuck there too. My cluster is completely down. Can someone share some instructions on how to rescue it from this state?
Been there on my second upgrade attempt, just as @detiber described; quite painful. :cry:
I found some backup material in /etc/kubernetes/tmp and, suspecting etcd might be the culprit, copied its old manifest over the new one in the manifests folder. At that point I had nothing to lose, because I had completely lost control of the cluster. I don't remember exactly, but I think I then rebooted the whole machine and later downgraded everything to v1.9.6. Eventually I regained control of the cluster and lost any motivation to mess with v1.10.0 again. It was no fun at all...
If you roll back the etcd static pod manifest from /etc/kubernetes/tmp, it's important to also roll back the apiserver manifest to the 1.9 version, because of the new TLS configuration in 1.10.
^ you probably won't need to do this, because I believe the etcd upgrade blocks the rest of the control plane upgrade.
It seems only the etcd manifest is not rolled back on a failed upgrade; everything else is fine. After moving the backup manifest into place and restarting the kubelet, everything comes back up fine.
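The recovery described in the comment above can be scripted: move the backed-up etcd manifest back, then restart the kubelet. The sketch assumes the default kubeadm paths seen in the logs (/etc/kubernetes/tmp/kubeadm-backup-manifests*/ and /etc/kubernetes/manifests). It also assumes the newest backup directory is the right one. restore_etcd_manifest is a hypothetical name. The replaced manifest is saved outside the manifests directory, because the kubelet may treat any extra file there as another static pod. As noted earlier in the thread, the apiserver manifest may also need reverting.

```bash
# Hypothetical helper: put the newest kubeadm etcd backup manifest back in place.
# Args (optional): TMP_DIR (default /etc/kubernetes/tmp),
#                  MANIFESTS_DIR (default /etc/kubernetes/manifests).
restore_etcd_manifest() {
  local tmp_dir=${1:-/etc/kubernetes/tmp}
  local manifests_dir=${2:-/etc/kubernetes/manifests}
  local backup
  # Newest backup directory first (sorted by modification time); ends with "/".
  backup=$(ls -1dt "$tmp_dir"/kubeadm-backup-manifests*/ 2>/dev/null | head -n 1)
  if [ -z "$backup" ] || [ ! -f "${backup}etcd.yaml" ]; then
    echo "no etcd backup manifest found under $tmp_dir" >&2
    return 1
  fi
  # Keep the failed manifest outside the manifests dir so the kubelet ignores it.
  if [ -f "$manifests_dir/etcd.yaml" ]; then
    cp "$manifests_dir/etcd.yaml" "$tmp_dir/etcd.yaml.failed-upgrade"
  fi
  cp "${backup}etcd.yaml" "$manifests_dir/etcd.yaml"
  echo "restored ${backup}etcd.yaml"
}

# Usage (as root):  restore_etcd_manifest && systemctl restart kubelet
```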
I hit the same timeout issue. kubeadm rolled the kube-apiserver manifest back to 1.9.6 but left the etcd manifest as it was (read: with TLS enabled), which obviously made the apiserver fail miserably and effectively broke my master node. A good candidate for a separate issue report, I suppose.
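The broken state described here is easy to spot: etcd serves TLS while the rolled-back apiserver still talks plain HTTP to it. Comparing the URL schemes in the two static pod manifests shows the mismatch. The sketch below greps --listen-client-urls in etcd.yaml and --etcd-servers in kube-apiserver.yaml. It assumes the default manifest directory and looks only at the first URL of each flag. The function names are hypothetical:

```bash
# Hypothetical helper: print the URL scheme (http/https) of the first
# --FLAG=scheme://... occurrence in a static pod manifest.
flag_scheme() {
  sed -n "s|.*--$2=\([a-z]*\)://.*|\1|p" "$1" | head -n 1
}

# Hypothetical check: etcd's client listener and the apiserver's etcd client
# should agree on http vs https.
check_etcd_tls_consistency() {
  local dir=${1:-/etc/kubernetes/manifests}
  local etcd api
  etcd=$(flag_scheme "$dir/etcd.yaml" listen-client-urls)
  api=$(flag_scheme "$dir/kube-apiserver.yaml" etcd-servers)
  if [ -n "$etcd" ] && [ "$etcd" = "$api" ]; then
    echo "ok: etcd and kube-apiserver both use $etcd"
  else
    echo "mismatch: etcd=${etcd:-?} kube-apiserver=${api:-?}" >&2
    return 1
  fi
}
```

A mismatch means one of the two manifests needs restoring before the apiserver can come back.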
@dvdmuckle @codepainters, unfortunately, whether the rollback succeeds depends on which component (etcd or the API server) hits the race condition. I found a fix for the race condition, but it completely breaks kubeadm upgrade. I'm working with @stealthybox to find a proper way to fix the upgrade.
@codepainters I think it's the same issue.
There are a few underlying problems causing this issue:
As a result, the upgrade currently succeeds only when a pod status update for the etcd pod happens to change the hash before the kubelet picks up the new static manifest for etcd. In addition, the API server needs to remain available for the first part of the apiserver upgrade, when the upgrade tooling queries the API before updating the apiserver manifest.
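The wait described above boils down to one loop. It polls until a hash read from the API differs from the hash recorded before the manifest was swapped. If the value never changes, or the API server is unreachable, the loop runs into its timeout. The sketch below is a simplified, hypothetical shell model of such a wait, not kubeadm's actual Go code. Its usage line reads the kubernetes.io/config.hash annotation that the kubelet sets on mirror pods:

```bash
# Hypothetical helper: wait_for_change PREVIOUS TIMEOUT_SECONDS CMD [ARGS...]
# Re-runs CMD once per second until it prints something non-empty that differs
# from PREVIOUS; prints the new value, or fails after TIMEOUT_SECONDS.
wait_for_change() {
  local previous=$1 timeout=$2 elapsed=0 current
  shift 2
  while [ "$elapsed" -lt "$timeout" ]; do
    current=$("$@" 2>/dev/null)
    if [ -n "$current" ] && [ "$current" != "$previous" ]; then
      echo "$current"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "timed out waiting for the condition" >&2
  return 1
}

# Usage sketch (node name "vagrant" as in the logs above):
#   pod_hash() { kubectl -n kube-system get pod "$1" \
#     -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'; }
#   old=$(pod_hash etcd-vagrant)
#   # ...the new /etc/kubernetes/manifests/etcd.yaml is written here...
#   wait_for_change "$old" 300 pod_hash etcd-vagrant
```

Empty reads, such as the API being briefly unavailable, are treated as "not changed yet" instead of as a change.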
@detiber and I got on a call to discuss the changes we need to make to the upgrade process.
We plan to implement 3 fixes for this bug in the patch releases:
1. Remove etcd TLS from the upgrade.
   - The current upgrade loop applies modifications in batches per component, serially.
   - Upgrading a component has no knowledge of the configuration of dependent components.
   - Verifying an upgrade requires the APIServer to be available to check pod status.
   - Etcd TLS requires a coupled etcd + apiserver configuration change, which breaks this contract.
   - This is the minimum viable change to fix this issue, and it leaves upgraded clusters with insecure etcd.
2. Fix the mirror pod hash race condition on pod status change.
   - https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/upgrade/staticpods.go#L189
   - Upgrades will then be correct, assuming compatibility between the etcd and apiserver flags.
3. Upgrade TLS specifically, in a separate phase.
   - Etcd and the APIServer need to be upgraded together.
   - kubeadm alpha phase ensure-etcd-tls?
   - This phase should be runnable independently of a cluster upgrade.
   - During a cluster upgrade, this phase should run before upgrading all the components.
For 1.11, we want:
- alternative: use the CRI to get pod info (viable demo using crictl).
- caveat: the CRI in dockershim, and possibly in other container runtimes, does not support backwards compatibility for CRI breaking changes.
PR to fix the static pod upgrade race condition: https://github.com/kubernetes/kubernetes/pull/61942
PR for the release-1.10 branch: https://github.com/kubernetes/kubernetes/pull/61954
@detiber, would you mind explaining which race condition we're talking about? I'm not that familiar with kubeadm internals, but it sounds interesting.
@codepainters see https://github.com/kubernetes/kubeadm/issues/740#issuecomment-377263347
FYI - same problem upgrading from 1.9.3.
I tried the workaround of retrying several times. It finally hit the race condition with the API server, and the upgrade could not be rolled back.
@stealthybox thx, I didn't get it on the first read.
I'm having the same problem during the upgrade:
[ERROR APIServerHealth]: the API Server is unhealthy; /healthz didn't return "ok"
[ERROR MasterNodesReady]: couldn't list masters in cluster: Get https.......
Please help me with this. I'm upgrading from 1.9.3 to 1.10.0. Initially, it was able to get as far as "[upgrade/staticpods] Waiting for the kubelet to restart the component".
The temporary workaround is to ensure the certificates and upgrade the etcd and apiserver pods, bypassing the checks.
Be sure to check your config and add flags for your use case:
kubectl -n kube-system edit cm kubeadm-config # change featureGates
...
featureGates: {}
...
kubeadm alpha phase certs all
kubeadm alpha phase etcd local
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config
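For repeatability, the phase commands above can be wrapped in a small script that stops at the first failing phase. The sketch below is hypothetical. DRY_RUN=1 only prints the commands. KUBEADM can point at a specific binary, for example the 1.10 kubeadm in /vagrant. The --config argument for upload-config follows a later comment in this thread: that flag is mandatory, and an empty file works when no custom config is used. Check the featureGates edit first.

```bash
# Hypothetical wrapper around the manual workaround phases.
# Usage: [DRY_RUN=1] [KUBEADM=/path/to/kubeadm] manual_control_plane_upgrade [CONFIG]
maybe_run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"          # only show what would run
  else
    "$@"
  fi
}

manual_control_plane_upgrade() {
  local kubeadm_bin=${KUBEADM:-kubeadm}
  local config=${1:-}
  local phase
  if [ -z "$config" ]; then
    config=$(mktemp)   # empty config file: no custom settings
  fi
  for phase in "certs all" "etcd local" "controlplane all"; do
    # $phase is split on purpose into "<phase> <subphase>"
    maybe_run "$kubeadm_bin" alpha phase $phase || return
  done
  maybe_run "$kubeadm_bin" alpha phase upload-config --config "$config"
}
```

Running it once with DRY_RUN=1 shows exactly which commands would be executed against the node.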
Thanks @stealthybox
For me, the upgrade apply process stalled at [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"...
but the cluster was upgraded successfully.
@stealthybox I'm not sure, but something seems to be broken after these steps, because
kubeadm upgrade plan
hangs afterwards:
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.10.1
[upgrade/versions] kubeadm version: v1.10.1
[upgrade/versions] Latest stable version: v1.10.1
When I applied the upgrade, it also hung at [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"...
@kvaps @stealthybox this is most likely an etcd issue (kubeadm speaks plain HTTP/2 to a TLS-enabled etcd); I hit it too. See this other issue: https://github.com/kubernetes/kubeadm/issues/755
Honestly, I can't understand why the same TCP port is used for both the TLS and non-TLS etcd listeners; it only causes problems like this one. A plain old "connection refused" would give an immediate hint; here I had to resort to tcpdump to understand what was going on.
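tcpdump is one way to see which protocol the local etcd answers on. A quicker check is to hit its /health endpoint both ways with a timeout. A TLS-only listener then shows up as an HTTPS success, not as a hung plain-HTTP request. The sketch assumes curl is installed and uses the certificate paths from the etcd manifest diff earlier in this thread (the healthcheck-client pair kubeadm 1.10 generates). probe_etcd is a hypothetical name:

```bash
# Hypothetical helper: probe_etcd [HOST:PORT] [PKI_DIR]
# Prints "https" or "http" depending on which scheme answers /health, else "none".
probe_etcd() {
  local endpoint=${1:-127.0.0.1:2379}
  local pki=${2:-/etc/kubernetes/pki/etcd}
  # TLS with a client certificate, as required by --client-cert-auth=true.
  if curl -fsS --max-time 5 \
      --cacert "$pki/ca.crt" \
      --cert "$pki/healthcheck-client.crt" \
      --key "$pki/healthcheck-client.key" \
      "https://$endpoint/health" >/dev/null 2>&1; then
    echo https
    return 0
  fi
  # Plain HTTP; --max-time turns a silent hang into a failure.
  if curl -fsS --max-time 5 "http://$endpoint/health" >/dev/null 2>&1; then
    echo http
    return 0
  fi
  echo none
  return 1
}
```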
OH!
Yes, you're right: this only works with my local TLS patch for the etcd status check.
Do this to finish the upgrade:
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config
Edited the workaround above to be correct.
@stealthybox the second kubeadm command doesn't work:
# kubeadm alpha phase upload-config
The --config flag is mandatory
@renich just provide the file path of your config.
If you don't use any custom config, you can pass an empty file.
Here's a simple way to do that in bash:
1.10_kubernetes/server/bin/kubeadm alpha phase upload-config --config <(echo)
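The same workaround can also be scripted without bash process substitution by writing the empty config to a temporary file. This is only a sketch: it assumes the kubeadm v1.10 alpha phase subcommands quoted above and that an empty file is an acceptable --config, as the comment states; the function name finish_upgrade and the KUBEADM variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the manual workaround quoted above.
# KUBEADM may point at a specific binary, e.g. 1.10_kubernetes/server/bin/kubeadm.
KUBEADM="${KUBEADM:-kubeadm}"

finish_upgrade() {
  local cfg rc
  # An empty config file satisfies the mandatory --config flag when no
  # custom configuration is used.
  cfg="$(mktemp)" || return 1
  # Regenerate the control plane static pod manifests, then upload the config.
  "$KUBEADM" alpha phase controlplane all \
    && "$KUBEADM" alpha phase upload-config --config "$cfg"
  rc=$?
  # Always clean up the temporary file and propagate kubeadm's exit status.
  rm -f "$cfg"
  return "$rc"
}
```

For example, KUBEADM=1.10_kubernetes/server/bin/kubeadm finish_upgrade would run both phases with the binary from the release tarball.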
This should now be resolved by the merge of https://github.com/kubernetes/kubernetes/pull/62655 and will be part of the v1.10.2 release.
I can confirm that upgrading 1.10.0 -> 1.10.2 with kubeadm 1.10.2 went smoothly, with no timeout.
I still get a timeout on 1.10.0 -> 1.10.2, but a different one:
[upgrade/staticpods] Waiting for the kubelet to restart the component
Static pod: kube-apiserver-master hash: a273591d3207fcd9e6fd0c308cc68d64
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]
I'm not sure what to do...
@denis111 check the API server logs while running the upgrade, using docker ps. I think you may be hitting an issue I'm facing as well.
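That docker ps check can be scripted roughly as follows. It is a minimal sketch, assuming the Docker runtime and the kubelet's k8s_<container>_<pod>_<namespace>_... container naming; the function name apiserver_logs is hypothetical. Exited containers are included because a crashing API server may already be gone from plain docker ps output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last log lines of every kube-apiserver
# container (running or exited) started by the kubelet through Docker.
apiserver_logs() {
  local lines="${1:-50}" id
  # Kubelet-managed containers are named k8s_<container>_<pod>_..., so this
  # name filter matches the API server containers but not the pod's pause
  # container (k8s_POD_...).
  for id in $(docker ps -a --filter "name=k8s_kube-apiserver" --format '{{.ID}}'); do
    echo "=== container $id ==="
    docker logs --tail "$lines" "$id" 2>&1
  done
}
```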
@dvdmuckle Well, I don't see any errors.
And I think the kube-apiserver hash doesn't change during the upgrade.
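The "Static pod: ... hash:" lines and the "Waiting for the kubelet to restart the component" step suggest kubeadm polls the mirror pod until its hash changes. A sketch of doing the same poll by hand, assuming the hash is exposed as the kubernetes.io/config.hash annotation on the mirror pod and that kubectl can reach the API server; the function names and the timeout/interval defaults are hypothetical, not kubeadm's own values.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for watching a static pod's hash during an upgrade.

# pod_hash <component> <node>: print the config hash annotation of the
# mirror pod <component>-<node> in kube-system.
pod_hash() {
  kubectl -n kube-system get pod "$1-$2" \
    -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}' 2>/dev/null
}

# wait_hash_change <component> <node> <old_hash> [timeout_s] [interval_s]
# Poll until the hash differs from <old_hash>, printing the new hash.
wait_hash_change() {
  local component="$1" node="$2" old="$3"
  local timeout="${4:-300}" interval="${5:-5}" waited=0 cur
  (( interval > 0 )) || interval=1   # avoid an endless loop
  while (( waited < timeout )); do
    cur="$(pod_hash "$component" "$node")"
    if [[ -n "$cur" && "$cur" != "$old" ]]; then
      echo "$cur"
      return 0
    fi
    sleep "$interval"
    (( waited += interval ))
  done
  echo "timed out waiting for $component-$node to change hash" >&2
  return 1
}
```

For instance, wait_hash_change kube-apiserver master a273591d3207fcd9e6fd0c308cc68d64 would show whether the hash from the log above ever changes.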
I have an ARM64 cluster on 1.9.3 and successfully upgraded it to 1.9.7, but hit the same timeout problem upgrading from 1.9.7 to 1.10.2.
I even tried editing and recompiling kubeadm with increased timeouts (like these latest commits: https://github.com/anguslees/kubernetes/commits/kubeadm-gusfork), with the same results.
$ sudo kubeadm upgrade apply v1.10.2 --force
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.2"
[upgrade/versions] Cluster version: v1.9.7
[upgrade/versions] kubeadm version: v1.10.2-dirty
[upgrade/version] Found 1 potential version compatibility errors but skipping since the --force flag is set:
- Specified version to upgrade to "v1.10.2" is higher than the kubeadm version "v1.10.2-dirty". Upgrade kubeadm first using the tool you used to install kubeadm
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.2"...
Static pod: kube-apiserver-kubemaster1 hash: ed7578d5bf9314188dca798386bcfb0e
Static pod: kube-controller-manager-kubemaster1 hash: e0c3f578f1c547dcf9996e1d3390c10c
Static pod: kube-scheduler-kubemaster1 hash: 52e767858f52ac4aba448b1a113884ee
[upgrade/etcd] Upgrading to TLS for etcd
Static pod: etcd-kubemaster1 hash: 413224efa82e36533ce93e30bd18e3a8
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/etcd.yaml"
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing etcd/server certificate and key.
[certificates] Using the existing etcd/peer certificate and key.
[certificates] Using the existing etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/etcd.yaml"
[upgrade/staticpods] Not waiting for pod-hash change for component "etcd"
[upgrade/etcd] Waiting for etcd to become available
[util/etcd] Waiting 30s for initial delay
[util/etcd] Attempting to get etcd status 1/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 2/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 3/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 4/10
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-scheduler.yaml"
[upgrade/staticpods] The etcd manifest will be restored if component "kube-apiserver" fails to upgrade
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing apiserver-etcd-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]
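The [util/etcd] lines in this log show kubeadm's wait pattern: an initial delay, then up to 10 status attempts 15s apart. A hypothetical shell equivalent, useful for checking by hand whether a component comes back before retrying the upgrade; the function name retry_with_delay is an assumption, and the command it retries is whatever probe you trust.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper mirroring the log's pattern:
# retry_with_delay <initial_delay_s> <attempts> <interval_s> <command...>
retry_with_delay() {
  local initial="$1" attempts="$2" interval="$3" i
  shift 3
  echo "Waiting ${initial}s for initial delay" >&2
  sleep "$initial"
  for (( i = 1; i <= attempts; i++ )); do
    echo "Attempting status check $i/$attempts" >&2
    if "$@"; then
      return 0
    fi
    # No sleep after the final failed attempt.
    if (( i < attempts )); then
      echo "Waiting ${interval}s until next retry" >&2
      sleep "$interval"
    fi
  done
  return 1
}
```

For example, retry_with_delay 30 10 15 curl -sf http://127.0.0.1:2379/health reproduces the timing above against a plain-HTTP etcd; a TLS-enabled etcd would need the certificate flags as well.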
Upgrading v1.10.2 -> v1.10.2 (which may be nonsense, just testing...)
Ubuntu 16.04.
And it fails with an error:
kubeadm upgrade apply v1.10.2
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]
I wonder whether this is still tracked in some issue... I couldn't find one.
I'm also seeing upgrades still failing with the timed out waiting for the condition error.
Edit: discussion moved to a new ticket, https://github.com/kubernetes/kubeadm/issues/850; please discuss there.
If anyone else hits this problem with 1.9.x:
If you're on AWS with custom hostnames, you need to edit the kubeadm-config ConfigMap and set nodeName to the AWS internal name (ip-xx-xx-xx-xx.$REGION.compute.internal):
kubectl -n kube-system edit cm kubeadm-config -oyaml
This is in addition to setting the etcd client to http. I'm not on the later versions yet, so I can't tell whether they fixed this.
This is because kubeadm tries to read this path from the API: /api/v1/namespaces/kube-system/pods/kube-apiserver-$NodeName
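A sketch for deriving that nodeName and checking, before the upgrade, that the path kubeadm reads actually resolves. It assumes the default EC2 private DNS naming (ip-<private IPv4 with dashes>.<region>.compute.internal, except us-east-1, which uses ec2.internal) and a kubectl configured against the cluster; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the AWS nodeName problem described above.

# aws_node_name <private-ipv4> <region>: default EC2 private DNS name.
aws_node_name() {
  local ip="$1" region="$2" suffix
  local re='^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
  [[ "$ip" =~ $re ]] || return 1   # basic dotted-quad sanity check
  # Assumption: us-east-1 uses ec2.internal, other regions <region>.compute.internal.
  if [[ "$region" == "us-east-1" ]]; then
    suffix="ec2.internal"
  else
    suffix="$region.compute.internal"
  fi
  echo "ip-${ip//./-}.$suffix"
}

# apiserver_pod_path <nodeName>: the API path kubeadm reads, per the comment above.
apiserver_pod_path() {
  echo "/api/v1/namespaces/kube-system/pods/kube-apiserver-$1"
}

# check_apiserver_pod <nodeName>: succeed only if that mirror pod exists.
check_apiserver_pod() {
  local path
  path="$(apiserver_pod_path "$1")"
  # kubectl get --raw sends a GET request for an arbitrary API path.
  if kubectl get --raw "$path" >/dev/null 2>&1; then
    echo "found: $path"
  else
    echo "missing: $path (check nodeName in the kubeadm-config ConfigMap)" >&2
    return 1
  fi
}
```

For example, check_apiserver_pod "$(aws_node_name 10.0.1.23 eu-west-1)" reports whether a nodeName of that form matches an existing kube-apiserver mirror pod.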
Since the timeout was increased in 1.10.6, I successfully upgraded my 1.9.7 deployment to 1.10.6 a few weeks ago.
I'm planning to upgrade to 1.11.2 as soon as the .deb packages are ready, since the same changes are in that version.
My cluster runs on-premises on ARM64 boards.
The temporary workaround is to make sure the certificates are in place and upgrade the etcd and apiserver pods, bypassing the checks.
Be sure to check your config and add flags for your use case: