BUG REPORT
kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Environment:
kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Scaleway baremetal C2S
Ubuntu Xenial (16.04 LTS) (GNU/Linux 4.4.122-mainline-rev1 x86_64)
uname -a:
Linux amd64-master-1 4.4.122-mainline-rev1 #1 SMP Sun Mar 18 10:44:19 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Trying to upgrade from 1.9.6 to 1.10.0, I get this error:
kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests411909119/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [arm-master-1] and IPs [10.1.244.57]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests180476754/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state
Successful upgrade
Install the 1.9.6 packages and init a 1.9.6 cluster:
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | tee /etc/apt/sources.list.d/kubernetes.list
apt-get update -qq
apt-get install -qy kubectl=1.9.6-00
apt-get install -qy kubelet=1.9.6-00
apt-get install -qy kubeadm=1.9.6-00
Edit the kubeadm-config and change featureGates from a string to a map as reported in https://github.com/kubernetes/kubernetes/issues/61764.
kubectl -n kube-system edit cm kubeadm-config
....
featureGates: {}
....
Download kubeadm 1.10.0 and run kubeadm upgrade plan and kubeadm upgrade apply v1.10.0.
Working on reproducing this bug locally.
After retrying this about 10 times it finally worked.
Here is the diff of my etcd manifests:
```
root@vagrant:~# diff /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml
16,17c16,17
<     - --listen-client-urls=https://127.0.0.1:2379
<     - --advertise-client-urls=https://127.0.0.1:2379
---
>     - --listen-client-urls=http://127.0.0.1:2379
>     - --advertise-client-urls=http://127.0.0.1:2379
19,27c19
<     - --key-file=/etc/kubernetes/pki/etcd/server.key
<     - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
<     - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
<     - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
<     - --client-cert-auth=true
<     - --peer-client-cert-auth=true
<     - --cert-file=/etc/kubernetes/pki/etcd/server.crt
<     - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
<     image: gcr.io/google_containers/etcd-amd64:3.1.12
---
>     image: gcr.io/google_containers/etcd-amd64:3.1.11
29,35d20
<       exec:
<         command:
<         - /bin/sh
<         - -ec
<         - ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
<           --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
<           get foo
36a22,26
>       httpGet:
>         host: 127.0.0.1
>         path: /health
>         port: 2379
>         scheme: HTTP
43,45c33
<       name: etcd-data
<     - mountPath: /etc/kubernetes/pki/etcd
<       name: etcd-certs
---
>       name: etcd
51,55c39
<       name: etcd-data
<     - hostPath:
<         path: /etc/kubernetes/pki/etcd
<         type: DirectoryOrCreate
<       name: etcd-certs
---
>       name: etcd
root@vagrant:~# ls /etc/kubernetes/pki/etcd
ca.crt  ca.key  healthcheck-client.crt  healthcheck-client.key  peer.crt  peer.key  server.crt  server.key
```
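Given that diff, a quick way to tell which generation of the etcd manifest is currently deployed (the TLS-enabled 1.10 one vs. the plain-HTTP 1.9 one) is to check the scheme in --listen-client-urls. A small sketch, with an illustrative function name and kubeadm's default manifest path assumed:

```shell
#!/usr/bin/env bash
# Print "tls" if the etcd static pod manifest serves clients over https,
# "plain" otherwise. Defaults to kubeadm's standard manifest location.
etcd_listener_scheme() {
  local manifest=${1:-/etc/kubernetes/manifests/etcd.yaml}
  # "--" keeps grep from parsing the leading dashes of the pattern as options
  if grep -q -- '--listen-client-urls=https://' "$manifest"; then
    echo tls
  else
    echo plain
  fi
}
```

This is useful during a botched rollback to know whether the manifest on disk still matches the TLS state the apiserver expects.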
1.9.6 cluster on Ubuntu 17.10 Vagrant:
root@vagrant:/vagrant# 1.10_kubernetes/server/bin/kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests262738652/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [vagrant] and IPs [10.0.2.15]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug=""]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: net/http: TLS handshake timeout]
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""]
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state
This is my repro environment: https://github.com/stealthybox/vagrant-kubeadm-testing
Change these lines to 1.9.6-00 to bootstrap: https://github.com/stealthybox/vagrant-kubeadm-testing/blob/9d4493e990c9bd742107b317641267c3ef3640cd/Vagrantfile#L18-L20
Then download the 1.10 server binaries into the repo, and they will be available to the guests in /vagrant
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#server-binaries
etcd-related kubelet logs:
root@vagrant:~# journalctl -xefu kubelet | grep -i etcd
Mar 28 16:32:07 vagrant kubelet[14676]: W0328 16:32:07.808776 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:32:07 vagrant kubelet[14676]: I0328 16:32:07.880412 14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:34:27 vagrant kubelet[14676]: W0328 16:34:27.472534 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:57:33 vagrant kubelet[14676]: W0328 16:57:33.683648 14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725564 14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725637 14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:35 vagrant kubelet[14676]: E0328 16:57:35.484901 14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)", container etcd: selfLink was empty, can't make reference
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889458 14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889595 14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd" (OuterVolumeSpecName: "etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320"). InnerVolumeSpecName "etcd". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.989892 14676 reconciler.go:297] Volume detached for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") on node "vagrant" DevicePath ""
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.688878 14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Timeout: request did not complete within allowed duration
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.841447 14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfbc5", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-certs\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:58:33 vagrant kubelet[14676]: E0328 16:58:33.844276 14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfb82", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-data\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.692450 14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.848007 14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff641f915f", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"7278f85057e8bf5cb81c9f96d3b25320", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.containers{etcd}"}, Reason:"Killing", Message:"Killing container with id docker://etcd:Need to kill Pod", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.472661 14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.473138 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473190 14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473658 14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: W0328 16:59:15.481336 14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.483705 14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.497391 14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:00:34 vagrant kubelet[14676]: W0328 17:00:34.475851 14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.720076 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.720107 14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""; some request body already written
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.725335 14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: I0328 17:01:07.728709 14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.734475 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.740642 14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:09 vagrant kubelet[14676]: E0328 17:01:09.484412 14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)", container etcd: selfLink was empty, can't make reference
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.848794 14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849282 14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849571 14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data" (OuterVolumeSpecName: "etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-data". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849503 14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs" (OuterVolumeSpecName: "etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-certs". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949925 14676 reconciler.go:297] Volume detached for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") on node "vagrant" DevicePath ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949975 14676 reconciler.go:297] Volume detached for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") on node "vagrant" DevicePath ""
The current workaround is to keep retrying the upgrade; at some point it will succeed.
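This retry workaround can be scripted. A minimal sketch (retry is a hypothetical helper, not a kubeadm feature; the attempt count and delay are arbitrary):

```shell
#!/usr/bin/env bash
# Re-run a command until it succeeds, up to a fixed number of attempts.
retry() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt $i/$attempts failed, retrying..." >&2
    sleep "${RETRY_DELAY:-30}"
  done
  return 1
}

# e.g.: retry 10 kubeadm upgrade apply v1.10.0
```

Note that this only helps with the timeout flavor of the failure; if the rollback itself breaks the control plane, retrying won't recover it.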
@stealthybox Did you happen to capture the docker logs for the etcd container? Also, grep -i etcd may be masking some of the kubelet output, e.g. some error messages that don't have the container name in them but are still relevant.
I just found another weird edge case related to this bug. The kubeadm upgrade marks the etcd upgrade as complete before the new etcd image is pulled and the new static pod is deployed. This causes the upgrade to time out at a later step, and the upgrade rollback fails. It also leaves the cluster in a broken state. Restoring the original etcd static pod manifest was required to recover the cluster.
Oh yes, I got stuck there too. My cluster went completely down. Can someone share some instructions on how to recover from this state?
Been there on my second attempt to upgrade, as described by @detiber; quite painful. :cry:
Found some things backed up in /etc/kubernetes/tmp; figuring that etcd might be the culprit, I copied its old manifest over the new one in the manifests folder. At that point I had nothing to lose, since I had completely lost control of the cluster. Then, I don't remember exactly, but I think I restarted the whole machine and then downgraded everything back to v1.9.6. Finally, I got control of the cluster back and lost all motivation to mess with v1.10.0 again. It was not fun at all...
If you restore the etcd static pod manifest from /etc/kubernetes/tmp, it's important to also revert the apiserver manifest to the 1.9 version, because of the new TLS configuration in 1.10.
^ You may not need to do this, since I believe the etcd upgrade blocks the rest of the controlplane upgrade.
It seems that only the etcd manifest couldn't be rolled back on the failed upgrade; everything else was fine. After moving the backup manifest back and restarting the kubelet, everything was fine again.
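That manual recovery can be sketched roughly as follows (assumptions: the backup lives in a kubeadm-backup-manifests* directory under /etc/kubernetes/tmp as in the logs above, the kubelet is systemd-managed, and restore_etcd_manifest is an illustrative name, not a real command):

```shell
#!/usr/bin/env bash
# Copy the backed-up etcd manifest over the upgraded one so the kubelet
# recreates the pre-upgrade etcd static pod.
restore_etcd_manifest() {
  local root=${1:-/etc/kubernetes}
  local backup
  # kubeadm names the backup dir with a random suffix; pick the newest one.
  backup=$(ls -dt "$root"/tmp/kubeadm-backup-manifests*/ 2>/dev/null | head -1)
  [ -n "$backup" ] || return 1
  cp "${backup}etcd.yaml" "$root/manifests/etcd.yaml"
}

# restore_etcd_manifest && systemctl restart kubelet
```

As noted above, if the apiserver manifest was also left in a mixed state, it may need to be reverted the same way.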
I hit the same timeout issue, and kubeadm rolled the kube-apiserver manifest back to 1.9.6 but left the etcd manifest as-is (read: with TLS enabled), obviously causing the apiserver to fail completely, effectively breaking my master node. A good candidate for a separate issue report, I guess.
@dvdmuckle @codepainters, unfortunately whether the rollback succeeds depends on which component hits the race condition (etcd or the api server). I found a fix for the race condition, but it completely breaks the kubeadm upgrade. I'm working with @stealthybox to try and find the right path forward to properly fix the upgrade.
@codepainters I think it's the same issue.
There are a couple of underlying issues causing this problem:
As a result, the upgrade currently only succeeds if there happens to be a pod status update for the etcd pod that causes the hash to change before the kubelet picks up the new static manifest for etcd. Additionally, the api server needs to remain available during the first part of the apiserver upgrade, while the upgrade tooling queries the api before updating the apiserver manifest.
@detiber and I got on a call to discuss the changes we need to make to the upgrade process.
We plan to implement 3 fixes for this bug in a 1.10.x patch release:
Remove etcd TLS from the upgrade.
The current upgrade loop performs batched modifications serially, per component.
Upgrading a component has no knowledge of the configuration of dependent components.
Verifying the upgrade requires the APIServer to be available to check pod status.
Etcd TLS requires a combined etcd + apiserver configuration change, which breaks this contract.
This is the minimal change that can be made to fix this issue, and it leaves upgraded clusters with insecure etcd.
Fix the mirror-pod hash race condition on pod status changes.
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/upgrade/staticpods.go#L189.
Upgrades will now be correct, assuming compatibility between the etcd and apiserver flags.
Upgrade TLS specifically in a separate phase.
Etcd and the APIServer need to be upgraded together.
kubeadm alpha phase ensure-etcd-tls?
This phase should be runnable independently of a cluster upgrade.
During a cluster upgrade, this phase should run before updating all of the components.
For 1.11 we want to:
alternative: Use the CRI to get pod info (demo-able using crictl).
caveat: the CRI on dockershim (and possibly other container runtimes) doesn't currently support backward compatibility for CRI-breaking changes.
PR to address the static pod update race condition: https://github.com/kubernetes/kubernetes/pull/61942
cherry-pick PR for the release-1.10 branch: https://github.com/kubernetes/kubernetes/pull/61954
@detiber would you mind explaining what race condition we're talking about? I'm not very familiar with kubeadm internals, but it sounds interesting.
@codepainters see https://github.com/kubernetes/kubeadm/issues/740#issuecomment-377263347
FYI - hit the same issue upgrading from 1.9.3.
Tried the workaround of retrying multiple times. Eventually hit the race condition with the API server, and the upgrade couldn't be rolled back.
@stealthybox thx, I didn't get it on first read.
I'm hitting the same issue.. [ERROR APIServerHealth]: the API Server is unhealthy; /healthz didn't return "ok"
[ERROR MasterNodesReady]: couldn't list masters in cluster: Get https....... while upgrading. Please help me with this. I'm upgrading from 1.9.3 to 1.10.0. Initially, it was able to get to a certain point of "[upgrade/staticpods] Waiting for the kubelet to restart the component".
The workaround for now is to ensure the certs and upgrade the etcd and apiserver pods yourself, bypassing the checks.
Make sure to check your Config and add any flags for your use case:
kubectl -n kube-system edit cm kubeadm-config  # change featureGates
...
featureGates: {}
...
kubeadm alpha phase certs all
kubeadm alpha phase etcd local
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config
Thanks @stealthybox
For me the upgrade apply process got stuck at [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"... but the cluster was upgraded successfully.
@stealthybox I'm not sure, but it seems something is broken after these steps, because kubeadm upgrade plan hangs afterwards:
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.10.1
[upgrade/versions] kubeadm version: v1.10.1
[upgrade/versions] Latest stable version: v1.10.1
When applying the update I have [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"... hanging as well.
@kvaps @stealthybox this is most likely the etcd issue (kubeadm talking plain HTTP/2 to TLS-enabled etcd), I hit it too. See this other issue: https://github.com/kubernetes/kubeadm/issues/755
Honestly, I can't understand why the same TCP port is used for both TLS and non-TLS etcd listeners; it only causes issues like this one. Getting a plain old _connection refused_ would give an immediate hint, while here I had to resort to tcpdump to understand what was going on.
OH!
Shoot, you're right: it only works with my local TLS patch for the etcd status check.
Do this to complete the upgrade:
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config
edited the workaround above to be correct
@stealthybox the second kubeadm command doesn't work:
# kubeadm alpha phase upload-config
The --config flag is mandatory
@renich just give it the path of your config file
If you're not using any custom settings, you can pass it an empty file.
Here's a simple way to do that in bash:
1.10_kubernetes/server/bin/kubeadm alpha phase upload-config --config <(echo)
This should now be resolved with the merge of https://github.com/kubernetes/kubernetes/pull/62655 and will be part of the v1.10.2 release.
I can confirm that the 1.10.0 -> 1.10.2 upgrade with kubeadm 1.10.2 is smooth, with no timeouts.
I still get a timeout on 1.10.0 -> 1.10.2, but a different one:
[upgrade/staticpods] Waiting for the kubelet to restart the component
Static pod: kube-apiserver-master hash: a273591d3207fcd9e6fd0c308cc68d64
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]
I'm not sure what I should do...
@denis111 check the API server logs during the upgrade using docker ps. I have a feeling you might be hitting the same issue I did.
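A minimal way to do that while the upgrade runs (a sketch; the container name filter assumes a kubeadm static-pod setup with the Docker runtime):

```shell
# Find the kube-apiserver container the kubelet started
APISERVER=$(docker ps --filter name=kube-apiserver --format '{{.ID}}' | head -n1)

# Follow its logs while kubeadm performs the upgrade
docker logs -f "$APISERVER"
```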
@dvdmuckle Yes, I don't see any errors in those logs, only entries starting with I and a few with W.
And I think the hash of kube-apiserver doesn't change during the upgrade.
I have an ARM64 cluster on 1.9.3 that I successfully updated to 1.9.7, but I hit the same timeout issue upgrading from 1.9.7 to 1.10.2.
I even tried editing and recompiling kubeadm to increase the timeouts (like this recent commit: https://github.com/anguslees/kubernetes/commits/kubeadm-gusfork) with the same results.
$ sudo kubeadm upgrade apply v1.10.2 --force
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.2"
[upgrade/versions] Cluster version: v1.9.7
[upgrade/versions] kubeadm version: v1.10.2-dirty
[upgrade/version] Found 1 potential version compatibility errors but skipping since the --force flag is set:
- Specified version to upgrade to "v1.10.2" is higher than the kubeadm version "v1.10.2-dirty". Upgrade kubeadm first using the tool you used to install kubeadm
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.2"...
Static pod: kube-apiserver-kubemaster1 hash: ed7578d5bf9314188dca798386bcfb0e
Static pod: kube-controller-manager-kubemaster1 hash: e0c3f578f1c547dcf9996e1d3390c10c
Static pod: kube-scheduler-kubemaster1 hash: 52e767858f52ac4aba448b1a113884ee
[upgrade/etcd] Upgrading to TLS for etcd
Static pod: etcd-kubemaster1 hash: 413224efa82e36533ce93e30bd18e3a8
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/etcd.yaml"
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing etcd/server certificate and key.
[certificates] Using the existing etcd/peer certificate and key.
[certificates] Using the existing etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/etcd.yaml"
[upgrade/staticpods] Not waiting for pod-hash change for component "etcd"
[upgrade/etcd] Waiting for etcd to become available
[util/etcd] Waiting 30s for initial delay
[util/etcd] Attempting to get etcd status 1/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 2/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 3/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 4/10
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-scheduler.yaml"
[upgrade/staticpods] The etcd manifest will be restored if component "kube-apiserver" fails to upgrade
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing apiserver-etcd-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]
Upgrading v1.10.2 -> v1.10.2 (which may not make much sense; I was just testing...).
Ubuntu 16.04.
And it fails with an error:
kubeadm upgrade apply v1.10.2
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]
I wonder whether this is still being tracked in some issue... I couldn't find one.
I also see the upgrade still failing with a timed out waiting for the condition error.
Edit: discussion moved to the new ticket https://github.com/kubernetes/kubeadm/issues/850 , please discuss there.
If anyone else is running into this issue on 1.9.x:
If you are on AWS with custom hostnames, you need to edit the kubeadm-config ConfigMap and set nodeName to the AWS internal name: ip-xx-xx-xx-xx.$REGION.compute.internal
kubectl -n kube-system edit cm kubeadm-config -oyaml
This is in addition to setting the etcd client to http. I haven't used the later versions to see if they fixed that.
This is because kubeadm tries to read this path in the API: /api/v1/namespaces/kube-system/pods/kube-apiserver-$NodeName
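On an AWS master, the internal name the node should use can be read from the instance metadata service, and the pod path kubeadm reads can be checked directly (a diagnostic sketch; the IMDSv1 endpoint is assumed to be reachable from the instance):

```shell
# The internal hostname that should go into nodeName
curl -s http://169.254.169.254/latest/meta-data/local-hostname

# Verify that the static-pod name kubeadm looks up actually resolves
kubectl -n kube-system get pod \
  "kube-apiserver-$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)"
```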
Since the timeouts were increased in 1.10.6, I successfully updated my 1.9.7 deployment to 1.10.6 a few weeks ago.
Planning to upgrade to 1.11.2 as soon as the .deb packages are ready, since the same change is in that release.
My cluster runs locally on ARM64 boards.
Most helpful comment
A temporary workaround is to ensure the certificates and upgrade the etcd and apiserver pods manually, skipping the checks.
Be sure to check your config and add any flags for your use case:
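The exact commands from that comment are not reproduced here; as a hedged sketch, a manual cert/etcd/control-plane refresh with the kubeadm 1.10 alpha phases looks roughly like this (config.yaml stands in for your own kubeadm configuration; verify the phase names against your kubeadm version):

```shell
# Hypothetical reconstruction of a manual refresh that bypasses the upgrade checks
kubeadm alpha phase certs all --config config.yaml
kubeadm alpha phase etcd local --config config.yaml
kubeadm alpha phase controlplane all --config config.yaml
kubeadm alpha phase upload-config --config config.yaml
```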