Kubeadm: Upgrading from 1.9.6 to 1.10.0 fails with timeout

Created on 28 Mar 2018  ·  42 Comments  ·  Source: kubernetes/kubeadm

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:

Scaleway baremetal C2S

  • OS (e.g. from /etc/os-release):

Ubuntu Xenial (16.04 LTS) (GNU/Linux 4.4.122-mainline-rev1 x86_64)

  • Kernel (e.g. uname -a):

Linux amd64-master-1 4.4.122-mainline-rev1 #1 SMP Sun Mar 18 10:44:19 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

What happened?

Trying to upgrade from 1.9.6 to 1.10.0, I get this error:

kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests411909119/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [arm-master-1] and IPs [10.1.244.57]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests180476754/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state

What you expected to happen?

Successful upgrade

How to reproduce it (as minimally and precisely as possible)?

Install the 1.9.6 packages and init a 1.9.6 cluster:

curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | tee /etc/apt/sources.list.d/kubernetes.list
apt-get update -qq
apt-get install -qy kubectl=1.9.6-00
apt-get install -qy kubelet=1.9.6-00
apt-get install -qy kubeadm=1.9.6-00

Edit the kubeadm-config and change featureGates from a string to a map as reported in https://github.com/kubernetes/kubernetes/issues/61764.

kubectl -n kube-system edit cm kubeadm-config

....
featureGates: {}
....

Download kubeadm 1.10.0 and run kubeadm upgrade plan and kubeadm upgrade apply v1.10.0.
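
For reference, a minimal sketch of that last step, assuming the standard kubernetes-release download layout (the URL is my assumption, not taken from this report):

curl -Lo kubeadm https://storage.googleapis.com/kubernetes-release/release/v1.10.0/bin/linux/amd64/kubeadm   # fetch the 1.10.0 kubeadm binary
chmod +x kubeadm
./kubeadm upgrade plan            # run the new binary against the 1.9.6 cluster
./kubeadm upgrade apply v1.10.0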

kind/bug priority/critical-urgent triaged

Most helpful comment

A temporary workaround is to ensure the certs and upgrade the etcd and apiserver pods, bypassing the checks.

Make sure to check your config and add any flags for your use case:

kubectl -n kube-system edit cm kubeadm-config  # change featureFlags
...
  featureGates: {}
...
kubeadm alpha phase certs all
kubeadm alpha phase etcd local
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config

All 42 comments

Working on reproducing this bug locally.

After retrying this about 10 times it finally succeeded.

Here is the diff of my etcd manifests
root@vagrant:~# diff /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml
16,17c16,17
<     - --listen-client-urls=https://127.0.0.1:2379
<     - --advertise-client-urls=https://127.0.0.1:2379
---
>     - --listen-client-urls=http://127.0.0.1:2379
>     - --advertise-client-urls=http://127.0.0.1:2379
19,27c19
<     - --key-file=/etc/kubernetes/pki/etcd/server.key
<     - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
<     - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
<     - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
<     - --client-cert-auth=true
<     - --peer-client-cert-auth=true
<     - --cert-file=/etc/kubernetes/pki/etcd/server.crt
<     - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
<     image: gcr.io/google_containers/etcd-amd64:3.1.12
---
>     image: gcr.io/google_containers/etcd-amd64:3.1.11
29,35d20
<       exec:
<         command:
<         - /bin/sh
<         - -ec
<         - ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
<           --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
<           get foo
36a22,26
>       httpGet:
>         host: 127.0.0.1
>         path: /health
>         port: 2379
>         scheme: HTTP
43,45c33
<       name: etcd-data
<     - mountPath: /etc/kubernetes/pki/etcd
<       name: etcd-certs
---
>       name: etcd
51,55c39
<     name: etcd-data
<   - hostPath:
<       path: /etc/kubernetes/pki/etcd
<       type: DirectoryOrCreate
<     name: etcd-certs
---
>     name: etcd

root@vagrant:~# ls /etc/kubernetes/pki/etcd
ca.crt  ca.key  healthcheck-client.crt  healthcheck-client.key  peer.crt  peer.key  server.crt  server.key

1.9.6 cluster on Ubuntu 17.10 Vagrant:

root@vagrant:/vagrant# 1.10_kubernetes/server/bin/kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests262738652/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [vagrant] and IPs [10.0.2.15]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug=""]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: net/http: TLS handshake timeout]
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""]
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state

Here is my repro environment: https://github.com/stealthybox/vagrant-kubeadm-testing

Change these lines to 1.9.6-00 for bootstrapping: https://github.com/stealthybox/vagrant-kubeadm-testing/blob/9d4493e990c9bd742107b317641267c3ef3640cd/Vagrantfile#L18-L20

Then download the 1.10 server binaries into the repo, and they will be available to the guests in /vagrant
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#server-binaries
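
A hedged sketch of that download step (the tarball URL follows the pattern used in the CHANGELOG linked above and is an assumption here); renaming the extracted directory to 1.10_kubernetes matches the paths used in the runs below:

curl -LO https://dl.k8s.io/v1.10.0/kubernetes-server-linux-amd64.tar.gz   # fetch the 1.10.0 server tarball
tar -xzf kubernetes-server-linux-amd64.tar.gz                             # unpacks into kubernetes/server/bin/
mv kubernetes 1.10_kubernetes                                             # the repo dir is shared with the guests as /vagrant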

etcd-related kubelet logs:

root@vagrant:~# journalctl -xefu kubelet | grep -i etcd
Mar 28 16:32:07 vagrant kubelet[14676]: W0328 16:32:07.808776   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:32:07 vagrant kubelet[14676]: I0328 16:32:07.880412   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:34:27 vagrant kubelet[14676]: W0328 16:34:27.472534   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:57:33 vagrant kubelet[14676]: W0328 16:57:33.683648   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725564   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725637   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:35 vagrant kubelet[14676]: E0328 16:57:35.484901   14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)", container etcd: selfLink was empty, can't make reference
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889458   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889595   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd" (OuterVolumeSpecName: "etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320"). InnerVolumeSpecName "etcd". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.989892   14676 reconciler.go:297] Volume detached for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") on node "vagrant" DevicePath ""
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.688878   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Timeout: request did not complete within allowed duration
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.841447   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfbc5", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-certs\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:58:33 vagrant kubelet[14676]: E0328 16:58:33.844276   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfb82", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-data\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.692450   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.848007   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff641f915f", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"7278f85057e8bf5cb81c9f96d3b25320", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.containers{etcd}"}, Reason:"Killing", Message:"Killing container with id docker://etcd:Need to kill Pod", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.472661   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.473138   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473190   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473658   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: W0328 16:59:15.481336   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.483705   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.497391   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:00:34 vagrant kubelet[14676]: W0328 17:00:34.475851   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.720076   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.720107   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""; some request body already written
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.725335   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: I0328 17:01:07.728709   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.734475   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.740642   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:09 vagrant kubelet[14676]: E0328 17:01:09.484412   14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)", container etcd: selfLink was empty, can't make reference
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.848794   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849282   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849571   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data" (OuterVolumeSpecName: "etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-data". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849503   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs" (OuterVolumeSpecName: "etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-certs". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949925   14676 reconciler.go:297] Volume detached for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") on node "vagrant" DevicePath ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949975   14676 reconciler.go:297] Volume detached for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") on node "vagrant" DevicePath ""

The current workaround is to keep retrying the upgrade, and at some point it will succeed.
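
Purely as an illustration of that workaround (my sketch, not from the thread), retrying until the race is won could look like this, assuming kubeadm 1.10.0 is installed and supports the non-interactive -y flag:

# keep re-running the upgrade until it happens to win the race
until kubeadm upgrade apply v1.10.0 -y; do
  echo "upgrade failed, retrying in 30s..."
  sleep 30
done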

@stealthybox did you happen to get the logs out of docker for the etcd container? Also, grep -i etcd may be masking some kubelet output, e.g. some error messages that don't have the container name in them but are still relevant.

I just found another strange edge case related to this bug. The kubeadm upgrade marks the etcd upgrade as complete before the new etcd image is pulled and the new static pod is applied. This causes the upgrade to time out at a later step and the upgrade rollback to fail. It also leaves the cluster in a broken state. Restoring the original etcd static pod manifest is required to recover the cluster.

Oh yes, I'm stuck there too, my cluster is completely down. Can someone share some instructions on how to recover from this state?

Been there on my second attempt to upgrade, as @detiber described, quite painful. :cry:

Found some things backed up in /etc/kubernetes/tmp; suspecting that etcd might be the culprit, I copied its old manifest over the new one in the manifests folder. At that point I had nothing to lose, since I had completely lost control of the cluster anyway. Then, I don't remember exactly, but I think I restarted the whole machine and downgraded everything back to v1.9.6. In the end, I got control of the cluster back and lost any motivation to mess with v1.10.0 again. It was not fun at all...

If you are restoring the etcd static pod manifest from /etc/kubernetes/tmp, it is important to also restore the apiserver manifest to the 1.9 version, because of the new TLS configuration in 1.10.

^ You probably don't need to do this, because I believe the etcd upgrade blocks the rest of the control plane upgrade.

It seems only the etcd manifest fails to get rolled back on a failed upgrade, everything else is fine. After moving the backup manifest back and restarting the kubelet, everything was fine again.
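
A minimal recovery sketch based on the comments above; the tmp directory name differs per run, so the <ID> below is a placeholder you have to look up yourself:

ls /etc/kubernetes/tmp                                     # find your kubeadm-backup-manifests<ID> directory
cp /etc/kubernetes/tmp/kubeadm-backup-manifests<ID>/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
systemctl restart kubelet                                  # the kubelet recreates the etcd static pod from the restored manifest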

I hit the same timeout issue and kubeadm rolled the kube-apiserver manifest back to 1.9.6, but left the etcd manifest as is (read: with TLS enabled), obviously causing the apiserver to fail completely, effectively breaking my master node. A good candidate for a separate issue report, I guess.

@dvdmuckle @codepainters, unfortunately whether the rollback succeeds depends on which component hits the race condition (etcd or the api server). I found a fix for the race condition, but it completely breaks the kubeadm upgrade. I'm working with @stealthybox to try to find the proper path for fixing the upgrade correctly.

@codepainters I think it is the same issue.

There are a couple of underlying issues causing this problem:

  • The upgrade process generates a mirror pod hash for each component from the result of querying the mirror pod from the API. The upgrade then tests whether this hash value changes to determine whether the pod was updated from the static manifest change. The hash value includes fields that can mutate for reasons other than a static manifest change (such as pod status updates). If the pod status changes between the hash comparisons, the upgrade proceeds to the next component prematurely.
  • The upgrade updates the etcd static pod manifest (including adding TLS security to etcd) and attempts to use the apiserver to verify that the pod has been updated, however the apiserver manifest has not yet been updated at this point to use TLS to communicate with etcd.

As a result, the upgrade currently only succeeds if there is a pod status update for the etcd pod that causes the hash to change before the kubelet picks up the new static manifest for etcd. Additionally, the api server needs to remain available for the first part of the apiserver upgrade, while the upgrade tooling queries the api before updating the apiserver manifest.

@detiber and I got on a call to discuss the changes we need to make to the upgrade process.
We plan to implement 3 fixes for this bug in a 1.10.x patch release:

  • Remove etcd TLS from upgrades.
    The current upgrade loop performs batched modifications per component serially.
    Upgrading a component has no knowledge of the configuration of dependent components.
    Verifying an upgrade requires the APIServer to be available to check pod status.
    Etcd TLS requires a coupled etcd+apiserver configuration change, which breaks this contract.
    This is the minimal change that fixes this issue, and it leaves upgraded clusters with insecure etcd.

  • Fix the mirror-pod hash race condition on pod status changes.
    https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/upgrade/staticpods.go#L189.
    Upgrades will now be correct assuming compatibility between the etcd and apiserver flags.

  • Upgrade TLS specifically in a separate phase.
    Etcd and the APIServer need to be upgraded together.
    kubeadm alpha phase ensure-etcd-tls ?.
    This phase should be runnable independently of a cluster upgrade.
    During a cluster upgrade, this phase should run before updating all of the components.


For 1.11 we want to:

  • Use the kubelet API to check the runtime of upgraded static pods.
    Relying on the apiserver and etcd to monitor local processes, as we are currently doing, is undesirable.
    A local source of data about pods is better than relying on higher-order distributed kubernetes components.
    This will replace the current pod runtime checks in the upgrade loop.
    This will allow us to add checks to the ensure-etcd-tls phase.

alternative: use the CRI to get pod info (demoable using crictl).
caveat: the CRI on dockershim and possibly other container runtimes does not currently support backward compatibility for CRI breaking changes.
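
For illustration, querying the local runtime with crictl might look like the sketch below (assumes crictl is installed and pointed at the node's CRI socket; this is not something kubeadm does today):

crictl pods --name etcd   # list the etcd static pod straight from the container runtime
crictl ps --name etcd     # and its container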

TODO:

  • [ ] open issues for these 4 changes and link them here.

PR to address the static pod update race condition: https://github.com/kubernetes/kubernetes/pull/61942
cherry-pick PR for the release-1.10 branch: https://github.com/kubernetes/kubernetes/pull/61954

@detiber would you mind explaining what race condition we're talking about? I'm not that familiar with kubeadm internals, but it sounds interesting.

FYI - hit the same issue upgrading from 1.9.3.
Tried the workaround of retrying several times. Eventually hit the race condition with the API server and the upgrade could not be rolled back.

@stealthybox thx, I didn't get it on first read.

I am facing the same issue.. [ERROR APIServerHealth]: the API Server is unhealthy; /healthz didn't return "ok"
[ERROR MasterNodesReady]: couldn't list masters in cluster: Get https....... while upgrading. Please help me with this. I am upgrading from 1.9.3 to 1.10.0. Initially, it was able to reach a certain point, "[upgrade/staticpods] Waiting for the kubelet to restart the component".

A temporary workaround is to ensure the certs and upgrade the etcd and apiserver pods, bypassing the checks.

Make sure to check your config and add any flags for your use case:

kubectl -n kube-system edit cm kubeadm-config  # change featureFlags
...
  featureGates: {}
...
kubeadm alpha phase certs all
kubeadm alpha phase etcd local
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config
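
As a hedged follow-up check (my addition, not part of the original comment), you can confirm etcd is now serving TLS using the healthcheck-client cert generated by the certs phase:

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health   # should report 127.0.0.1:2379 is healthy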

Thanks @stealthybox
For me the upgrade apply process hangs at [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"... but the cluster was upgraded successfully.

@stealthybox I'm not sure, but it looks like something is broken after these steps, because kubeadm upgrade plan hangs after that:

[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.10.1
[upgrade/versions] kubeadm version: v1.10.1
[upgrade/versions] Latest stable version: v1.10.1

When applying the upgrade I got a hang at [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"... as well

@kvaps @stealthybox this is most likely an etcd issue (kubeadm speaking plain HTTP/2 to a TLS-enabled etcd), I hit it too. See this other issue: https://github.com/kubernetes/kubeadm/issues/755

Honestly, I don't understand why the same TCP port is used for the TLS and non-TLS etcd listeners, it only causes problems like this one. A plain old _connection refused_ would give an immediate hint, whereas here I had to resort to tcpdump to understand what was going on.

OH!
Shoot, you're right, that only works with my local TLS patch for the etcd status check.

Do this to finish the upgrade:

kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config

edited the above workaround to be correct
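
A quick, hedged sanity check after running those phases (my suggestion, assuming kubectl access on the master):

kubectl -n kube-system get pods -o wide   # apiserver/etcd static pods should be back up
kubectl version --short                   # the server should now report the new version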

@stealthybox the second kubeadm command doesn't work:

# kubeadm alpha phase upload-config
The --config flag is mandatory

@renich just give it the path to your config file

If you're not using any custom settings, you can give it an empty file.
Here's a simple way to do that in bash:

1.10_kubernetes/server/bin/kubeadm alpha phase upload-config --config <(echo)

This should now be resolved with the merge of https://github.com/kubernetes/kubernetes/pull/62655 and will be part of the v1.10.2 release.

I can confirm that the 1.10.0 -> 1.10.2 upgrade with kubeadm 1.10.2 went smoothly, no timeouts.

I still get a timeout on 1.10.0 -> 1.10.2, but a different one:
[upgrade/staticpods] Waiting for the kubelet to restart the component
Static pod: kube-apiserver-master hash: a273591d3207fcd9e6fd0c308cc68d64
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

I'm not sure what I should do...

@denis111 check the API server logs while doing the upgrade using docker ps. I have a feeling you might be hitting an issue I'm also hitting.

@dvdmuckle Well, I don't see any errors in those logs, only entries starting with I and a few with W.
And I think the hash of kube-apiserver doesn't change during the upgrade.

I have an ARM64 cluster on 1.9.3 and successfully updated it to 1.9.7, but got the same timeout issue upgrading from 1.9.7 to 1.10.2.

I even tried editing and recompiling kubeadm to increase the timeouts (like this latest commit https://github.com/anguslees/kubernetes/commits/kubeadm-gusfork) with the same results.

$ sudo kubeadm upgrade apply  v1.10.2 --force
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.2"
[upgrade/versions] Cluster version: v1.9.7
[upgrade/versions] kubeadm version: v1.10.2-dirty
[upgrade/version] Found 1 potential version compatibility errors but skipping since the --force flag is set:

   - Specified version to upgrade to "v1.10.2" is higher than the kubeadm version "v1.10.2-dirty". Upgrade kubeadm first using the tool you used to install kubeadm
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.2"...
Static pod: kube-apiserver-kubemaster1 hash: ed7578d5bf9314188dca798386bcfb0e
Static pod: kube-controller-manager-kubemaster1 hash: e0c3f578f1c547dcf9996e1d3390c10c
Static pod: kube-scheduler-kubemaster1 hash: 52e767858f52ac4aba448b1a113884ee
[upgrade/etcd] Upgrading to TLS for etcd
Static pod: etcd-kubemaster1 hash: 413224efa82e36533ce93e30bd18e3a8
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/etcd.yaml"
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing etcd/server certificate and key.
[certificates] Using the existing etcd/peer certificate and key.
[certificates] Using the existing etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/etcd.yaml"
[upgrade/staticpods] Not waiting for pod-hash change for component "etcd"
[upgrade/etcd] Waiting for etcd to become available
[util/etcd] Waiting 30s for initial delay
[util/etcd] Attempting to get etcd status 1/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 2/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 3/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 4/10
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-scheduler.yaml"
[upgrade/staticpods] The etcd manifest will be restored if component "kube-apiserver" fails to upgrade
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing apiserver-etcd-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

Upgrading v1.10.2 -> v1.10.2 (which possibly doesn't make sense. Just testing...)

Ubuntu 16.04.

And it fails with an error.

kubeadm upgrade apply v1.10.2

[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

I wonder if this is still being tracked in some issue... can't find one.

I also see upgrades still failing with the timed out waiting for the condition error.

Edit: Discussion moved to the new ticket https://github.com/kubernetes/kubeadm/issues/850, please discuss there.

If anyone else is running into this with 1.9.x:

If you are on aws with custom hostnames, you need to edit the kubeadm-config configmap and set nodeName to the aws internal name (ip-xx-xx-xx-xx.$REGION.compute.internal)

kubectl -n kube-system edit cm kubeadm-config -oyaml

This is in addition to setting the etcd client to http. I haven't used the later versions yet to see if they fixed it.

This is because kubeadm tries to read this path in the api: /api/v1/namespaces/kube-system/pods/kube-apiserver-$NodeName
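
For illustration (the hostname value below is hypothetical), the internal name can be read from the instance metadata and then set as nodeName in the MasterConfiguration stored in that ConfigMap:

curl -s http://169.254.169.254/latest/meta-data/local-hostname   # prints the AWS-internal DNS name
# then in kubectl -n kube-system edit cm kubeadm-config, set for example:
#   nodeName: ip-10-0-12-34.us-east-1.compute.internal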

Since the timeouts were increased in 1.10.6, I successfully updated my 1.9.7 deployment to 1.10.6 a few weeks ago.

Planning to upgrade to 1.11.2 as soon as the .deb packages are ready, since the same change is in that version.

My cluster runs locally on ARM64 boards.
