Kubeadm: Upgrading from 1.9.6 to 1.10.0 fails with timeout

Created on 28 Mar 2018 · 42 Comments · Source: kubernetes/kubeadm

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:

Scaleway baremetal C2S

  • OS (e.g. from /etc/os-release):

Ubuntu Xenial (16.04 LTS) (GNU/Linux 4.4.122-mainline-rev1 x86_64 )

  • Kernel (e.g. uname -a):

Linux amd64-master-1 4.4.122-mainline-rev1 #1 SMP Sun Mar 18 10:44:19 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

What happened?

Trying to upgrade from 1.9.6 to 1.10.0, I'm getting this error:

kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests411909119/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [arm-master-1] and IPs [10.1.244.57]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests180476754/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state

What you expected to happen?

Successful upgrade

How to reproduce it (as minimally and precisely as possible)?

Install 1.9.6 packages and init a 1.9.6 cluster:

curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | tee /etc/apt/sources.list.d/kubernetes.list
apt-get update -qq
apt-get install -qy kubectl=1.9.6-00
apt-get install -qy kubelet=1.9.6-00
apt-get install -qy kubeadm=1.9.6-00

Edit the kubeadm-config ConfigMap and change featureGates from a string to a map, as reported in https://github.com/kubernetes/kubernetes/issues/61764.

kubectl -n kube-system edit cm kubeadm-config

....
featureGates: {}
....

Download kubeadm 1.10.0 and run kubeadm upgrade plan and kubeadm upgrade apply v1.10.0.
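A hedged sketch of that last step, assuming the usual release-bucket URL layout and that you want to run the 1.10.0 binary alongside the installed 1.9.6 packages:

```
# download the v1.10.0 kubeadm binary (URL pattern assumed from the official release bucket)
curl -fsSL -o /usr/local/bin/kubeadm-1.10.0 \
  https://storage.googleapis.com/kubernetes-release/release/v1.10.0/bin/linux/amd64/kubeadm
chmod +x /usr/local/bin/kubeadm-1.10.0

# then plan and apply the upgrade with the new binary
/usr/local/bin/kubeadm-1.10.0 upgrade plan
/usr/local/bin/kubeadm-1.10.0 upgrade apply v1.10.0
```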

kind/bug priority/critical-urgent triaged

Most helpful comment

A temporary workaround is to ensure the certs and upgrade the etcd and apiserver pods manually, bypassing the checks.

Be sure to check your Config and add any flags for your use case:

kubectl -n kube-system edit cm kubeadm-config  # change featureGates
...
  featureGates: {}
...
kubeadm alpha phase certs all
kubeadm alpha phase etcd local
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config

All 42 comments

Working on reproducing this bug locally.

After retrying this about 10 times, it finally worked.

Here's my etcd manifest diff
```
root@vagrant:~# diff /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml
16,17c16,17
< - --listen-client-urls=https://127.0.0.1:2379
< - --advertise-client-urls=https://127.0.0.1:2379
---
> - --listen-client-urls=http://127.0.0.1:2379
> - --advertise-client-urls=http://127.0.0.1:2379
19,27c19
< - --key-file=/etc/kubernetes/pki/etcd/server.key
< - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
< - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
< - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
< - --client-cert-auth=true
< - --peer-client-cert-auth=true
< - --cert-file=/etc/kubernetes/pki/etcd/server.crt
< - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
< image: gcr.io/google_containers/etcd-amd64:3.1.12
---
> image: gcr.io/google_containers/etcd-amd64:3.1.11
29,35d20
< exec:
< command:
< - /bin/sh
< - -ec
< - ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
<   --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
<   get foo
36a22,26
> httpGet:
>   host: 127.0.0.1
>   path: /health
>   port: 2379
>   scheme: HTTP
43,45c33
< name: etcd-data
< - mountPath: /etc/kubernetes/pki/etcd
<   name: etcd-certs
---
> name: etcd
51,55c39
< name: etcd-data
< - hostPath:
<   path: /etc/kubernetes/pki/etcd
<   type: DirectoryOrCreate
<   name: etcd-certs
---
> name: etcd

root@vagrant:~# ls /etc/kubernetes/pki/etcd
ca.crt  ca.key  healthcheck-client.crt  healthcheck-client.key  peer.crt  peer.key  server.crt  server.key
```

1.9.6 cluster on Ubuntu 17.10 Vagrant:

root@vagrant:/vagrant# 1.10_kubernetes/server/bin/kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests262738652/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [vagrant] and IPs [10.0.2.15]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug=""]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: net/http: TLS handshake timeout]
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""]
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state

This is my repro environment: https://github.com/stealthybox/vagrant-kubeadm-testing

Change these lines to 1.9.6-00 for the bootstrap: https://github.com/stealthybox/vagrant-kubeadm-testing/blob/9d4493e990c9bd742107b317641267c3ef3640cd/Vagrantfile#L18-L20

Then download the 1.10 server binaries into the repo, and they will be available on the guest in /vagrant
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#server-binaries
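A hedged sketch of fetching those binaries into the repo (the tarball name is assumed from the CHANGELOG's server-binaries table; the extracted directory is renamed to match the 1.10_kubernetes path used elsewhere in this thread):

```
# run from the root of the vagrant-kubeadm-testing checkout
curl -fsSL -O https://dl.k8s.io/v1.10.0/kubernetes-server-linux-amd64.tar.gz
tar -xzf kubernetes-server-linux-amd64.tar.gz
mv kubernetes 1.10_kubernetes          # exposed to the guest as /vagrant/1.10_kubernetes
ls 1.10_kubernetes/server/bin/kubeadm
```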

kubelet etcd-related logs:

root@vagrant:~# journalctl -xefu kubelet | grep -i etcd
Mar 28 16:32:07 vagrant kubelet[14676]: W0328 16:32:07.808776   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:32:07 vagrant kubelet[14676]: I0328 16:32:07.880412   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:34:27 vagrant kubelet[14676]: W0328 16:34:27.472534   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:57:33 vagrant kubelet[14676]: W0328 16:57:33.683648   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725564   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725637   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:35 vagrant kubelet[14676]: E0328 16:57:35.484901   14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)", container etcd: selfLink was empty, can't make reference
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889458   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889595   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd" (OuterVolumeSpecName: "etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320"). InnerVolumeSpecName "etcd". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.989892   14676 reconciler.go:297] Volume detached for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") on node "vagrant" DevicePath ""
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.688878   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Timeout: request did not complete within allowed duration
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.841447   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfbc5", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-certs\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:58:33 vagrant kubelet[14676]: E0328 16:58:33.844276   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfb82", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-data\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.692450   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.848007   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff641f915f", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"7278f85057e8bf5cb81c9f96d3b25320", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.containers{etcd}"}, Reason:"Killing", Message:"Killing container with id docker://etcd:Need to kill Pod", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.472661   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.473138   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473190   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473658   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: W0328 16:59:15.481336   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.483705   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.497391   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:00:34 vagrant kubelet[14676]: W0328 17:00:34.475851   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.720076   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.720107   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""; some request body already written
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.725335   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: I0328 17:01:07.728709   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.734475   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.740642   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:09 vagrant kubelet[14676]: E0328 17:01:09.484412   14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)", container etcd: selfLink was empty, can't make reference
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.848794   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849282   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849571   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data" (OuterVolumeSpecName: "etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-data". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849503   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs" (OuterVolumeSpecName: "etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-certs". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949925   14676 reconciler.go:297] Volume detached for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") on node "vagrant" DevicePath ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949975   14676 reconciler.go:297] Volume detached for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") on node "vagrant" DevicePath ""

The current workaround is to keep retrying the upgrade; at some point it will succeed.
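A hedged sketch of that retry loop (each attempt prompts for confirmation unless you script around it):

```
# keep re-running the upgrade until it happens to win the race
until kubeadm upgrade apply v1.10.0; do
  echo "upgrade attempt failed, retrying in 30s..."
  sleep 30
done
```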

@stealthybox Do you happen to have logs out of docker for the etcd container? Also, grep -i etcd might be masking some of the kubelet output, e.g. error messages that don't have the container name in them but are still relevant.

I just hit another weird edge case related to this bug. The kubeadm upgrade marked the etcd upgrade as complete before the new etcd image was pulled and the new static pod was deployed. This causes the upgrade to time out at a later step and the upgrade rollback to fail. It also leaves the cluster in a broken state; restoring the original etcd static pod manifest is needed to recover the cluster.

Oh yes, I'm also stuck there. My cluster is completely down. Can someone share some instructions on how to recover from this state?

Been there on my 2nd attempt to upgrade, just like @detiber described, quite painful. :cry:

Found some backed-up files in /etc/kubernetes/tmp; suspecting that etcd might be the culprit, I copied its old manifest over the new one in the manifests folder. At that point I had nothing to lose, because I had completely lost control of the cluster. Then, I don't remember exactly, but I think I restarted the whole machine and later downgraded everything back to v1.9.6. Eventually I regained control of the cluster and lost any motivation to mess with v1.10.0 again. It was not fun at all...

If you roll back the etcd static pod manifest from /etc/kubernetes/tmp, it's important to also roll back the apiserver manifest to the 1.9 version because of the new TLS configuration in 1.10.

^ You probably won't need to do this, though, because I believe the etcd upgrade blocks the rest of the control plane upgrade.

It seems that only the etcd manifest isn't rolled back on a failed upgrade; everything else is fine. After moving the backup manifest back over and restarting the kubelet, everything comes back fine.
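For anyone stuck in that state, a hedged sketch of the manual rollback (the backup directory name carries a random suffix, so adjust the path to whatever kubeadm created on your machine):

```
# restore the pre-upgrade etcd manifest and let the kubelet recreate the pod
# (pick the correct kubeadm-backup-manifests* directory if several exist)
cp /etc/kubernetes/tmp/kubeadm-backup-manifests*/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
systemctl restart kubelet
```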

I faced the same timeout problem, and kubeadm rolled the kube-apiserver manifest back to 1.9.6 but left the etcd manifest as is (read: with TLS enabled), obviously leading the apiserver to fail miserably and effectively breaking my master node. A good candidate for a separate issue report, I suppose.

@dvdmuckle @codepainters, unfortunately whether the rollback is successful depends on which component hits the race condition (etcd or the API server). I found a fix for the race condition, but it completely breaks the kubeadm upgrade. I'm working with @stealthybox to try to find a proper path forward for properly fixing the upgrade.

@codepainters I think it is the same issue.

There are a few underlying problems causing this issue:

  • The upgrade is generating a hash of the mirror pod for each component from the result of querying the mirror pod from the API. The upgrade then tests to see if this hashed value changes to determine if the pod is updated from the static manifest change. The hashed value includes fields that can be mutated for reasons other than the static manifest change (such as pod status updates). If the pod status changes between the hash comparisons, then the upgrade will continue to the next component prematurely.
  • The upgrade performs the etcd static pod manifest update (including adding TLS security to etcd) and attempts to use the apiserver to verify that the pod has been updated; however, the apiserver manifest has not been updated at this point to use TLS for communicating with etcd.

As a result, the upgrade currently only succeeds when there happens to be a pod status update for the etcd pod that causes the hash to change before the kubelet picks up the new static manifest for etcd. Additionally, the API server needs to remain available for the first part of the apiserver upgrade, when the upgrade tooling is querying the API before updating the apiserver manifest.
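A rough way to see the two signals involved (a hedged illustration, not the exact code path kubeadm uses): the kubelet's own kubernetes.io/config.hash annotation on the mirror pod only changes when the static manifest changes, while a hash over the full mirror pod object also moves on status-only updates:

```
# the kubelet's manifest hash for the etcd mirror pod (stable across status updates)
kubectl -n kube-system get pod etcd-$(hostname) \
  -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'; echo
# a hash over the whole pod object, which changes whenever the status is mutated
kubectl -n kube-system get pod etcd-$(hostname) -o yaml | sha256sum
```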

@detiber and I got on a call to discuss changes we need to make to the upgrade process.
We plan to implement 3 fixes for this bug in the 1.10.x patch releases:

  • Remove etcd TLS from upgrade.
    The current upgrade loop does batch modifications per component in a serial manner.
    Upgrading a component has no knowledge of dependent component configurations.
    Verifying an upgrade requires the APIServer is available to check the pod status.
    Etcd TLS requires a coupled etcd+apiserver configuration change which breaks this contract.
    This is the minimum viable change to fix this issue, and leaves upgraded clusters with insecure etcd.

  • Fix the mirror-pod hash race condition on pod status change.
    https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/upgrade/staticpods.go#L189.
    Upgrades will now be correct assuming compatibility between etcd and apiserver flags.

  • Upgrade TLS specifically in a separate phase.
    Etcd and the APIServer need to be upgraded together.
    kubeadm alpha phase ensure-etcd-tls ?.
    This phase should be runnable independently of a cluster upgrade.
    During a cluster upgrade, this phase should run before updating all of the components.


For 1.11 we want to:

  • Use the kubelet API for runtime checks of upgraded static pods.
    It's undesirable to rely on the apiserver and etcd for monitoring local processes like we are currently doing.
    A local source of data about pods is superior to relying on higher-order distributed kubernetes components.
    This will replace the current pod runtime checks in the upgrade loop.
    This will allow us to add checks to the ensure-etcd-tls phase.

alternative: Use the CRI to get pod info (demo'd viable using crictl).
caveat: CRI on dockershim and possibly other container runtimes does not currently support backward compatibility for CRI breaking changes.
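For reference, a hedged sketch of what such a local check could look like today, either via the CRI with crictl or via the kubelet's own view of its pods (pointing crictl at the right runtime socket and having credentials for the kubelet API are assumptions here):

```
# via the CRI (crictl must be configured for the runtime's socket, e.g. dockershim)
crictl pods --name etcd --state ready

# via the kubelet API on the node itself
curl -sk --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
     --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
     https://127.0.0.1:10250/pods
```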

TODO:

  • [ ] open and link issues for these 4 changes.

PR to address the static pod update race condition: https://github.com/kubernetes/kubernetes/pull/61942
cherry-pick PR for release-1.10 branch: https://github.com/kubernetes/kubernetes/pull/61954

@detiber do you mind explaining what race condition are we talking about? I'm not that familiar with kubeadm internals, yet it sounds interesting.

FYI - same problem/issue upgrading from 1.9.3
Tried the workaround of re-trying a number of times. Finally hit the race condition w/ the API server and the upgrade couldn't roll back.

@stealthybox thx, I didn't get it at first reading.

I am having the same issue while upgrading:

[ERROR APIServerHealth]: the API Server is unhealthy; /healthz didn't return "ok"
[ERROR MasterNodesReady]: couldn't list masters in cluster: Get https.......

Please help me with this. I am upgrading from 1.9.3 to 1.10.0. Initially, it was able to reach a certain point: "[upgrade/staticpods] Waiting for the kubelet to restart the component".

A temporary workaround is to ensure the certs and upgrade the etcd and apiserver pods manually, bypassing the checks.

Be sure to check your Config and add any flags for your use case:

kubectl -n kube-system edit cm kubeadm-config  # change featureGates
...
  featureGates: {}
...
kubeadm alpha phase certs all
kubeadm alpha phase etcd local
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config
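
After those phases finish, a quick sanity check that the control plane came back healthy (a hedged sketch assuming the default kubeadm PKI paths):

```
# control-plane pods should be Running again
kubectl -n kube-system get pods -o wide

# etcd should now be serving TLS on 2379
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health
```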

Thanks @stealthybox
For me, the upgrade apply process stalled on [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"...; however, the cluster was successfully upgraded.

@stealthybox I'm not sure, but it seems that something is broken after these steps, because kubeadm upgrade plan hangs after that:

[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.10.1
[upgrade/versions] kubeadm version: v1.10.1
[upgrade/versions] Latest stable version: v1.10.1

When applying the update, it also hung at [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"...

@kvaps @stealthybox this is most likely etcd issue (kubeadm speaks plain HTTP/2 to TLS-enabled etcd), I hit it too. See this other issue: https://github.com/kubernetes/kubeadm/issues/755

Honestly, I can't understand why the same TCP port is used for both TLS and non-TLS etcd listeners; it only causes trouble like this. Getting a plain old _connection refused_ would give an immediate hint; here I had to resort to tcpdump to understand what was going on.
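For anyone else debugging this, a quicker check than tcpdump (a hedged sketch): if this prints a certificate subject, the etcd listener on 2379 is already TLS, and any plain-HTTP client will see exactly this kind of hang instead of a clean refusal.

```
# does etcd on 2379 present a TLS certificate?
openssl s_client -connect 127.0.0.1:2379 </dev/null 2>/dev/null | openssl x509 -noout -subject
```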

OH!
Shoot you're right, that only works with my local TLS patch for the Etcd status check.

Do this to finish the upgrade:

kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config

edited the above workaround to be correct

@stealthybox the second kubeadm command doesn't work:

# kubeadm alpha phase upload-config
The --config flag is mandatory

@renich just give it the filepath of your config

If you don't use any custom settings, you can pass it an empty file.
Here's a simple way to do that in bash:

1.10_kubernetes/server/bin/kubeadm alpha phase upload-config --config <(echo)

This should now be resolved with the merging of https://github.com/kubernetes/kubernetes/pull/62655 and will be part of the v1.10.2 release.

I can confirm that 1.10.0 -> 1.10.2 upgrade with kubeadm 1.10.2 was smooth, no timeouts

I still get a timeout on 1.10.0 -> 1.10.2, but a different one:
[upgrade/staticpods] Waiting for the kubelet to restart the component
Static pod: kube-apiserver-master hash: a273591d3207fcd9e6fd0c308cc68d64
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

I'm not sure what to do...

@denis111 check the API server logs while doing the upgrade (find the container with docker ps and check its logs). I feel like you might be running into an issue I'm also running into.

@dvdmuckle Well, I don't see any errors in those logs, only entries starting with I and a few W.
And I think the hash of kube-apiserver doesn't change during the upgrade.

I have an ARM64 cluster on 1.9.3; I successfully updated it to 1.9.7 but hit the same timeout problem when upgrading from 1.9.7 to 1.10.2.

I even tried editing and recompiling kubeadm to increase the timeouts (like these recent commits: https://github.com/anguslees/kubernetes/commits/kubeadm-gusfork), with the same results.

$ sudo kubeadm upgrade apply  v1.10.2 --force
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.2"
[upgrade/versions] Cluster version: v1.9.7
[upgrade/versions] kubeadm version: v1.10.2-dirty
[upgrade/version] Found 1 potential version compatibility errors but skipping since the --force flag is set:

   - Specified version to upgrade to "v1.10.2" is higher than the kubeadm version "v1.10.2-dirty". Upgrade kubeadm first using the tool you used to install kubeadm
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.2"...
Static pod: kube-apiserver-kubemaster1 hash: ed7578d5bf9314188dca798386bcfb0e
Static pod: kube-controller-manager-kubemaster1 hash: e0c3f578f1c547dcf9996e1d3390c10c
Static pod: kube-scheduler-kubemaster1 hash: 52e767858f52ac4aba448b1a113884ee
[upgrade/etcd] Upgrading to TLS for etcd
Static pod: etcd-kubemaster1 hash: 413224efa82e36533ce93e30bd18e3a8
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/etcd.yaml"
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing etcd/server certificate and key.
[certificates] Using the existing etcd/peer certificate and key.
[certificates] Using the existing etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/etcd.yaml"
[upgrade/staticpods] Not waiting for pod-hash change for component "etcd"
[upgrade/etcd] Waiting for etcd to become available
[util/etcd] Waiting 30s for initial delay
[util/etcd] Attempting to get etcd status 1/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 2/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 3/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 4/10
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-scheduler.yaml"
[upgrade/staticpods] The etcd manifest will be restored if component "kube-apiserver" fails to upgrade
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing apiserver-etcd-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

Upgrade v1.10.2 -> v1.10.2 (which may be nonsense, just testing...).

Ubuntu 16.04.

And, it fails with an error.

kubeadm upgrade apply v1.10.2

[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

I wonder if this is still tracked in some issue... I couldn't find one.

I'm also seeing upgrades still failing with the timed out waiting for the condition error.

Edit: Moved discussion to a new ticket https://github.com/kubernetes/kubeadm/issues/850, please discuss there.

If anyone else has this problem with 1.9.x:

If you are on AWS with custom hostnames, you need to edit the kubeadm-config ConfigMap and set nodeName to the AWS internal name: ip-xx-xx-xx-xx.$REGION.compute.internal

kubectl -n kube-system edit cm kubeadm-config -oyaml

This is in addition to setting the etcd client URL to http. I'm not yet on later versions to see whether they fixed that.

This is because kubeadm tries to read this path in the API: /api/v1/namespaces/kube-system/pods/kube-apiserver-$NodeName
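A hedged way to check for that mismatch before upgrading: the nodeName recorded in the kubeadm-config MasterConfiguration should match the name the kubelet actually registered with (on AWS, the internal DNS name):

```
# compare the nodeName recorded by kubeadm ...
kubectl -n kube-system get cm kubeadm-config -o yaml | grep nodeName
# ... with the actual node names in the cluster
kubectl get nodes -o name
```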

Since the timeout was increased in 1.10.6, I successfully updated my 1.9.7 deployment to 1.10.6 a couple of weeks ago.

Planning to upgrade to 1.11.2 as soon as the .deb packages are ready, since the same changes are in that version.

My cluster runs on-premises on ARM64 boards.
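For reference, a hedged sketch of what that package bump typically looks like on Debian/Ubuntu (the -00 package revision is assumed; check apt-cache madison kubeadm for the exact string):

```
apt-get update -qq
apt-get install -qy kubeadm=1.10.6-00
kubeadm upgrade plan
kubeadm upgrade apply v1.10.6
apt-get install -qy kubelet=1.10.6-00 kubectl=1.10.6-00
```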
