Kubeadm: upgrade from 1.9.6 to 1.10.0 fails with a timeout

Created on 2018-03-28  ·  42 comments  ·  Source: kubernetes/kubeadm

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Environment

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:

Scaleway bare metal C2S

  • OS (e.g. from /etc/os-release):

Ubuntu Xenial (16.04 LTS) (GNU/Linux 4.4.122-mainline-rev1 x86_64)

  • Kernel (e.g. uname -a):

Linux amd64-master-1 4.4.122-mainline-rev1 #1 SMP Sun Mar 18 10:44:19 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

What happened?

Tried to upgrade from 1.9.6 to 1.10.0 and got this error:

kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests411909119/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [arm-master-1] and IPs [10.1.244.57]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests180476754/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state

What you expected to happen?

Successful upgrade

How to reproduce it (as minimally and precisely as possible)?

Install the 1.9.6 packages and init a 1.9.6 cluster:

curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | tee /etc/apt/sources.list.d/kubernetes.list
apt-get update -qq
apt-get install -qy kubectl=1.9.6-00
apt-get install -qy kubelet=1.9.6-00
apt-get install -qy kubeadm=1.9.6-00

Edit the kubeadm-config and change featureGates from a string to a map, as reported in https://github.com/kubernetes/kubernetes/issues/61764.

kubectl -n kube-system edit cm kubeadm-config

....
featureGates: {}
....

Download kubeadm 1.10.0 and run kubeadm upgrade plan and kubeadm upgrade apply v1.10.0

kind/bug priority/critical-urgent triaged

Most helpful comment

The temporary workaround is to ensure the certs and upgrade the etcd and apiserver pods, bypassing the checks.

Make sure to check your config and add any flags for your use case:

kubectl -n kube-system edit cm kubeadm-config  # change featureFlags
...
  featureGates: {}
...
kubeadm alpha phase certs all
kubeadm alpha phase etcd local
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config

All 42 comments

Working on reproducing this locally.

After retrying 10 times, it finally worked.

Here's my etcd manifest diff:

root@vagrant:~# diff /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml
16,17c16,17
<     - --listen-client-urls=https://127.0.0.1:2379
<     - --advertise-client-urls=https://127.0.0.1:2379
---
>     - --listen-client-urls=http://127.0.0.1:2379
>     - --advertise-client-urls=http://127.0.0.1:2379
19,27c19
<     - --key-file=/etc/kubernetes/pki/etcd/server.key
<     - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
<     - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
<     - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
<     - --client-cert-auth=true
<     - --peer-client-cert-auth=true
<     - --cert-file=/etc/kubernetes/pki/etcd/server.crt
<     - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
<     image: gcr.io/google_containers/etcd-amd64:3.1.12
---
>     image: gcr.io/google_containers/etcd-amd64:3.1.11
29,35d20
<       exec:
<         command:
<         - /bin/sh
<         - -ec
<         - ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
<           --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
<           get foo
36a22,26
>       httpGet:
>         host: 127.0.0.1
>         path: /health
>         port: 2379
>         scheme: HTTP
43,45c33
<       name: etcd-data
<     - mountPath: /etc/kubernetes/pki/etcd
<       name: etcd-certs
---
>       name: etcd
51,55c39
<     name: etcd-data
<   - hostPath:
<       path: /etc/kubernetes/pki/etcd
<       type: DirectoryOrCreate
<     name: etcd-certs
---
>     name: etcd
root@vagrant:~# ls /etc/kubernetes/pki/etcd
ca.crt  ca.key  healthcheck-client.crt  healthcheck-client.key  peer.crt  peer.key  server.crt  server.key

1.9.6 cluster on an Ubuntu 17.10 Vagrant box:

root@vagrant:/vagrant# 1.10_kubernetes/server/bin/kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests262738652/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [vagrant] and IPs [10.0.2.15]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug=""]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: net/http: TLS handshake timeout]
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""]
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state

Here's my reproduction environment: https:

Change these lines to 1.9.6-00 for bootstrapping: https:

Then download the 1.10 server binaries into the repo and they will be available to the guest in /vagrant
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#server-binaries

kubelet etcd-related logs:

root@vagrant:~# journalctl -xefu kubelet | grep -i etcd
Mar 28 16:32:07 vagrant kubelet[14676]: W0328 16:32:07.808776   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:32:07 vagrant kubelet[14676]: I0328 16:32:07.880412   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:34:27 vagrant kubelet[14676]: W0328 16:34:27.472534   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:57:33 vagrant kubelet[14676]: W0328 16:57:33.683648   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725564   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725637   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:35 vagrant kubelet[14676]: E0328 16:57:35.484901   14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)", container etcd: selfLink was empty, can't make reference
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889458   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889595   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd" (OuterVolumeSpecName: "etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320"). InnerVolumeSpecName "etcd". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.989892   14676 reconciler.go:297] Volume detached for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") on node "vagrant" DevicePath ""
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.688878   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Timeout: request did not complete within allowed duration
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.841447   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfbc5", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-certs\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:58:33 vagrant kubelet[14676]: E0328 16:58:33.844276   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfb82", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-data\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.692450   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.848007   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff641f915f", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"7278f85057e8bf5cb81c9f96d3b25320", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.containers{etcd}"}, Reason:"Killing", Message:"Killing container with id docker://etcd:Need to kill Pod", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.472661   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.473138   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473190   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473658   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: W0328 16:59:15.481336   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.483705   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.497391   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:00:34 vagrant kubelet[14676]: W0328 17:00:34.475851   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.720076   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.720107   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""; some request body already written
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.725335   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: I0328 17:01:07.728709   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.734475   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.740642   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:09 vagrant kubelet[14676]: E0328 17:01:09.484412   14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)", container etcd: selfLink was empty, can't make reference
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.848794   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849282   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849571   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data" (OuterVolumeSpecName: "etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-data". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849503   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs" (OuterVolumeSpecName: "etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-certs". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949925   14676 reconciler.go:297] Volume detached for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") on node "vagrant" DevicePath ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949975   14676 reconciler.go:297] Volume detached for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") on node "vagrant" DevicePath ""

The current workaround is to keep retrying the upgrade, and at some point it will succeed.
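
If you want to script the retries, a minimal loop looks something like this (it assumes kubeadm's -y/--yes flag to skip the confirmation prompt; note that later comments report cases where the rollback itself fails, so keep an eye on each attempt):

# keep retrying until an upgrade attempt succeeds
until kubeadm upgrade apply v1.10.0 -y; do
  echo "upgrade attempt failed, retrying in 30s..."
  sleep 30
done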

@stealthybox do you happen to have the docker logs from the etcd container? Also, grep -i etcd may be masking some kubelet output, e.g. error messages that don't have the container name in them but are still useful.

I just hit another weird case related to this bug. The kubeadm upgrade marked the etcd upgrade as complete before the new etcd image was pulled and the new static pod was deployed. This causes the upgrade to time out on a later step, and the upgrade rollback fails. It also leaves the cluster in a broken state. Restoring the original etcd static pod manifest is needed to recover the cluster.

Oh yeah, I got stuck there too. My cluster is completely broken. Can anyone share some instructions on how to recover from this state?

Just as @detiber described, that's where I was on my second attempt at the upgrade.

I found some backed-up stuff in /etc/kubernetes/tmp and, having a feeling that etcd might be the culprit, I copied its old manifest over the new one in the manifests folder. I had nothing to lose at that point, since I had completely lost control of the cluster. Then, I don't remember exactly, but I think I rebooted the whole machine and later downgraded everything to v1.9.6. Eventually I regained control of the cluster and then lost the motivation to try v1.10.0 again. Not fun at all...

If you roll back the etcd static pod manifest from /etc/kubernetes/tmp, it's important that you also roll back the apiserver manifest to the 1.9 version, because of the new TLS configuration in 1.10.

^ But you probably don't need to do that, since I believe the etcd upgrade blocks the rest of the control plane upgrade.

It seems only the etcd manifest isn't rolled back on a failed upgrade; everything else is fine. After moving the backup manifest back and restarting the kubelet, everything was back to normal.
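
For anyone stuck in that state, the manual recovery boils down to copying the backed-up manifest(s) over the new ones and restarting the kubelet (sketch only; the timestamped backup directory name differs on every run, so use the path printed by your own upgrade):

BACKUP=/etc/kubernetes/tmp/kubeadm-backup-manifests180476754   # example path taken from the log above
cp "$BACKUP"/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
# if kube-apiserver.yaml was also replaced, restore it from its backup directory as well
systemctl restart kubelet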

I hit the same timeout problem, and kubeadm rolled the kube-apiserver manifest back to 1.9.6 but kept the etcd manifest (read: TLS enabled), which obviously caused the apiserver to fail, effectively breaking my master node. I think this is a good candidate for a separate issue report.

@dvdmuckle @codepainters, unfortunately whether the rollback succeeds depends on which component hits the race condition (etcd or the api server). I found a fix for the race condition, but it completely breaks the kubeadm upgrade. I'm working with @stealthybox to try to find the right way to fix the upgrade properly.

@codepainters I think it's the same issue.

There are a few underlying problems causing this issue:

  • The upgrade generates a hash of the mirror pod for each component, based on querying the mirror pod from the API. The upgrade then tests whether this hash changes to determine whether the pod was updated as a result of the static manifest change. The hash includes fields that can mutate for reasons other than a static manifest change (such as pod status updates). If the pod status changes between the hash comparisons, the upgrade proceeds to the next component prematurely.
  • The upgrade performs the etcd static pod manifest update (which includes adding TLS security to etcd) and attempts to use the apiserver to verify that the pod has been updated, but at that point the apiserver manifest has not yet been updated and cannot talk to etcd over TLS.

As a result, the upgrade currently only succeeds if a pod status update for the etcd pod happens to cause the hash to change before the kubelet picks up etcd's new static manifest. In addition, the api server needs to remain available during the first part of the apiserver upgrade, when the upgrade tool queries the api before updating the apiserver manifest.
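
In other words, the only reliable signal that the kubelet actually restarted etcd with the new manifest is local, not the API. A rough manual check while the upgrade hangs (illustration only, not something kubeadm does):

grep 'image:' /etc/kubernetes/manifests/etcd.yaml                 # image requested by the new static manifest
docker ps --filter name=etcd --format '{{.Image}}\t{{.Status}}'   # image the kubelet is actually running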

@detiber and I got on a call to discuss the changes we need to make to the upgrade process.
We plan to implement 3 fixes for this bug in a 1.10.x patch release:

  • Remove etcd TLS from the upgrade.
    The current upgrade loop modifies each component serially, as a batch.
    Upgrading a component has no knowledge of dependent component configuration.
    Validating the upgrade requires the APIServer to be available to check pod status.
    Etcd TLS requires a coupled etcd + apiserver configuration change, which breaks this contract.
    This is the minimum viable change to fix this issue, and it leaves the upgraded cluster with an insecure etcd.

  • Fix the mirror-pod hash race condition on pod status changes.
    https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/upgrade/staticpods.go#L189.
    Upgrades will now be correct, assuming compatibility between the etcd and apiserver flags.

  • Upgrade TLS specifically in a separate phase.
    Etcd and the APIServer need to be upgraded together.
    kubeadm alpha phase ensure-etcd-tls?.
    This phase should be runnable independently of a cluster upgrade.
    During a cluster upgrade, this phase should run before updating all components (a rough command-line sketch follows this list).
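
As a rough sketch of that ordering (hypothetical: the ensure-etcd-tls phase name comes from the proposal above and does not exist yet):

kubeadm alpha phase ensure-etcd-tls   # move etcd and the apiserver to TLS together, independently of the upgrade
kubeadm upgrade apply v1.10.x         # then run the regular component upgrade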


For 1.11 we want to:

  • Use the kubelet API for runtime checks of upgraded static pods.
    Relying on the apiserver and etcd to monitor local processes, as we currently do, is undesirable.
    A local source of data about pods is preferable to depending on higher-order distributed kubernetes components.
    This will replace the current pod runtime checks in the upgrade loop.
    This will allow us to add checks to the ensure-etcd-tls phase.

Alternative: use the CRI to get pod info (demoed as viable using crictl).
Caveat: the CRI on dockershim (and possibly other container runtimes) does not currently support backwards compatibility for CRI breaking changes.
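
For reference, querying pods through the CRI looks roughly like this with crictl pointed at the dockershim socket (the socket path is dockershim's default; other runtimes expose their own endpoint):

crictl --runtime-endpoint unix:///var/run/dockershim.sock pods --name etcd   # list the etcd pod sandbox
crictl --runtime-endpoint unix:///var/run/dockershim.sock ps --name etcd     # list the running etcd container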

TODO:

  • [ ] Open and link issues for these 4 changes.

PR to fix the static pod update race condition: https:
Cherry-pick PR for the release-1.10 branch: https:

@detiber, would you mind explaining what race condition we're talking about? I'm not familiar with kubeadm internals, but it sounds interesting.

FYI - same issue/problem upgrading from 1.9.3.
Tried the workaround of retrying multiple times. Eventually hit the race condition with the API server, and the upgrade could not roll back.

@stealthybox thx, I didn't get it on first read.

I'm facing the same issue.. [ERROR APIServerHealth]: the API Server is unhealthy; /healthz didn't return "ok"
[ERROR MasterNodesReady]: couldn't list masters in cluster: Get https....... while upgrading. Please help me with this. I'm upgrading from 1.9.3 to 1.10.0. Initially it was able to reach a certain point at "[upgrade/staticpods] Waiting for the kubelet to restart the component".

The temporary workaround is to ensure the certs and upgrade the etcd and apiserver pods, bypassing the checks.

Make sure to check your config and add any flags for your use case:

kubectl -n kube-system edit cm kubeadm-config  # change featureFlags
...
  featureGates: {}
...
kubeadm alpha phase certs all
kubeadm alpha phase etcd local
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config
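
After running those phases it's worth double-checking that the regenerated manifests agree with each other, i.e. that etcd now serves TLS and the apiserver talks to it with the matching client certs (a quick sanity check, not an official procedure):

grep -- '--cert-file' /etc/kubernetes/manifests/etcd.yaml         # etcd should reference the /etc/kubernetes/pki/etcd certs
grep -- '--etcd-' /etc/kubernetes/manifests/kube-apiserver.yaml   # apiserver should use an https:// etcd endpoint plus client certs
kubectl -n kube-system get pods                                   # etcd-* and kube-apiserver-* should come back Running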

Thanks @stealthybox
For me, the upgrade apply process got stuck at [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"... but the cluster was upgraded successfully.

@stealthybox I'm not sure, but something seems to be wrong after these steps, because kubeadm upgrade plan hangs afterwards:

[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.10.1
[upgrade/versions] kubeadm version: v1.10.1
[upgrade/versions] Latest stable version: v1.10.1

Applying the upgrade also hangs for me at [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"...

@kvaps @stealthybox this is most likely the etcd problem (kubeadm speaking plain HTTP/2 to a TLS-enabled etcd), which I hit too. See the other issue: https:

Honestly, I don't get why the same TCP port is used for both TLS and non-TLS etcd listeners; it only causes trouble like this. A plain old connection refused would give an immediate hint, whereas here I had to resort to tcpdump to understand what was going on.
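
A quick way to see which protocol the local etcd is actually speaking on 2379 (illustration only; the cert paths are the kubeadm defaults visible in the manifests above):

curl -s http://127.0.0.1:2379/health           # answers only on a pre-upgrade, non-TLS etcd
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
     https://127.0.0.1:2379/health             # answers only once etcd has been switched to TLS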

Oh!
Right, that only works with my local TLS patch for the etcd status check.

Doing the following completed the upgrade:

kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config

Edited the workaround above to be correct.

@stealthybox the second kubeadm command doesn't work:

# kubeadm alpha phase upload-config
The --config flag is mandatory

@renich just give it the file path of your config

If you're not using any custom settings, you can pass it an empty file.
Here's a simple way to do that in bash:

1.10_kubernetes/server/bin/kubeadm alpha phase upload-config --config <(echo)

This should now be resolved with the merge of https://github.com/kubernetes/kubernetes/pull/62655 and will be part of the v1.10.2 release.

I can confirm that a 1.10.0 -> 1.10.2 upgrade with kubeadm 1.10.2 went smoothly, no timeouts.

I still get a timeout on 1.10.0 -> 1.10.2, but a different one:

[upgrade/staticpods] Waiting for the kubelet to restart the component
Static pod: kube-apiserver-master hash: a273591d3207fcd9e6fd0c308cc68d64
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

I'm not sure what to do now...

@denis111 check the API server logs while doing the upgrade, using docker ps. I feel like you might be hitting an issue that I'm also hitting.

@dvdmuckle Well, I don't see any errors in that log, only entries starting with I and a few with W.
And I think the hash of kube-apiserver doesn't change during the upgrade.

I have an ARM64 cluster on 1.9.3 that I successfully updated to 1.9.7, but I hit the same timeout issue upgrading from 1.9.7 to 1.10.2.

I even tried editing and recompiling kubeadm to increase the timeouts (like these last commits https://github.com/anguslees/kubernetes/commits/kubeadm-gusfork), with the same results.

$ sudo kubeadm upgrade apply  v1.10.2 --force
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.2"
[upgrade/versions] Cluster version: v1.9.7
[upgrade/versions] kubeadm version: v1.10.2-dirty
[upgrade/version] Found 1 potential version compatibility errors but skipping since the --force flag is set:

   - Specified version to upgrade to "v1.10.2" is higher than the kubeadm version "v1.10.2-dirty". Upgrade kubeadm first using the tool you used to install kubeadm
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.2"...
Static pod: kube-apiserver-kubemaster1 hash: ed7578d5bf9314188dca798386bcfb0e
Static pod: kube-controller-manager-kubemaster1 hash: e0c3f578f1c547dcf9996e1d3390c10c
Static pod: kube-scheduler-kubemaster1 hash: 52e767858f52ac4aba448b1a113884ee
[upgrade/etcd] Upgrading to TLS for etcd
Static pod: etcd-kubemaster1 hash: 413224efa82e36533ce93e30bd18e3a8
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/etcd.yaml"
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing etcd/server certificate and key.
[certificates] Using the existing etcd/peer certificate and key.
[certificates] Using the existing etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/etcd.yaml"
[upgrade/staticpods] Not waiting for pod-hash change for component "etcd"
[upgrade/etcd] Waiting for etcd to become available
[util/etcd] Waiting 30s for initial delay
[util/etcd] Attempting to get etcd status 1/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 2/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 3/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 4/10
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-scheduler.yaml"
[upgrade/staticpods] The etcd manifest will be restored if component "kube-apiserver" fails to upgrade
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing apiserver-etcd-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

Upgrading v1.10.2 -> v1.10.2 (which is probably nonsense. Just testing...)

Ubuntu 16.04.

And it fails with an error.

kubeadm upgrade apply v1.10.2

[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

I wonder if this is still being tracked in some issue... I can't find one.

I'm also still seeing upgrades fail with the timed out waiting for the condition error.

Edit: moved this discussion to the new ticket https://github.com/kubernetes/kubeadm/issues/850, please discuss there.

In case anyone else is hitting this on 1.9.x:

If you are on AWS with custom hostnames, you need to edit the kubeadm-config configmap and set nodeName to the AWS internal name: ip-xx-xx-xx-xx.$REGION.compute.internal

kubectl -n kube-system edit cm kubeadm-config -oyaml

That, in addition to setting the etcd client to http. I haven't gotten around to checking whether they've fixed this yet.

This is because kubeadm tries to read this path in the api: /api/v1/namespaces/kube-system/pods/kube-apiserver-$NodeName
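
Concretely, the edit looks something like this (the nodeName value below is a hypothetical example; use your node's actual AWS internal DNS name):

kubectl -n kube-system edit cm kubeadm-config
# in the MasterConfiguration, set for example:
#   nodeName: ip-10-0-0-12.eu-west-1.compute.internal
# so that the kube-apiserver-<nodeName> pod kubeadm looks up actually exists in the API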

Since the timeouts were increased in 1.10.6, I successfully updated my 1.9.7 deployment to 1.10.6 a few weeks ago.

Planning to upgrade to 1.11.2 as soon as the .deb packages are ready, since the same change is in that release.

My cluster runs locally on ARM64 boards.
