BUG REPORT
kubeadm version: 1.11
Environment:
kubectl version: 1.11
uname -a: 3.10.0-693.17.1.el7.x86_64
After kubeadm init, the coredns pods stay in Error:
NAME READY STATUS RESTARTS AGE
coredns-78fcdf6894-ljdjp 0/1 Error 6 9m
coredns-78fcdf6894-p6flm 0/1 Error 6 9m
etcd-master 1/1 Running 0 8m
heapster-5bbdfbff9f-h5h2n 1/1 Running 0 9m
kube-apiserver-master 1/1 Running 0 8m
kube-controller-manager-master 1/1 Running 0 8m
kube-proxy-5642r 1/1 Running 0 9m
kube-scheduler-master 1/1 Running 0 8m
kubernetes-dashboard-6948bdb78-bwkvx 1/1 Running 0 9m
weave-net-r5jkg 2/2 Running 0 9m
The logs of both pods show the following:
standard_init_linux.go:178: exec user process caused "operation not permitted"
@kubernetes/sig-network-bugs
@carlosmkb, what is your docker version?
I find this hard to believe, we pretty extensively test CentOS 7 on our side.
Do you have the system and pod logs?
@dims, that makes sense, I will try.
@neolit123 and @timothysc
docker version: docker-1.13.1-63.git94f4240.el7.centos.x86_64
coredns pods log: standard_init_linux.go:178: exec user process caused "operation not permitted"
system log (journalctl -xeu kubelet):
Jul 17 23:45:17 server.raid.local kubelet[20442]: E0717 23:45:17.679867 20442 pod_workers.go:186] Error syncing pod dd030886-89f4-11e8-9786-0a92797fa29e ("cas-7d6d97c7bd-mzw5j_raidcloud(dd030886-89f4-11e8-9786-0a92797fa29e)"), skipping: failed to "StartContainer" for "cas" with ImagePullBackOff: "Back-off pulling image \"registry.raidcloud.io/raidcloud/cas:180328.pvt.01\""
Jul 17 23:45:18 server.raid.local kubelet[20442]: I0717 23:45:18.679059 20442 kuberuntime_manager.go:513] Container {Name:json2ldap Image:registry.raidcloud.io/raidcloud/json2ldap:180328.pvt.01 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:default-token-f2cmq ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/,Port:8080,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:30,TimeoutSeconds:5,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Jul 17 23:45:18 server.raid.local kubelet[20442]: E0717 23:45:18.680001 20442 pod_workers.go:186] Error syncing pod dcc39ce2-89f4-11e8-9786-0a92797fa29e ("json2ldap-666fc85686-tmxrr_raidcloud(dcc39ce2-89f4-11e8-9786-0a92797fa29e)"), skipping: failed to "StartContainer" for "json2ldap" with ImagePullBackOff: "Back-off pulling image \"registry.raidcloud.io/raidcloud/json2ldap:180328.pvt.01\""
Jul 17 23:45:21 server.raid.local kubelet[20442]: I0717 23:45:21.678232 20442 kuberuntime_manager.go:513] Container {Name:coredns Image:k8s.gcr.io/coredns:1.1.3 Command:[] Args:[-conf /etc/coredns/Corefile] WorkingDir: Ports:[{Name:dns HostPort:0 ContainerPort:53 Protocol:UDP HostIP:} {Name:dns-tcp HostPort:0 ContainerPort:53 Protocol:TCP HostIP:} {Name:metrics HostPort:0 ContainerPort:9153 Protocol:TCP HostIP:}] EnvFrom:[] Env:[] Resources:{Limits:map[memory:{i:{value:178257920 scale:0} d:{Dec:<nil>} s:170Mi Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:73400320 scale:0} d:{Dec:<nil>} s:70Mi Format:BinarySI}]} VolumeMounts:[{Name:config-volume ReadOnly:true MountPath:/etc/coredns SubPath: MountPropagation:<nil>} {Name:coredns-token-6nhgg ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/health,Port:8080,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:60,TimeoutSeconds:5,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:5,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[NET_BIND_SERVICE],Drop:[all],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*true,AllowPrivilegeEscalation:*false,RunAsGroup:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Jul 17 23:45:21 server.raid.local kubelet[20442]: I0717 23:45:21.678311 20442 kuberuntime_manager.go:757] checking backoff for container "coredns" in pod "coredns-78fcdf6894-znfvw_kube-system(9b44aa92-89f7-11e8-9786-0a92797fa29e)"
Jul 17 23:45:21 server.raid.local kubelet[20442]: I0717 23:45:21.678404 20442 kuberuntime_manager.go:767] Back-off 5m0s restarting failed container=coredns pod=coredns-78fcdf6894-znfvw_kube-system(9b44aa92-89f7-11e8-9786-0a92797fa29e)
Jul 17 23:45:21 server.raid.local kubelet[20442]: E0717 23:45:21.678425 20442 pod_workers.go:186] Error syncing pod 9b44aa92-89f7-11e8-9786-0a92797fa29e ("coredns-78fcdf6894-znfvw_kube-system(9b44aa92-89f7-11e8-9786-0a92797fa29e)"), skipping: failed to "StartContainer" for "coredns" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=coredns pod=coredns-78fcdf6894-znfvw_kube-system(9b44aa92-89f7-11e8-9786-0a92797fa29e)"
Jul 17 23:45:22 server.raid.local kubelet[20442]: I0717 23:45:22.679145 20442 kuberuntime_manager.go:513] Container {Name:login Image:registry.raidcloud.io/raidcloud/admin:180329.pvt.05 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:login-config ReadOnly:true MountPath:/usr/share/nginx/conf/ SubPath: MountPropagation:<nil>} {Name:default-token-f2cmq ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/health,Port:8080,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:5,TimeoutSeconds:5,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Jul 17 23:45:22 server.raid.local kubelet[20442]: E0717 23:45:22.679941 20442 pod_workers.go:186] Error syncing pod dc8392a9-89f4-11e8-9786-0a92797fa29e ("login-85ffb66bb8-5l9fq_raidcloud(dc8392a9-89f4-11e8-9786-0a92797fa29e)"), skipping: failed to "StartContainer" for "login" with ImagePullBackOff: "Back-off pulling image \"registry.raidcloud.io/raidcloud/admin:180329.pvt.05\""
Jul 17 23:45:23 server.raid.local kubelet[20442]: I0717 23:45:23.678172 20442 kuberuntime_manager.go:513] Container {Name:coredns Image:k8s.gcr.io/coredns:1.1.3 Command:[] Args:[-conf /etc/coredns/Corefile] WorkingDir: Ports:[{Name:dns HostPort:0 ContainerPort:53 Protocol:UDP HostIP:} {Name:dns-tcp HostPort:0 ContainerPort:53 Protocol:TCP HostIP:} {Name:metrics HostPort:0 ContainerPort:9153 Protocol:TCP HostIP:}] EnvFrom:[] Env:[] Resources:{Limits:map[memory:{i:{value:178257920 scale:0} d:{Dec:<nil>} s:170Mi Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:73400320 scale:0} d:{Dec:<nil>} s:70Mi Format:BinarySI}]} VolumeMounts:[{Name:config-volume ReadOnly:true MountPath:/etc/coredns SubPath: MountPropagation:<nil>} {Name:coredns-token-6nhgg ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/health,Port:8080,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:60,TimeoutSeconds:5,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:5,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[NET_BIND_SERVICE],Drop:[all],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*true,AllowPrivilegeEscalation:*false,RunAsGroup:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Jul 17 23:45:23 server.raid.local kubelet[20442]: I0717 23:45:23.678412 20442 kuberuntime_manager.go:757] checking backoff for container "coredns" in pod "coredns-78fcdf6894-lcqt5_kube-system(9b45a068-89f7-11e8-9786-0a92797fa29e)"
Jul 17 23:45:23 server.raid.local kubelet[20442]: I0717 23:45:23.678532 20442 kuberuntime_manager.go:767] Back-off 5m0s restarting failed container=coredns pod=coredns-78fcdf6894-lcqt5_kube-system(9b45a068-89f7-11e8-9786-0a92797fa29e)
Jul 17 23:45:23 server.raid.local kubelet[20442]: E0717 23:45:23.678554 20442 pod_workers.go:186] Error syncing pod 9b45a068-89f7-11e8-9786-0a92797fa29e ("coredns-78fcdf6894-lcqt5_kube-system(9b45a068-89f7-11e8-9786-0a92797fa29e)"), skipping: failed to "StartContainer" for "coredns" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=coredns pod=coredns-78fcdf6894-lcqt5_kube-system(9b45a068-89f7-11e8-9786-0a92797fa29e)"
Found a couple instances of the same errors reported in other scenarios in the past.
Might try removing "allowPrivilegeEscalation: false" from the CoreDNS deployment to see if that helps.
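A minimal sketch of one way to inspect and edit that setting, assuming the stock kubeadm deployment name coredns in kube-system:
# Show the current securityContext of the coredns container:
kubectl -n kube-system get deployment coredns \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}'
# Then remove the "allowPrivilegeEscalation: false" line interactively:
kubectl -n kube-system edit deployment coredns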
Same issue for me. Similar setup: CentOS 7.4.1708, Docker version 1.13.1, build 94f4240/1.13.1 (the build that comes with CentOS):
[root@faas-A01 ~]# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-2vssv 2/2 Running 0 9m
kube-system calico-node-4vr7t 2/2 Running 0 7m
kube-system calico-node-nlfnd 2/2 Running 0 17m
kube-system calico-node-rgw5w 2/2 Running 0 23m
kube-system coredns-78fcdf6894-p4wbl 0/1 CrashLoopBackOff 9 30m
kube-system coredns-78fcdf6894-r4pwf 0/1 CrashLoopBackOff 9 30m
kube-system etcd-faas-a01.sl.cloud9.ibm.com 1/1 Running 0 29m
kube-system kube-apiserver-faas-a01.sl.cloud9.ibm.com 1/1 Running 0 29m
kube-system kube-controller-manager-faas-a01.sl.cloud9.ibm.com 1/1 Running 0 29m
kube-system kube-proxy-55csj 1/1 Running 0 17m
kube-system kube-proxy-56r8c 1/1 Running 0 30m
kube-system kube-proxy-kncql 1/1 Running 0 9m
kube-system kube-proxy-mf2bp 1/1 Running 0 7m
kube-system kube-scheduler-faas-a01.sl.cloud9.ibm.com 1/1 Running 0 29m
[root@faas-A01 ~]# kubectl logs --namespace=all coredns-78fcdf6894-p4wbl
Error from server (NotFound): namespaces "all" not found
[root@faas-A01 ~]# kubectl logs --namespace=kube-system coredns-78fcdf6894-p4wbl
standard_init_linux.go:178: exec user process caused "operation not permitted"
Just in case: SELinux is in permissive mode on all nodes.
And I'm using Calico (not Weave like @carlosmkb).
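(For anyone comparing setups, a quick way to confirm the SELinux mode on each node; both commands ship with the standard SELinux userland:)
getenforce   # prints Enforcing, Permissive, or Disabled
sestatus     # fuller detail: current mode, config-file mode, policy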
[root@faas-A01 ~]# kubectl logs --namespace=kube-system coredns-78fcdf6894-p4wbl
standard_init_linux.go:178: exec user process caused "operation not permitted"
Ah - This is an error from kubectl when trying to get the logs, not the contents of the logs...
@chrisohaver, kubectl logs works with the other kube-system pods.
OK - have you tried removing "allowPrivilegeEscalation: false" from the CoreDNS deployment to see if that helps?
... does a kubectl describe of the coredns pod show anything interesting?
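For example, assuming the default k8s-app=kube-dns label that kubeadm gives the CoreDNS pods:
kubectl -n kube-system describe pod -l k8s-app=kube-dns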
Same issue for me.
CentOS Linux release 7.5.1804 (Core)
Docker version 1.13.1, build dded712/1.13.1
flannel as cni add-on
[root@k8s ~]# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-78fcdf6894-cfmm7 0/1 CrashLoopBackOff 12 15m
kube-system coredns-78fcdf6894-k65js 0/1 CrashLoopBackOff 11 15m
kube-system etcd-k8s.master 1/1 Running 0 14m
kube-system kube-apiserver-k8s.master 1/1 Running 0 13m
kube-system kube-controller-manager-k8s.master 1/1 Running 0 14m
kube-system kube-flannel-ds-fts6v 1/1 Running 0 14m
kube-system kube-proxy-4tdb5 1/1 Running 0 15m
kube-system kube-scheduler-k8s.master 1/1 Running 0 14m
[root@k8s ~]# kubectl logs coredns-78fcdf6894-cfmm7 -n kube-system
standard_init_linux.go:178: exec user process caused "operation not permitted"
[root@k8s ~]# kubectl describe pods coredns-78fcdf6894-cfmm7 -n kube-system
Name: coredns-78fcdf6894-cfmm7
Namespace: kube-system
Node: k8s.master/192.168.150.40
Start Time: Fri, 27 Jul 2018 00:32:09 +0800
Labels: k8s-app=kube-dns
pod-template-hash=3497892450
Annotations: <none>
Status: Running
IP: 10.244.0.12
Controlled By: ReplicaSet/coredns-78fcdf6894
Containers:
coredns:
Container ID: docker://3b7670fbc07084410984d7e3f8c0fa1b6d493a41d2a4e32f5885b7db9d602417
Image: k8s.gcr.io/coredns:1.1.3
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:db2bf53126ed1c761d5a41f24a1b82a461c85f736ff6e90542e9522be4757848
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 27 Jul 2018 00:46:30 +0800
Finished: Fri, 27 Jul 2018 00:46:30 +0800
Ready: False
Restart Count: 12
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-vqslm (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-vqslm:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-vqslm
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 16m (x6 over 16m) default-scheduler 0/1 nodes are available: 1 node(s) were not ready.
Normal Scheduled 16m default-scheduler Successfully assigned kube-system/coredns-78fcdf6894-cfmm7 to k8s.master
Warning BackOff 14m (x10 over 16m) kubelet, k8s.master Back-off restarting failed container
Normal Pulled 14m (x5 over 16m) kubelet, k8s.master Container image "k8s.gcr.io/coredns:1.1.3" already present on machine
Normal Created 14m (x5 over 16m) kubelet, k8s.master Created container
Normal Started 14m (x5 over 16m) kubelet, k8s.master Started container
Normal Pulled 11m (x4 over 12m) kubelet, k8s.master Container image "k8s.gcr.io/coredns:1.1.3" already present on machine
Normal Created 11m (x4 over 12m) kubelet, k8s.master Created container
Normal Started 11m (x4 over 12m) kubelet, k8s.master Started container
Warning BackOff 2m (x56 over 12m) kubelet, k8s.master Back-off restarting failed container
[root@k8s ~]# uname
Linux
[root@k8s ~]# uname -a
Linux k8s.master 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@k8s ~]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[root@k8s ~]# docker --version
Docker version 1.13.1, build dded712/1.13.1
I have the same issue when SELinux is in permissive mode. When I disable it in /etc/selinux/config (SELINUX=disabled) and reboot the machine, the pod starts up.
Red Hat 7.4, kernel 3.10.0-693.11.6.el7.x86_64
docker-1.13.1-68.gitdded712.el7.x86_64
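A minimal sketch of that permanent disable (assuming the stock /etc/selinux/config layout; note the security implications discussed later in the thread):
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config   # takes effect after reboot
reboot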
FYI: it also works for me with SELinux disabled (not permissive, but _disabled_).
Docker version 1.13.1, build dded712/1.13.1
CentOS 7
[root@centosk8s ~]# kubectl logs coredns-78fcdf6894-rhx9p -n kube-system
.:53
CoreDNS-1.1.3
linux/amd64, go1.10.1, b0fd575c
2018/07/27 16:37:31 [INFO] CoreDNS-1.1.3
2018/07/27 16:37:31 [INFO] linux/amd64, go1.10.1, b0fd575c
2018/07/27 16:37:31 [INFO] plugin/reload: Running configuration MD5 = 2a066f12ec80aeb2b92740dd74c17138
We are also experiencing this issue. We provision infrastructure through automation, so requiring a restart to completely disable SELinux is not acceptable. Are there any other workarounds while we wait for this to be fixed?
Try removing "allowPrivilegeEscalation: false" from the CoreDNS deployment to see if that helps.
Updating to a more recent version of docker (than 1.13) may also help.
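A hypothetical upgrade path on CentOS 7 via Docker's upstream repository; the exact package version string is an assumption and may differ in your environment:
yum remove -y docker docker-common docker-client   # drop the distro build
yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install -y docker-ce-17.03.3.ce                # a 17.03.x build, per the k8s recommendation
systemctl enable docker && systemctl start docker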
Same issue here
Docker version 1.2.6
CentOS 7
Like @lareeth, we also provision Kubernetes through automation with kubeadm, and requiring a restart to completely disable SELinux is likewise not acceptable for us. @chrisohaver, we tried your suggestion and it was helpful, thank you!
But as far as I know, the CoreDNS options cannot be set in the kubeadm configuration.
Is there no other way?
Try removing "allowPrivilegeEscalation: false" from the CoreDNS deployment to see if that helps.
Updating to a more recent version of docker (e.g. to the version recommended by k8s) may also help.
I verified that removing "allowPrivilegeEscalation: false" from the coredns deployment resolves the issue (with SELinux enabled in permissive mode).
I also verified that upgrading to a version of docker recommended by Kubernetes (docker 17.03) resolves the issue, with "allowPrivilegeEscalation: false" left in place in the coredns deployment, and SELinux enabled in permissive mode.
So, it appears there is an incompatibility between older versions of Docker and SELinux with the allowPrivilegeEscalation directive, which has apparently been resolved in later versions of Docker.
There appear to be 3 different workarounds: remove "allowPrivilegeEscalation: false" from the CoreDNS deployment; upgrade Docker to the version Kubernetes recommends (17.03); or disable SELinux entirely.
@chrisohaver I have resolved the issue by upgrading to the newer Docker version 17.03. Thanks!
thanks for the investigation @chrisohaver :100:
Thanks, @chrisohaver !
This worked:
kubectl -n kube-system get deployment coredns -o yaml | \
sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' | \
kubectl apply -f -
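An equivalent one-liner with kubectl patch, assuming coredns is the first (and only) container in the pod template:
kubectl -n kube-system patch deployment coredns --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/securityContext/allowPrivilegeEscalation","value":true}]'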
@chrisohaver
do you think we should document this step in the kubeadm troubleshooting guide for SELinux nodes, along the lines of:
coredns pods have CrashLoopBackOff or Error state
If you have nodes that are running SELinux with an older version of Docker you might experience a scenario where the coredns pods are not starting. To solve that you can try one of the following options:
Upgrade to a newer version of Docker.
Disable SELinux.
Modify the coredns deployment to set allowPrivilegeEscalation to true:
kubectl -n kube-system get deployment coredns -o yaml | \
  sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' | \
  kubectl apply -f -
WDYT? please suggest amends to the text if you think something can be improved.
That's fine. We should perhaps mention that there are negative security implications to disabling SELinux, or to changing the allowPrivilegeEscalation setting.
The most secure solution is to upgrade Docker to the version that Kubernetes recommends (17.03).
@chrisohaver
understood, will amend the copy and submit a PR for this.
There is also an answer for this on Stack Overflow:
https://stackoverflow.com/questions/53075796/coredns-pods-have-crashloopbackoff-or-error-state
This error
[FATAL] plugin/loop: Seen "HINFO IN 6900627972087569316.7905576541070882081." more than twice, loop detected
is caused when CoreDNS detects a loop in the resolve configuration, and it is the intended behavior. You are hitting this issue:
https://github.com/kubernetes/kubeadm/issues/1162
https://github.com/coredns/coredns/issues/2087
Hacky solution: Disable the CoreDNS loop detection
Edit the CoreDNS configmap:
kubectl -n kube-system edit configmap coredns
Remove or comment out the line with loop, save and exit.
Then remove the CoreDNS pods, so new ones can be created with new config:
kubectl -n kube-system delete pod -l k8s-app=kube-dns
All should be fine after that.
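To verify, something like the following should show the replacement pods going Running and the loop FATAL gone from the logs (the k8s-app=kube-dns label is the kubeadm default):
kubectl -n kube-system get pods -l k8s-app=kube-dns -w
kubectl -n kube-system logs -l k8s-app=kube-dns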
Preferred Solution: Remove the loop in the DNS configuration
First, check if you are using systemd-resolved. If you are running Ubuntu 18.04, it is probably the case.
systemctl list-unit-files | grep enabled | grep systemd-resolved
If it is, check which resolv.conf file your cluster is using as reference:
ps auxww | grep kubelet
You might see a line like:
/usr/bin/kubelet ... --resolv-conf=/run/systemd/resolve/resolv.conf
The important part is --resolv-conf: here we figure out whether the systemd resolv.conf is used, or not. If it is the resolv.conf of systemd, do the following:
Check the content of /run/systemd/resolve/resolv.conf to see if there is a record like:
nameserver 127.0.0.1
If there is 127.0.0.1, it is the one causing the loop.
To get rid of it, you should not edit that file directly; instead, check the other places from which it is generated.
Check all files under /etc/systemd/network and if you find a record like
DNS=127.0.0.1
delete that record. Also check /etc/systemd/resolved.conf
and do the same if needed. Make sure you have at least one or two DNS servers configured, such as
DNS=1.1.1.1 1.0.0.1
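Before restarting, a quick way to hunt down any remaining loopback entries across the locations named above:
grep -rn 'DNS=127\.0\.0\.1' /etc/systemd/network/ /etc/systemd/resolved.conf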
After doing all that, restart the systemd services to put your changes into effect:
systemctl restart systemd-networkd systemd-resolved
After that, verify that DNS=127.0.0.1 is no longer in the resolv.conf file:
cat /run/systemd/resolve/resolv.conf
Finally, trigger re-creation of the DNS pods:
kubectl -n kube-system delete pod -l k8s-app=kube-dns
Summary: The solution involves getting rid of what looks like a DNS lookup loop from the host DNS configuration. Steps vary between different resolv.conf managers/implementations.
Thanks. It's also covered in the CoreDNS loop plugin readme ...
I have the same problem, and another problem.
1. I cannot resolve DNS. The error is:
[ERROR] plugin/errors: 2 2115717704248378980.1120568170924441806. HINFO: unreachable backend: read udp 10.224.0.3:57088->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 2115717704248378980.1120568170924441806. HINFO: unreachable backend: read udp 10.224.0.3:38819->172.16.254.1:53: i/o timeout
........
My /etc/resolv.conf only has:
nameserver 172.16.254.1 # this is my DNS
nameserver 8.8.8.8 # another public DNS
I ran:
kubectl -n kube-system get deployment coredns -o yaml | \
sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' | \
kubectl apply -f -
Then the pods were rebuilt and there is only one error:
[ERROR] plugin/errors: 2 10594135170717325.8545646296733374240. HINFO: unreachable backend: no upstream host
I don't know if that's normal.
2. CoreDNS cannot reach my API service. The error is:
kube-dns Failed to list *v1.Endpoints getsockopt: 10.96.0.1:6443 api connection refused
CoreDNS restarts again and again, and at last it goes into CrashLoopBackOff.
So I have to run CoreDNS on the master node. I do that with:
kubectl edit deployment/coredns --namespace=kube-system
adding under spec.template.spec:
nodeSelector:
  node-role.kubernetes.io/master: ""
I don't know if that's normal either.
Finally, here is my environment:
Linux 4.20.10-1.el7.elrepo.x86_64 (CentOS 7)
Docker version: 18.09.3
[root@k8smaster00 ~]# docker image ls -a
REPOSITORY TAG IMAGE ID CREATED SIZE
k8s.gcr.io/kube-controller-manager v1.13.3 0482f6400933 6 weeks ago 146MB
k8s.gcr.io/kube-proxy v1.13.3 98db19758ad4 6 weeks ago 80.3MB
k8s.gcr.io/kube-apiserver v1.13.3 fe242e556a99 6 weeks ago 181MB
k8s.gcr.io/kube-scheduler v1.13.3 3a6f709e97a0 6 weeks ago 79.6MB
quay.io/coreos/flannel v0.11.0-amd64 ff281650a721 7 weeks ago 52.6MB
k8s.gcr.io/coredns 1.2.6 f59dcacceff4 4 months ago 40MB
k8s.gcr.io/etcd 3.2.24 3cab8e1b9802 6 months ago 220MB
k8s.gcr.io/pause 3.1 da86e6ba6ca1 15 months ago 742kB
Kubernetes is 1.13.3.
I think this is a bug. I expect an official update or a solution.
I have the same problem ...
@mengxifl, Those errors are significantly different than the ones reported and discussed in this issue.
[ERROR] plugin/errors: 2 2115717704248378980.1120568170924441806. HINFO: unreachable backend: read udp 10.224.0.3:57088->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 2115717704248378980.1120568170924441806. HINFO: unreachable backend: read udp 10.224.0.3:38819->172.16.254.1:53: i/o timeout
Those errors mean that the CoreDNS pod (and probably all other pods) cannot reach your nameservers. This suggests a networking problem in your cluster to the outside world. Possibly flannel misconfiguration or firewalls.
the coredns cannot found my api service ...
so i have to run coredns on master node
This is also not normal. If I understand you correctly, you are saying that CoreDNS can contact the API from the master node but not other nodes. This would suggest pod to service networking problems between nodes within your cluster - perhaps an issue with flannel configuration or firewalls.
Thank you for your reply. Maybe I should put up my YAML files.
I use:
kubeadm init --config=config.yaml
My config.yaml content is:
apiVersion: kubeadm.k8s.io/v1alpha3
kind: InitConfiguration
apiEndpoint:
advertiseAddress: "172.16.254.74"
bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1alpha3
kind: ClusterConfiguration
kubernetesVersion: "v1.13.3"
etcd:
external:
endpoints:
- "https://172.16.254.86:2379"
- "https://172.16.254.87:2379"
- "https://172.16.254.88:2379"
caFile: /etc/kubernetes/pki/etcd/ca.pem
certFile: /etc/kubernetes/pki/etcd/client.pem
keyFile: /etc/kubernetes/pki/etcd/client-key.pem
networking:
podSubnet: "10.224.0.0/16"
serviceSubnet: "10.96.0.0/12"
apiServerCertSANs:
- k8smaster00
- k8smaster01
- k8snode00
- k8snode01
- 172.16.254.74
- 172.16.254.79
- 172.16.254.80
- 172.16.254.81
- 172.16.254.85 #Vip
- 127.0.0.1
clusterName: "cluster"
controlPlaneEndpoint: "172.16.254.85:6443"
apiServerExtraArgs:
service-node-port-range: 20-65535
My flannel YAML is the default:
https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
systemctl status firewalld
All nodes say:
Unit firewalld.service could not be found.
cat /etc/sysconfig/iptables
All nodes say:
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -p tcp -m tcp --dport 1:65535 -j ACCEPT
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A OUTPUT -p tcp -m tcp --sport 1:65535 -j ACCEPT
-A FORWARD -p tcp -m tcp --dport 1:65535 -j ACCEPT
-A FORWARD -p tcp -m tcp --sport 1:65535 -j ACCEPT
COMMIT
cat /etc/resolv.conf & ping bing.com
All nodes say:
[1] 6330
nameserver 172.16.254.1
nameserver 8.8.8.8
PING bing.com (13.107.21.200) 56(84) bytes of data.
64 bytes from 13.107.21.200 (13.107.21.200): icmp_seq=2 ttl=111 time=149 ms
uname -rs
The master node says:
Linux 4.20.10-1.el7.elrepo.x86_64
uname -rs
The slave nodes say:
Linux 4.4.176-1.el7.elrepo.x86_64
So I don't think the firewall is the issue. Maybe flannel? But I use the default config. Or maybe the Linux kernel version, I don't know.
OK, I ran
/sbin/iptables -t nat -I POSTROUTING -s 10.224.0.0/16 -j MASQUERADE
on all my nodes and that works for me. Thanks.
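Note that a rule added that way does not survive a reboot; on CentOS 7 one way to persist it (assuming the iptables-services package is installed) is:
iptables-save > /etc/sysconfig/iptables   # or: service iptables save
Flannel normally manages this MASQUERADE rule itself, so needing to add it by hand may point at a flannel misconfiguration.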