Kubernetes: Some nodes are not considered in scheduling when there is zone imbalance

Created on 30 May 2020  ·  129 Comments  ·  Source: kubernetes/kubernetes

What happened: We upgraded 15 Kubernetes clusters from 1.17.5 to 1.18.2/1.18.3 and started to see that daemonsets no longer work properly.

The problem is that not all daemonset pods are provisioned. The pending pod gets the following error message in its events:

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  9s (x5 over 71s)  default-scheduler  0/13 nodes are available: 12 node(s) didn't match node selector.

However, all nodes are available and the daemonset does not have a node selector. The nodes do not have taints either.

daemonset https://gist.github.com/zetaab/4a605cb3e15e349934cb7db29ec72bd8

% kubectl get nodes
NAME                                   STATUS   ROLES    AGE   VERSION
e2etest-1-kaasprod-k8s-local           Ready    node     46h   v1.18.3
e2etest-2-kaasprod-k8s-local           Ready    node     46h   v1.18.3
e2etest-3-kaasprod-k8s-local           Ready    node     44h   v1.18.3
e2etest-4-kaasprod-k8s-local           Ready    node     44h   v1.18.3
master-zone-1-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
master-zone-2-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
master-zone-3-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
nodes-z1-1-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z1-2-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z2-1-kaasprod-k8s-local          Ready    node     46h   v1.18.3
nodes-z2-2-kaasprod-k8s-local          Ready    node     46h   v1.18.3
nodes-z3-1-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z3-2-kaasprod-k8s-local          Ready    node     46h   v1.18.3

% kubectl get pods -n weave -l weave-scope-component=agent -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP           NODE                                   NOMINATED NODE   READINESS GATES
weave-scope-agent-2drzw   1/1     Running   0          26h     10.1.32.23   e2etest-1-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-4kpxc   1/1     Running   3          26h     10.1.32.12   nodes-z1-2-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-78n7r   1/1     Running   0          26h     10.1.32.7    e2etest-4-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-9m4n8   1/1     Running   0          26h     10.1.96.4    master-zone-1-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-b2gnk   1/1     Running   1          26h     10.1.96.12   master-zone-3-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-blwtx   1/1     Running   2          26h     10.1.32.20   nodes-z1-1-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-cbhjg   1/1     Running   0          26h     10.1.64.15   e2etest-2-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-csp49   1/1     Running   0          26h     10.1.96.14   e2etest-3-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-g4k2x   1/1     Running   1          26h     10.1.64.10   nodes-z2-2-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-kx85h   1/1     Running   2          26h     10.1.96.6    nodes-z3-1-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-lllqc   0/1     Pending   0          5m56s   <none>       <none>                                 <none>           <none>
weave-scope-agent-nls2h   1/1     Running   0          26h     10.1.96.17   master-zone-2-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-p8njs   1/1     Running   2          26h     10.1.96.19   nodes-z3-2-kaasprod-k8s-local          <none>           <none>

I have tried restarting the apiservers, schedulers, and controller-managers, but it does not help. I have also tried restarting the single node that is stuck (nodes-z2-1-kaasprod-k8s-local), but that does not help either. Only deleting the node and recreating it helps.
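(To spell out the delete/recreate workaround, it is roughly the following; the assumption, not stated above, is that the kops instance group then provisions a replacement machine:)

kubectl delete node nodes-z2-1-kaasprod-k8s-local
openstack server delete <instance-id-of-that-node>   # illustrative; any way of terminating the instance works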

% kubectl describe node nodes-z2-1-kaasprod-k8s-local
Name:               nodes-z2-1-kaasprod-k8s-local
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=59cf4871-de1b-4294-9e9f-2ea7ca4b771f
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=zone-2
                    kops.k8s.io/instancegroup=nodes-z2
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=nodes-z2-1-kaasprod-k8s-local
                    kubernetes.io/os=linux
                    kubernetes.io/role=node
                    node-role.kubernetes.io/node=
                    node.kubernetes.io/instance-type=59cf4871-de1b-4294-9e9f-2ea7ca4b771f
                    topology.cinder.csi.openstack.org/zone=zone-2
                    topology.kubernetes.io/region=regionOne
                    topology.kubernetes.io/zone=zone-2
Annotations:        csi.volume.kubernetes.io/nodeid: {"cinder.csi.openstack.org":"faf14d22-010f-494a-9b34-888bdad1d2df"}
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.1.64.32/19
                    projectcalico.org/IPv4IPIPTunnelAddr: 100.98.136.0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 28 May 2020 13:28:24 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  nodes-z2-1-kaasprod-k8s-local
  AcquireTime:     <unset>
  RenewTime:       Sat, 30 May 2020 12:02:13 +0300
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 29 May 2020 09:40:51 +0300   Fri, 29 May 2020 09:40:51 +0300   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.1.64.32
  Hostname:    nodes-z2-1-kaasprod-k8s-local
Capacity:
  cpu:                4
  ephemeral-storage:  10287360Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8172420Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  9480830961
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8070020Ki
  pods:               110
System Info:
  Machine ID:                 c94284656ff04cf090852c1ddee7bcc2
  System UUID:                faf14d22-010f-494a-9b34-888bdad1d2df
  Boot ID:                    295dc3d9-0a90-49ee-92f3-9be45f2f8e3d
  Kernel Version:             4.19.0-8-cloud-amd64
  OS Image:                   Debian GNU/Linux 10 (buster)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.8
  Kubelet Version:            v1.18.3
  Kube-Proxy Version:         v1.18.3
PodCIDR:                      100.96.12.0/24
PodCIDRs:                     100.96.12.0/24
ProviderID:                   openstack:///faf14d22-010f-494a-9b34-888bdad1d2df
Non-terminated Pods:          (3 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-77pqs                           100m (2%)     200m (5%)   100Mi (1%)       100Mi (1%)     46h
  kube-system                 kube-proxy-nodes-z2-1-kaasprod-k8s-local    100m (2%)     200m (5%)   100Mi (1%)       100Mi (1%)     46h
  volume                      csi-cinder-nodeplugin-5jbvl                 100m (2%)     400m (10%)  200Mi (2%)       200Mi (2%)     46h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                300m (7%)   800m (20%)
  memory             400Mi (5%)  400Mi (5%)
  ephemeral-storage  0 (0%)      0 (0%)
Events:
  Type    Reason                   Age    From                                    Message
  ----    ------                   ----   ----                                    -------
  Normal  Starting                 7m27s  kubelet, nodes-z2-1-kaasprod-k8s-local  Starting kubelet.
  Normal  NodeHasSufficientMemory  7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Updated Node Allocatable limit across pods

We are seeing this randomly in all of our clusters.

What you expected to happen: I expect the daemonset to provision a pod on every node.

How to reproduce it (as minimally and precisely as possible): No idea really; install Kubernetes 1.18.x, deploy a daemonset, and then wait for days(?)

Anything else we need to know?: When this happens, we cannot provision any other daemonsets to that node either. As you can see, the logging fluent-bit pod is also missing. I cannot see any errors in that node's kubelet logs and, as said, restarting does not help.
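(For reference, the kubelet logs on these Debian/systemd nodes would typically be checked with something like the following; the exact unit name and time window are assumptions:)

journalctl -u kubelet --since "2 days ago" | grep -i error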

% kubectl get ds --all-namespaces
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
falco         falco-daemonset            13        13        12      13           12          <none>                            337d
kube-system   audit-webhook-deployment   3         3         3       3            3           node-role.kubernetes.io/master=   174d
kube-system   calico-node                13        13        13      13           13          kubernetes.io/os=linux            36d
kube-system   kops-controller            3         3         3       3            3           node-role.kubernetes.io/master=   193d
kube-system   metricbeat                 6         6         5       6            5           <none>                            35d
kube-system   openstack-cloud-provider   3         3         3       3            3           node-role.kubernetes.io/master=   337d
logging       fluent-bit                 13        13        12      13           12          <none>                            337d
monitoring    node-exporter              13        13        12      13           12          kubernetes.io/os=linux            58d
volume        csi-cinder-nodeplugin      6         6         6       6            6           <none>                            239d
weave         weave-scope-agent          13        13        12      13           12          <none>                            193d
weave         weavescope-iowait-plugin   6         6         5       6            5           <none>                            193d

As you can see, most of the daemonsets are missing one pod.

Environment:

  • Kubernetes version (use kubectl version): 1.18.3
  • Cloud provider or hardware configuration: openstack
  • OS (e.g: cat /etc/os-release): debian buster
  • Kernel (e.g. uname -a): Linux nodes-z2-1-kaasprod-k8s-local 4.19.0-8-cloud-amd64 #1 SMP Debian 4.19.98-1+deb10u1 (2020-04-27) x86_64 GNU/Linux
  • Install tools: kops
  • Network plugin and version (if this is a network-related bug): calico
  • Others:
help wanted  kind/bug  priority/important-soon  sig/scheduling

Most helpful comment

I am now working on adding a test case for the snapshot, to make sure this is properly tested.

All 129 comments

/sig scheduling

Can you provide the full yaml of the node, daemonset, an example pod, and the containing namespace as retrieved from the server?

DaemonSet pods schedule with a nodeAffinity selector that only matches a single node, so the "12 out of 13 didn't match" message is expected.
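For reference, the nodeAffinity the DaemonSet controller injects into each pod looks roughly like this (a sketch; the actual node name is per-pod, e.g. the stuck node from this report):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - nodes-z2-1-kaasprod-k8s-local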

I don't see a reason why the scheduler would be unhappy with the pod/node combo… there are no ports that could conflict in the pod spec, the node is not unschedulable or tainted, and it has sufficient resources.

Okay, I restarted all 3 schedulers (changed the log level to 4 to see if something interesting shows up there). However, it fixed the issue:

% kubectl get ds --all-namespaces
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
falco         falco-daemonset            13        13        13      13           13          <none>                            338d
kube-system   audit-webhook-deployment   3         3         3       3            3           node-role.kubernetes.io/master=   175d
kube-system   calico-node                13        13        13      13           13          kubernetes.io/os=linux            36d
kube-system   kops-controller            3         3         3       3            3           node-role.kubernetes.io/master=   194d
kube-system   metricbeat                 6         6         6       6            6           <none>                            36d
kube-system   openstack-cloud-provider   3         3         3       3            3           node-role.kubernetes.io/master=   338d
logging       fluent-bit                 13        13        13      13           13          <none>                            338d
monitoring    node-exporter              13        13        13      13           13          kubernetes.io/os=linux            59d
volume        csi-cinder-nodeplugin      6         6         6       6            6           <none>                            239d
weave         weave-scope-agent          13        13        13      13           13          <none>                            194d
weave         weavescope-iowait-plugin   6         6         6       6            6           <none>                            194d

Now all daemonsets are provisioned correctly. Weird; anyway, it seems something is wrong with the scheduler.

cc @kubernetes/sig-scheduling-bugs @ahg-g

We see the same issue on v1.18.3: one node cannot get daemonset pods scheduled onto it.
Restarting the scheduler helps.

[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get pod -A|grep Pending
kube-system   coredns-vc5ws                                                 0/1     Pending   0          2d16h
kube-system   local-volume-provisioner-mwk88                                0/1     Pending   0          2d16h
kube-system   svcwatcher-ltqb6                                              0/1     Pending   0          2d16h
ncms          bcmt-api-hfzl6                                                0/1     Pending   0          2d16h
ncms          bcmt-yum-repo-589d8bb756-5zbvh                                0/1     Pending   0          2d16h
[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get ds -A
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-system   coredns                    3         3         2       3            2           is_control=true                 2d16h
kube-system   danmep-cleaner             0         0         0       0            0           cbcs.nokia.com/danm_node=true   2d16h
kube-system   kube-proxy                 8         8         8       8            8           <none>                          2d16h
kube-system   local-volume-provisioner   8         8         7       8            7           <none>                          2d16h
kube-system   netwatcher                 0         0         0       0            0           cbcs.nokia.com/danm_node=true   2d16h
kube-system   sriov-device-plugin        0         0         0       0            0           sriov=enabled                   2d16h
kube-system   svcwatcher                 3         3         2       3            2           is_control=true                 2d16h
ncms          bcmt-api                   3         3         0       3            0           is_control=true                 2d16h
[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get node
NAME                                  STATUS   ROLES    AGE     VERSION
tesla-cb0434-csfp1-csfp1-control-01   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-control-02   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-control-03   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-edge-01      Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-edge-02      Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-01    Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-02    Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-03    Ready    <none>   2d16h   v1.18.3

Hard to debug without knowing how to reproduce. Do you have the scheduler logs, by any chance, for the pod that failed to schedule?
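(If the scheduler runs as a static pod in kube-system, something along these lines should pull its logs; the pod name and grep target below are illustrative:)

kubectl -n kube-system get pods | grep kube-scheduler
kubectl -n kube-system logs kube-scheduler-tesla-cb0434-csfp1-csfp1-control-01 | grep coredns-vc5ws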

Okay, I restarted all 3 schedulers

I assume only one of them is named default-scheduler, correct?

changed the log level to 4 to see if something interesting shows up there

Can you share what you noticed?

I set the log level to 9, but it seems there is nothing more interesting; the logs below are just looping.

I0601 01:45:05.039373       1 generic_scheduler.go:290] Preemption will not help schedule pod kube-system/coredns-vc5ws on any node.
I0601 01:45:05.039437       1 factory.go:462] Unable to schedule kube-system/coredns-vc5ws: no fit: 0/8 nodes are available: 7 node(s) didn't match node selector.; waiting
I0601 01:45:05.039494       1 scheduler.go:776] Updating pod condition for kube-system/coredns-vc5ws to (PodScheduled==False, Reason=Unschedulable)

Yeah, I could not see anything more than the same line:

no fit: 0/8 nodes are available: 7 node(s) didn't match node selector.; waiting

The strange thing is that the log message shows the result for only 7 nodes, like the issue reported in https://github.com/kubernetes/kubernetes/issues/91340

/cc @damemi

@ahg-g this does look like the same issue I reported there. If I had to guess, it seems like we either have a filter plugin that doesn't always report its error, or some other condition that is failing silently.

Note that in my issue, restarting the scheduler also fixed it (as mentioned in this thread too https://github.com/kubernetes/kubernetes/issues/91601#issuecomment-636360092)

Mine was also about a daemonset, so I think this is a duplicate. If that's the case we can close this and continue discussion in https://github.com/kubernetes/kubernetes/issues/91340

Anyway, the scheduler needs a more verbose logging option; it's impossible to debug these issues if there are no logs about what it is doing.

@zetaab +1, the scheduler could use significant improvements to its current logging abilities. That's an upgrade I've been meaning to tackle for a while and I've finally opened an issue for it here: https://github.com/kubernetes/kubernetes/issues/91633

/assign

I'm looking into this. A few questions to help me narrow down the case. I haven't been able to reproduce it yet.

  • What was created first: the daemonset or the node?
  • Are you using the default profile?
  • Do you have extenders?

The nodes were created before the daemonset.
I suppose we used the default profile; which profile do you mean and how do I check?
No extenders.

    command:
    - /usr/local/bin/kube-scheduler
    - --address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig
    - --profiling=false
    - --v=1

Another thing that may have an impact: the disk performance is not very good for etcd, and etcd complains about slow operations.

Yes, those flags make the scheduler run with the default profile. I'll continue looking; I still couldn't reproduce it.
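(For context: a non-default profile would only come from a KubeSchedulerConfiguration file passed via --config, which is absent here. A minimal sketch of such a file for the 1.18 v1alpha2 config API, with illustrative paths, would be:)

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/kube-scheduler.kubeconfig
profiles:
- schedulerName: default-scheduler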

Still nothing... Is there anything else you are using that you think could have an impact? Taints, ports, other resources?

I made some tests related to this. When the issue is present, pods can still be scheduled to the node (with no placement constraints, or with "nodeName").

If trying to use affinity/anti-affinity, the pods do not get scheduled to the node.

Working while the issue is present:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  nodeName: master-zone-3-1-1-test-cluster-k8s-local
  containers:
    - image: nginx
      name: nginx
      resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always

Not working at the same time:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - master-zone-3-1-1-test-cluster-k8s-local
  containers:
    - image: nginx
      name: nginx
      resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always

Also, when I checked the latter's events, they were quite interesting:

Warning  FailedScheduling  4m37s (x17 over 26m)  default-scheduler  0/9 nodes are available: 8 node(s) didn't match node selector.
Warning  FailedScheduling  97s (x6 over 3m39s)   default-scheduler  0/8 nodes are available: 8 node(s) didn't match node selector.
Warning  FailedScheduling  53s                   default-scheduler  0/8 nodes are available: 8 node(s) didn't match node selector.
Warning  FailedScheduling  7s (x5 over 32s)      default-scheduler  0/9 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 7 node(s) didn't match node selector.
  • The first event is from when the manifest had just been applied (nothing done to the non-schedulable node).
  • The second and third were from when the node was removed with kubectl and then restarted.
  • The fourth came when the node came back up. The node that had the issue was a master, so the pod was not going there anyway (but it shows that the node was not found in the 3 earlier events). The interesting thing about the fourth event is that information from one node is still missing: the event says 0/9 nodes are available, but reasons are given for only 8.

"nodeName" is not a selector. Using nodeName would bypass scheduling.

The fourth came when the node came back up. The node that had the issue was a master, so the pod was not going there anyway (but it shows that the node was not found in the 3 earlier events). The interesting thing about the fourth event is that information from one node is still missing: the event says 0/9 nodes are available, but reasons are given for only 8.

You are saying that the reason why the pod shouldn't have been scheduled in the missing node is because it was a master?

We are seeing 8 node(s) didn't match node selector going to 7. I assume no nodes were removed at this point, correct?

"nodeName" is not a selector. Using nodeName would bypass scheduling.

"NodeName" try was to highligh, that node is usable and pod gets there if wanted. So thing is not node's unability to start pods.

The fourth came when the node came back up. The node that had the issue was a master, so the pod was not going there anyway (but it shows that the node was not found in the 3 earlier events). The interesting thing about the fourth event is that information from one node is still missing: the event says 0/9 nodes are available, but reasons are given for only 8.

You are saying that the reason why the pod shouldn't have been scheduled in the missing node is because it was a master?

We are seeing 8 node(s) didn't match node selector going to 7. I assume no nodes were removed at this point, correct?

The test cluster has 9 nodes: 3 masters and 6 workers. Before the non-working node was successfully started, the events reported on all available nodes: "0/8 nodes are available: 8 node(s) didn't match node selector." But when the node that would match the node selector came up, the event said "0/9 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 7 node(s) didn't match node selector." The explanation says 8 nodes are not matching, but says nothing about the ninth (which was acknowledged in the previous event).

So the event timeline:

  • 1st event: 9 nodes available, the error noticed with the daemonset
  • 2nd and 3rd events: 8 nodes available; the one that was not receiving the pod was restarting
  • 4th event: 9 nodes available (so the node that was restarted came back up)

In the end the test pod wasn't started on the matching node because of taints, but that's another story (and should have been the case already at the 1st event).
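(For completeness, the test pod would have needed a toleration like the following sketch to land on a master; this was not part of the test:)

tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists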

"NodeName" try was to highligh, that node is usable and pod gets there if wanted. So thing is not node's unability to start pods.

Note that nothing but the scheduler guards against over-committing a node, so this doesn't really show much.

In the end the test pod wasn't started on the matching node because of taints, but that's another story (and should have been the case already at the 1st event).

My question is: was the 9th node tainted from the beginning? I'm trying to look for (1) reproducible steps to reach the state or (2) where the bug could be.

My question is: was the 9th node tainted from the beginning? I'm trying to look for (1) reproducible steps to reach the state or (2) where the bug could be.

Yes, the taint was there all the time in this case, as the non-receiving node was a master. But we have seen the same issue on both masters and workers.

Still no idea where the issue comes from, just that at least recreating the node and restarting the node seem to fix it. But those are rather "hard" ways to fix things.

Long shot, but if you run into it again... could you check whether any pods are nominated to the node that doesn't show up?
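(One way to check, using custom-columns output; pods without a nomination show <none>:)

kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,NOMINATED:.status.nominatedNodeName | grep -v '<none>'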

I'm posting questions as I think of possible scenarios:

  • Do you have other master nodes in your cluster?
  • Do you have extenders?
* Do you have other master nodes in your cluster?

All clusters have 3 masters (so restarting those is easy)

* Do you have extenders?

No.

One interesting thing I noticed today: I had a cluster where one master was not receiving its pod from a DaemonSet. We have ChaosMonkey in use, and it terminated one of the worker nodes. Interestingly, this made the pod go to the master that was not receiving it earlier. So somehow the removal of a node other than the problematic one seemed to fix the issue at that point.

Because of that "fix" I have to wait problem to reoccur to be able to answer about the nominated pods.

I'm confused now... Does your daemonset tolerate the taint for master nodes? In other words... is the bug for you just the scheduling event or also the fact that the pods should have been scheduled?

The issue is that the node is not found by the scheduler even though there is at least one matching affinity (or anti-affinity) setting.

That's why I said that the taint error is expected and should have been there already at the first event (as the taint is not part of the affinity criteria).

Understood. I was trying to confirm your setup to make sure I'm not missing something.

I don't think the node is "unseen" by the scheduler. Given that we see 0/9 nodes are available, we can conclude that the node is indeed in the cache. It's more like the unschedulable reason is lost somewhere, so we don't include it in the event.

True, the total count always matches the actual node count. It's just that the more descriptive event text is not given for all nodes, but that can be a separate issue as you mentioned.

Are you able to look at your kube-scheduler logs? Anything that seems relevant?

I think @zetaab tried to look for that without success. I can try when the issue occurs again (as well as the nominated pods thing asked about earlier).

If possible, also run 1.18.5, in case we inadvertently fixed the issue.

I am able to reproduce this reliably on my test cluster if you need any more logs

@dilyevsky Please share repro steps. Can you somehow identify what is the filter that is failing?

It appears to be just the metadata.name of the node for the ds pod... weird. Here's the pod yaml:

Pod yaml:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: "2020-07-09T23:17:53Z"
  generateName: cilium-
  labels:
    controller-revision-hash: 6c94db8bb8
    k8s-app: cilium
    pod-template-generation: "1"
  managedFields:
    # managed fields crap
  name: cilium-d5n4f
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: cilium
    uid: 0f00e8af-eb19-4985-a940-a02fa84fcbc5
  resourceVersion: "2840"
  selfLink: /api/v1/namespaces/kube-system/pods/cilium-d5n4f
  uid: e3f7d566-ee5b-4557-8d1b-f0964cde2f22
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - us-central1-dilyevsky-master-qmwnl
  containers:
  - args:
    - --config-dir=/tmp/cilium/config-map
    command:
    - cilium-agent
    env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: CILIUM_K8S_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: CILIUM_FLANNEL_MASTER_DEVICE
      valueFrom:
        configMapKeyRef:
          key: flannel-master-device
          name: cilium-config
          optional: true
    - name: CILIUM_FLANNEL_UNINSTALL_ON_EXIT
      valueFrom:
        configMapKeyRef:
          key: flannel-uninstall-on-exit
          name: cilium-config
          optional: true
    - name: CILIUM_CLUSTERMESH_CONFIG
      value: /var/lib/cilium/clustermesh/
    - name: CILIUM_CNI_CHAINING_MODE
      valueFrom:
        configMapKeyRef:
          key: cni-chaining-mode
          name: cilium-config
          optional: true
    - name: CILIUM_CUSTOM_CNI_CONF
      valueFrom:
        configMapKeyRef:
          key: custom-cni-conf
          name: cilium-config
          optional: true
    image: docker.io/cilium/cilium:v1.7.6
    imagePullPolicy: IfNotPresent
    lifecycle:
      postStart:
        exec:
          command:
          - /cni-install.sh
          - --enable-debug=false
      preStop:
        exec:
          command:
          - /cni-uninstall.sh
    livenessProbe:
      exec:
        command:
        - cilium
        - status
        - --brief
      failureThreshold: 10
      initialDelaySeconds: 120
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    name: cilium-agent
    readinessProbe:
      exec:
        command:
        - cilium
        - status
        - --brief
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - SYS_MODULE
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/cilium
      name: cilium-run
    - mountPath: /host/opt/cni/bin
      name: cni-path
    - mountPath: /host/etc/cni/net.d
      name: etc-cni-netd
    - mountPath: /var/lib/cilium/clustermesh
      name: clustermesh-secrets
      readOnly: true
    - mountPath: /tmp/cilium/config-map
      name: cilium-config-path
      readOnly: true
    - mountPath: /lib/modules
      name: lib-modules
      readOnly: true
    - mountPath: /run/xtables.lock
      name: xtables-lock
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: cilium-token-j74lr
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  initContainers:
  - command:
    - /init-container.sh
    env:
    - name: CILIUM_ALL_STATE
      valueFrom:
        configMapKeyRef:
          key: clean-cilium-state
          name: cilium-config
          optional: true
    - name: CILIUM_BPF_STATE
      valueFrom:
        configMapKeyRef:
          key: clean-cilium-bpf-state
          name: cilium-config
          optional: true
    - name: CILIUM_WAIT_BPF_MOUNT
      valueFrom:
        configMapKeyRef:
          key: wait-bpf-mount
          name: cilium-config
          optional: true
    image: docker.io/cilium/cilium:v1.7.6
    imagePullPolicy: IfNotPresent
    name: clean-cilium-state
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/cilium
      name: cilium-run
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: cilium-token-j74lr
      readOnly: true
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: cilium
  serviceAccountName: cilium
  terminationGracePeriodSeconds: 1
  tolerations:
  - operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/run/cilium
      type: DirectoryOrCreate
    name: cilium-run
  - hostPath:
      path: /opt/cni/bin
      type: DirectoryOrCreate
    name: cni-path
  - hostPath:
      path: /etc/cni/net.d
      type: DirectoryOrCreate
    name: etc-cni-netd
  - hostPath:
      path: /lib/modules
      type: ""
    name: lib-modules
  - hostPath:
      path: /run/xtables.lock
      type: FileOrCreate
    name: xtables-lock
  - name: clustermesh-secrets
    secret:
      defaultMode: 420
      optional: true
      secretName: cilium-clustermesh
  - configMap:
      defaultMode: 420
      name: cilium-config
    name: cilium-config-path
  - name: cilium-token-j74lr
    secret:
      defaultMode: 420
      secretName: cilium-token-j74lr
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-07-09T23:17:53Z"
    message: '0/6 nodes are available: 5 node(s) didn''t match node selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort

The way I reproduce this is by spinning up a new cluster with 3 masters and 3 worker nodes (using Cluster API) and applying Cilium 1.7.6:

Cilium yaml:

---
# Source: cilium/charts/agent/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cilium
  namespace: kube-system
---
# Source: cilium/charts/operator/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cilium-operator
  namespace: kube-system
---
# Source: cilium/charts/config/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:

  # Identity allocation mode selects how identities are shared between cilium
  # nodes by setting how they are stored. The options are "crd" or "kvstore".
  # - "crd" stores identities in kubernetes as CRDs (custom resource definition).
  #   These can be queried with:
  #     kubectl get ciliumid
  # - "kvstore" stores identities in a kvstore, etcd or consul, that is
  #   configured below. Cilium versions before 1.6 supported only the kvstore
  #   backend. Upgrades from these older cilium versions should continue using
  #   the kvstore by commenting out the identity-allocation-mode below, or
  #   setting it to "kvstore".
  identity-allocation-mode: crd

  # If you want to run cilium in debug mode change this value to true
  debug: "false"

  # Enable IPv4 addressing. If enabled, all endpoints are allocated an IPv4
  # address.
  enable-ipv4: "true"

  # Enable IPv6 addressing. If enabled, all endpoints are allocated an IPv6
  # address.
  enable-ipv6: "false"

  # If you want cilium monitor to aggregate tracing for packets, set this level
  # to "low", "medium", or "maximum". The higher the level, the less packets
  # that will be seen in monitor output.
  monitor-aggregation: medium

  # The monitor aggregation interval governs the typical time between monitor
  # notification events for each allowed connection.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-interval: 5s

  # The monitor aggregation flags determine which TCP flags which, upon the
  # first observation, cause monitor notifications to be generated.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-flags: all

  # ct-global-max-entries-* specifies the maximum number of connections
  # supported across all endpoints, split by protocol: tcp or other. One pair
  # of maps uses these values for IPv4 connections, and another pair of maps
  # use these values for IPv6 connections.
  #
  # If these values are modified, then during the next Cilium startup the
  # tracking of ongoing connections may be disrupted. This may lead to brief
  # policy drops or a change in loadbalancing decisions for a connection.
  #
  # For users upgrading from Cilium 1.2 or earlier, to minimize disruption
  # during the upgrade process, comment out these options.
  bpf-ct-global-tcp-max: "524288"
  bpf-ct-global-any-max: "262144"

  # bpf-policy-map-max specified the maximum number of entries in endpoint
  # policy map (per endpoint)
  bpf-policy-map-max: "16384"

  # Pre-allocation of map entries allows per-packet latency to be reduced, at
  # the expense of up-front memory allocation for the entries in the maps. The
  # default value below will minimize memory usage in the default installation;
  # users who are sensitive to latency may consider setting this to "true".
  #
  # This option was introduced in Cilium 1.4. Cilium 1.3 and earlier ignore
  # this option and behave as though it is set to "true".
  #
  # If this value is modified, then during the next Cilium startup the restore
  # of existing endpoints and tracking of ongoing connections may be disrupted.
  # This may lead to policy drops or a change in loadbalancing decisions for a
  # connection for some time. Endpoints may need to be recreated to restore
  # connectivity.
  #
  # If this option is set to "false" during an upgrade from 1.3 or earlier to
  # 1.4 or later, then it may cause one-time disruptions during the upgrade.
  preallocate-bpf-maps: "false"

  # Regular expression matching compatible Istio sidecar istio-proxy
  # container image names
  sidecar-istio-proxy-image: "cilium/istio_proxy"

  # Encapsulation mode for communication between nodes
  # Possible values:
  #   - disabled
  #   - vxlan (default)
  #   - geneve
  tunnel: vxlan

  # Name of the cluster. Only relevant when building a mesh of clusters.
  cluster-name: default

  # DNS Polling periodically issues a DNS lookup for each `matchName` from
  # cilium-agent. The result is used to regenerate endpoint policy.
  # DNS lookups are repeated with an interval of 5 seconds, and are made for
  # A(IPv4) and AAAA(IPv6) addresses. Should a lookup fail, the most recent IP
  # data is used instead. An IP change will trigger a regeneration of the Cilium
  # policy for each endpoint and increment the per cilium-agent policy
  # repository revision.
  #
  # This option is disabled by default starting from version 1.4.x in favor
  # of a more powerful DNS proxy-based implementation, see [0] for details.
  # Enable this option if you want to use FQDN policies but do not want to use
  # the DNS proxy.
  #
  # To ease upgrade, users may opt to set this option to "true".
  # Otherwise please refer to the Upgrade Guide [1] which explains how to
  # prepare policy rules for upgrade.
  #
  # [0] http://docs.cilium.io/en/stable/policy/language/#dns-based
  # [1] http://docs.cilium.io/en/stable/install/upgrade/#changes-that-may-require-action
  tofqdns-enable-poller: "false"

  # wait-bpf-mount makes init container wait until bpf filesystem is mounted
  wait-bpf-mount: "false"

  masquerade: "true"
  enable-xt-socket-fallback: "true"
  install-iptables-rules: "true"
  auto-direct-node-routes: "false"
  kube-proxy-replacement:  "probe"
  enable-host-reachable-services: "false"
  enable-external-ips: "false"
  enable-node-port: "false"
  node-port-bind-protection: "true"
  enable-auto-protect-node-port-range: "true"
  enable-endpoint-health-checking: "true"
  enable-well-known-identities: "false"
  enable-remote-node-identity: "true"
---
# Source: cilium/charts/agent/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium
rules:
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  - services
  - nodes
  - endpoints
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
  - get
  - list
  - watch
  - update
- apiGroups:
  - cilium.io
  resources:
  - ciliumnetworkpolicies
  - ciliumnetworkpolicies/status
  - ciliumclusterwidenetworkpolicies
  - ciliumclusterwidenetworkpolicies/status
  - ciliumendpoints
  - ciliumendpoints/status
  - ciliumnodes
  - ciliumnodes/status
  - ciliumidentities
  - ciliumidentities/status
  verbs:
  - '*'
---
# Source: cilium/charts/operator/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium-operator
rules:
- apiGroups:
  - ""
  resources:
  # to automatically delete [core|kube]dns pods so that are starting to being
  # managed by Cilium
  - pods
  verbs:
  - get
  - list
  - watch
  - delete
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  # to automatically read from k8s and import the node's pod CIDR to cilium's
  # etcd so all nodes know how to reach another pod running in in a different
  # node.
  - nodes
  # to perform the translation of a CNP that contains `ToGroup` to its endpoints
  - services
  - endpoints
  # to check apiserver connectivity
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - cilium.io
  resources:
  - ciliumnetworkpolicies
  - ciliumnetworkpolicies/status
  - ciliumclusterwidenetworkpolicies
  - ciliumclusterwidenetworkpolicies/status
  - ciliumendpoints
  - ciliumendpoints/status
  - ciliumnodes
  - ciliumnodes/status
  - ciliumidentities
  - ciliumidentities/status
  verbs:
  - '*'
---
# Source: cilium/charts/agent/templates/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cilium
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cilium
subjects:
- kind: ServiceAccount
  name: cilium
  namespace: kube-system
---
# Source: cilium/charts/operator/templates/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cilium-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cilium-operator
subjects:
- kind: ServiceAccount
  name: cilium-operator
  namespace: kube-system
---
# Source: cilium/charts/agent/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: cilium
  name: cilium
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: cilium
  template:
    metadata:
      annotations:
        # This annotation plus the CriticalAddonsOnly toleration makes
        # cilium to be a critical pod in the cluster, which ensures cilium
        # gets priority scheduling.
        # https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        k8s-app: cilium
    spec:
      containers:
      - args:
        - --config-dir=/tmp/cilium/config-map
        command:
        - cilium-agent
        livenessProbe:
          exec:
            command:
            - cilium
            - status
            - --brief
          failureThreshold: 10
          # The initial delay for the liveness probe is intentionally large to
          # avoid an endless kill & restart cycle if in the event that the initial
          # bootstrapping takes longer than expected.
          initialDelaySeconds: 120
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - cilium
            - status
            - --brief
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        env:
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CILIUM_K8S_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: CILIUM_FLANNEL_MASTER_DEVICE
          valueFrom:
            configMapKeyRef:
              key: flannel-master-device
              name: cilium-config
              optional: true
        - name: CILIUM_FLANNEL_UNINSTALL_ON_EXIT
          valueFrom:
            configMapKeyRef:
              key: flannel-uninstall-on-exit
              name: cilium-config
              optional: true
        - name: CILIUM_CLUSTERMESH_CONFIG
          value: /var/lib/cilium/clustermesh/
        - name: CILIUM_CNI_CHAINING_MODE
          valueFrom:
            configMapKeyRef:
              key: cni-chaining-mode
              name: cilium-config
              optional: true
        - name: CILIUM_CUSTOM_CNI_CONF
          valueFrom:
            configMapKeyRef:
              key: custom-cni-conf
              name: cilium-config
              optional: true
        image: "docker.io/cilium/cilium:v1.7.6"
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
              - "/cni-install.sh"
              - "--enable-debug=false"
          preStop:
            exec:
              command:
              - /cni-uninstall.sh
        name: cilium-agent
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - SYS_MODULE
          privileged: true
        volumeMounts:
        - mountPath: /var/run/cilium
          name: cilium-run
        - mountPath: /host/opt/cni/bin
          name: cni-path
        - mountPath: /host/etc/cni/net.d
          name: etc-cni-netd
        - mountPath: /var/lib/cilium/clustermesh
          name: clustermesh-secrets
          readOnly: true
        - mountPath: /tmp/cilium/config-map
          name: cilium-config-path
          readOnly: true
          # Needed to be able to load kernel modules
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
      hostNetwork: true
      initContainers:
      - command:
        - /init-container.sh
        env:
        - name: CILIUM_ALL_STATE
          valueFrom:
            configMapKeyRef:
              key: clean-cilium-state
              name: cilium-config
              optional: true
        - name: CILIUM_BPF_STATE
          valueFrom:
            configMapKeyRef:
              key: clean-cilium-bpf-state
              name: cilium-config
              optional: true
        - name: CILIUM_WAIT_BPF_MOUNT
          valueFrom:
            configMapKeyRef:
              key: wait-bpf-mount
              name: cilium-config
              optional: true
        image: "docker.io/cilium/cilium:v1.7.6"
        imagePullPolicy: IfNotPresent
        name: clean-cilium-state
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
          privileged: true
        volumeMounts:
        - mountPath: /var/run/cilium
          name: cilium-run
      restartPolicy: Always
      priorityClassName: system-node-critical
      serviceAccount: cilium
      serviceAccountName: cilium
      terminationGracePeriodSeconds: 1
      tolerations:
      - operator: Exists
      volumes:
        # To keep state between restarts / upgrades
      - hostPath:
          path: /var/run/cilium
          type: DirectoryOrCreate
        name: cilium-run
      # To install cilium cni plugin in the host
      - hostPath:
          path:  /opt/cni/bin
          type: DirectoryOrCreate
        name: cni-path
        # To install cilium cni configuration in the host
      - hostPath:
          path: /etc/cni/net.d
          type: DirectoryOrCreate
        name: etc-cni-netd
        # To be able to load kernel modules
      - hostPath:
          path: /lib/modules
        name: lib-modules
        # To access iptables concurrently with other processes (e.g. kube-proxy)
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
        # To read the clustermesh configuration
      - name: clustermesh-secrets
        secret:
          defaultMode: 420
          optional: true
          secretName: cilium-clustermesh
        # To read the configuration from the config map
      - configMap:
          name: cilium-config
        name: cilium-config-path
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2
    type: RollingUpdate
---
# Source: cilium/charts/operator/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    io.cilium/app: operator
    name: cilium-operator
  name: cilium-operator
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      io.cilium/app: operator
      name: cilium-operator
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
      labels:
        io.cilium/app: operator
        name: cilium-operator
    spec:
      containers:
      - args:
        - --debug=$(CILIUM_DEBUG)
        - --identity-allocation-mode=$(CILIUM_IDENTITY_ALLOCATION_MODE)
        - --synchronize-k8s-nodes=true
        command:
        - cilium-operator
        env:
        - name: CILIUM_K8S_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CILIUM_DEBUG
          valueFrom:
            configMapKeyRef:
              key: debug
              name: cilium-config
              optional: true
        - name: CILIUM_CLUSTER_NAME
          valueFrom:
            configMapKeyRef:
              key: cluster-name
              name: cilium-config
              optional: true
        - name: CILIUM_CLUSTER_ID
          valueFrom:
            configMapKeyRef:
              key: cluster-id
              name: cilium-config
              optional: true
        - name: CILIUM_IPAM
          valueFrom:
            configMapKeyRef:
              key: ipam
              name: cilium-config
              optional: true
        - name: CILIUM_DISABLE_ENDPOINT_CRD
          valueFrom:
            configMapKeyRef:
              key: disable-endpoint-crd
              name: cilium-config
              optional: true
        - name: CILIUM_KVSTORE
          valueFrom:
            configMapKeyRef:
              key: kvstore
              name: cilium-config
              optional: true
        - name: CILIUM_KVSTORE_OPT
          valueFrom:
            configMapKeyRef:
              key: kvstore-opt
              name: cilium-config
              optional: true
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              key: AWS_ACCESS_KEY_ID
              name: cilium-aws
              optional: true
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: AWS_SECRET_ACCESS_KEY
              name: cilium-aws
              optional: true
        - name: AWS_DEFAULT_REGION
          valueFrom:
            secretKeyRef:
              key: AWS_DEFAULT_REGION
              name: cilium-aws
              optional: true
        - name: CILIUM_IDENTITY_ALLOCATION_MODE
          valueFrom:
            configMapKeyRef:
              key: identity-allocation-mode
              name: cilium-config
              optional: true
        image: "docker.io/cilium/operator:v1.7.6"
        imagePullPolicy: IfNotPresent
        name: cilium-operator
        livenessProbe:
          httpGet:
            host: '127.0.0.1'
            path: /healthz
            port: 9234
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 3
      hostNetwork: true
      restartPolicy: Always
      serviceAccount: cilium-operator
      serviceAccountName: cilium-operator

Here's the scheduler log:
I0709 23:08:22.055830       1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0709 23:08:22.056081       1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0709 23:08:23.137451       1 serving.go:313] Generated self-signed cert in-memory
W0709 23:08:33.843509       1 authentication.go:297] Error looking up in-cluster authentication configuration: etcdserver: request timed out
W0709 23:08:33.843671       1 authentication.go:298] Continuing without authentication configuration. This may treat all requests as anonymous.
W0709 23:08:33.843710       1 authentication.go:299] To require authentication configuration lookup to succeed, set --authentication-tolerate-lookup-failure=false
I0709 23:08:33.911805       1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0709 23:08:33.911989       1 registry.go:150] Registering EvenPodsSpread predicate and priority function
W0709 23:08:33.917999       1 authorization.go:47] Authorization is disabled
W0709 23:08:33.918162       1 authentication.go:40] Authentication is disabled
I0709 23:08:33.918238       1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I0709 23:08:33.925860       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0709 23:08:33.926013       1 shared_informer.go:223] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0709 23:08:33.930685       1 secure_serving.go:178] Serving securely on 127.0.0.1:10259
I0709 23:08:33.936198       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0709 23:08:34.026382       1 shared_informer.go:230] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0709 23:08:34.036998       1 leaderelection.go:242] attempting to acquire leader lease  kube-system/kube-scheduler...
I0709 23:08:50.597201       1 leaderelection.go:252] successfully acquired lease kube-system/kube-scheduler
E0709 23:08:50.658551       1 factory.go:503] pod: kube-system/coredns-66bff467f8-9rjvd is already present in the active queue
E0709 23:12:27.673854       1 factory.go:503] pod kube-system/cilium-vv466 is already present in the backoff queue
E0709 23:12:58.099432       1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: leader changed

After restarting scheduler pods, the pending pod immediately schedules.

What pod events do you get? Do you know if there are taints on the node where it doesn't get scheduled? Does it only fail for master nodes or for any node? Is there enough space on the node?

On Thu., Jul. 9, 2020, 7:49 p.m., dilyevsky wrote:

It appears to be just the metadata.name of the node for the ds pod... weird. Here's the pod yaml:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: "2020-07-09T23:17:53Z"
  generateName: cilium-
  labels:
    controller-revision-hash: 6c94db8bb8
    k8s-app: cilium
    pod-template-generation: "1"
  managedFields:
  # managed fields crap
  name: cilium-d5n4f
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: cilium
    uid: 0f00e8af-eb19-4985-a940-a02fa84fcbc5
  resourceVersion: "2840"
  selfLink: /api/v1/namespaces/kube-system/pods/cilium-d5n4f
  uid: e3f7d566-ee5b-4557-8d1b-f0964cde2f22
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - us-central1-dilyevsky-master-qmwnl
  containers:
  - args:
    - --config-dir=/tmp/cilium/config-map
    command:
    - cilium-agent
    env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: CILIUM_K8S_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: CILIUM_FLANNEL_MASTER_DEVICE
      valueFrom:
        configMapKeyRef:
          key: flannel-master-device
          name: cilium-config
          optional: true
    - name: CILIUM_FLANNEL_UNINSTALL_ON_EXIT
      valueFrom:
        configMapKeyRef:
          key: flannel-uninstall-on-exit
          name: cilium-config
          optional: true
    - name: CILIUM_CLUSTERMESH_CONFIG
      value: /var/lib/cilium/clustermesh/
    - name: CILIUM_CNI_CHAINING_MODE
      valueFrom:
        configMapKeyRef:
          key: cni-chaining-mode
          name: cilium-config
          optional: true
    - name: CILIUM_CUSTOM_CNI_CONF
      valueFrom:
        configMapKeyRef:
          key: custom-cni-conf
          name: cilium-config
          optional: true
    image: docker.io/cilium/cilium:v1.7.6
    imagePullPolicy: IfNotPresent
    lifecycle:
      postStart:
        exec:
          command:
          - /cni-install.sh
          - --enable-debug=false
      preStop:
        exec:
          command:
          - /cni-uninstall.sh
    livenessProbe:
      exec:
        command:
        - cilium
        - status
        - --brief
      failureThreshold: 10
      initialDelaySeconds: 120
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    name: cilium-agent
    readinessProbe:
      exec:
        command:
        - cilium
        - status
        - --brief
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - SYS_MODULE
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/cilium
      name: cilium-run
    - mountPath: /host/opt/cni/bin
      name: cni-path
    - mountPath: /host/etc/cni/net.d
      name: etc-cni-netd
    - mountPath: /var/lib/cilium/clustermesh
      name: clustermesh-secrets
      readOnly: true
    - mountPath: /tmp/cilium/config-map
      name: cilium-config-path
      readOnly: true
    - mountPath: /lib/modules
      name: lib-modules
      readOnly: true
    - mountPath: /run/xtables.lock
      name: xtables-lock
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: cilium-token-j74lr
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  initContainers:
  - command:
    - /init-container.sh
    env:
    - name: CILIUM_ALL_STATE
      valueFrom:
        configMapKeyRef:
          key: clean-cilium-state
          name: cilium-config
          optional: true
    - name: CILIUM_BPF_STATE
      valueFrom:
        configMapKeyRef:
          key: clean-cilium-bpf-state
          name: cilium-config
          optional: true
    - name: CILIUM_WAIT_BPF_MOUNT
      valueFrom:
        configMapKeyRef:
          key: wait-bpf-mount
          name: cilium-config
          optional: true
    image: docker.io/cilium/cilium:v1.7.6
    imagePullPolicy: IfNotPresent
    name: clean-cilium-state
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/cilium
      name: cilium-run
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: cilium-token-j74lr
      readOnly: true
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: cilium
  serviceAccountName: cilium
  terminationGracePeriodSeconds: 1
  tolerations:
  - operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/run/cilium
      type: DirectoryOrCreate
    name: cilium-run
  - hostPath:
      path: /opt/cni/bin
      type: DirectoryOrCreate
    name: cni-path
  - hostPath:
      path: /etc/cni/net.d
      type: DirectoryOrCreate
    name: etc-cni-netd
  - hostPath:
      path: /lib/modules
      type: ""
    name: lib-modules
  - hostPath:
      path: /run/xtables.lock
      type: FileOrCreate
    name: xtables-lock
  - name: clustermesh-secrets
    secret:
      defaultMode: 420
      optional: true
      secretName: cilium-clustermesh
  - configMap:
      defaultMode: 420
      name: cilium-config
    name: cilium-config-path
  - name: cilium-token-j74lr
    secret:
      defaultMode: 420
      secretName: cilium-token-j74lr
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-07-09T23:17:53Z"
    message: '0/6 nodes are available: 5 node(s) didn''t match node selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort

The way I reproduce this is by spinning up a new cluster with 2 masters and 3 worker nodes (using Cluster API) and applying the Cilium 1.7.6 manifest shown above.



Could you try increasing the loglevel and using grep to filter for the node
or the pod?


These are events:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling        default-scheduler  0/6 nodes are available: 5 node(s) didn't match node selector.
  Warning  FailedScheduling        default-scheduler  0/6 nodes are available: 5 node(s) didn't match node selector.

The node only has two taints, but the pod tolerates all existing taints, and yeah, it seems to only happen on masters:

Taints: node-role.kubernetes.io/master:NoSchedule
        node.kubernetes.io/network-unavailable:NoSchedule

There is enough space and the pod is best effort with no reservation anyway:

  Resource                   Requests    Limits
  --------                   --------    ------
  cpu                        650m (32%)  0 (0%)
  memory                     70Mi (0%)   170Mi (2%)
  ephemeral-storage          0 (0%)      0 (0%)
  hugepages-1Gi              0 (0%)      0 (0%)
  hugepages-2Mi              0 (0%)      0 (0%)
  attachable-volumes-gce-pd  0           0

I'll try increasing the scheduler log level now...

Your pod yaml doesn't actually have a node-role.kubernetes.io/master toleration, so it shouldn't have been scheduled on the master.

Hi! We are hitting the same issue. However, we also see the problem with deployments, where we use anti-affinity to make sure a pod gets scheduled on each node, or a node selector targeting a specific node.
Simply creating a pod with a node selector set to match the hostname of the failing node was sufficient to make scheduling fail. It said that 5 nodes did not match the selector, but nothing about the sixth. Restarting the scheduler solved the issue. It looks like something gets cached about that node and prevents scheduling on it.
As other people said before, there is nothing in the logs about the failure.

We stripped the failing deployment down to the bare minimum (we had removed the taint on the failing master):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
      restartPolicy: Always
      schedulerName: default-scheduler
      nodeSelector:
        kubernetes.io/hostname: master-2

We were having the same issue when the master had a taint and the deployment had a toleration for it, so it does not seem related to daemonsets, tolerations, or affinity / anti-affinity specifically. Once the failure starts happening, nothing that targets the specific node can be scheduled. We see the issue from 1.18.2 up to 1.18.5 (we did not try 1.18.0 or 1.18.1).

Simply creating a pod with a node selector set to match the hostname of the failing node was sufficient to cause the scheduling to fail

Could you clarify whether it started failing after you created such a pod, or before? I assume the node didn't have a taint that the pod didn't tolerate.

@nodo is going to help reproduce this. Could you look at the code for NodeSelector? You might need to add extra log lines while testing. You can also print the cache.

  • Get PID of kube-scheduler: $ pidof kube-scheduler
  • Trigger queue dump: $ sudo kill -SIGUSR2 <pid>. Note this won't kill the scheduler process.
  • Then in scheduler log, search for strings "Dump of cached NodeInfo", "Dump of scheduling queue" and "cache comparer started".

/priority critical-urgent

/unassign

We were already seeing some daemonsets and deployments stuck in "Pending" before we tried deploying this test deployment, so it was already failing, and the taints had been removed from the node.
Right now we have lost the environment where this was happening because we had to reboot the nodes, so the issue is not visible anymore. As soon as we reproduce it, we will come back with more info.

Please do so. I have tried to reproduce this in the past without success. I'm more interested in the first instance of failure. It might still be related to taints.

We have reproduced the issue. I ran the command you asked for; here is the info:

I0716 14:47:52.768362       1 factory.go:462] Unable to schedule default/test-deployment-558f47bbbb-4rt5t: no fit: 0/6 nodes are available: 5 node(s) didn't match node selector.; waiting
I0716 14:47:52.768683       1 scheduler.go:776] Updating pod condition for default/test-deployment-558f47bbbb-4rt5t to (PodScheduled==False, Reason=Unschedulable)
I0716 14:47:53.018781       1 httplog.go:90] verb="GET" URI="/healthz" latency=299.172µs resp=200 UserAgent="kube-probe/1.18" srcIP="127.0.0.1:57258": 
I0716 14:47:59.469828       1 comparer.go:42] cache comparer started
I0716 14:47:59.470936       1 comparer.go:67] cache comparer finished
I0716 14:47:59.471038       1 dumper.go:47] Dump of cached NodeInfo
I0716 14:47:59.471484       1 dumper.go:49] 
Node name: master-0-bug
Requested Resources: {MilliCPU:1100 Memory:52428800 EphemeralStorage:0 AllowedPodNumber:0 ScalarResources:map[]}
Allocatable Resources:{MilliCPU:2000 Memory:3033427968 EphemeralStorage:19290208634 AllowedPodNumber:110 ScalarResources:map[hugepages-1Gi:0 hugepages-2Mi:0]}
Scheduled Pods(number: 9):
...
I0716 14:47:59.472623       1 dumper.go:60] Dump of scheduling queue:
name: coredns-cd64c8d7c-29zjq, namespace: kube-system, uid: 938e8827-5d17-4db9-ac04-d229baf4534a, phase: Pending, nominated node: 
name: test-deployment-558f47bbbb-4rt5t, namespace: default, uid: fa19fda9-c8d6-4ffe-b248-8ddd24ed5310, phase: Pending, nominated node: 

Unfortunately that does not seem to help

Dumping the cache is for debugging, it won't change anything. Could you please include the dump?

Also, assuming this was the first error, could you include the pod yaml and node?

That's pretty much everything that was dumped; I just removed the other nodes. This was not the first error, but you can see the coredns pod in the dump, which was the first one. I am not sure what else you are asking for in the dump.
I'll fetch the yamls.

Thanks, I didn't realize that you had trimmed the relevant node and pod.

Could you include the scheduled pods for that node though? Just in case there is a bug in resource usage calculations.

Requested Resources: {MilliCPU:1100 Memory:52428800 EphemeralStorage:0 AllowedPodNumber:0 ScalarResources:map[]}

That AllowedPodNumber: 0 seems odd.

Here are the other pods on that node:

name: kube-controller-manager-master-0-bug, namespace: kube-system, uid: 095eebb0-4752-419b-aac7-245e5bc436b8, phase: Running, nominated node:
name: kube-proxy-xwf6h, namespace: kube-system, uid: 16552eaf-9eb8-4584-ba3c-7dff6ce92592, phase: Running, nominated node:
name: kube-apiserver-master-0-bug, namespace: kube-system, uid: 1d338e26-b0bc-4cef-9bad-86b7dd2b2385, phase: Running, nominated node:
name: kube-multus-ds-amd64-tpkm8, namespace: kube-system, uid: d50c0c7f-599c-41d5-a029-b43352a4f5b8, phase: Running, nominated node:
name: openstack-cloud-controller-manager-wrb8n, namespace: kube-system, uid: 17aeb589-84a1-4416-a701-db6d8ef60591, phase: Running, nominated node:
name: kube-scheduler-master-0-bug, namespace: kube-system, uid: 52469084-3122-4e99-92f6-453e512b640f, phase: Running, nominated node:
name: subport-controller-28j9v, namespace: kube-system, uid: a5a07ac8-763a-4ff2-bdae-91c6e9e95698, phase: Running, nominated node:
name: csi-cinder-controllerplugin-0, namespace: kube-system, uid: 8b16d6c8-a871-454e-98a3-0aa545f9c9d0, phase: Running, nominated node:
name: calico-node-d899t, namespace: kube-system, uid: e3672030-53b1-4356-a5df-0f4afd6b9237, phase: Running, nominated node:

All the nodes have allowedPodNumber set to 0 in the requested resources in the dump, but the other nodes are schedulable

The node yaml:

apiVersion: v1
kind: Node
metadata:
  annotations:
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-07-16T09:59:48Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: 54019dbc-10d7-409c-8338-5556f61a9371
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: regionOne
    failure-domain.beta.kubernetes.io/zone: nova
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: master-0-bug
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
    node.kubernetes.io/instance-type: 54019dbc-10d7-409c-8338-5556f61a9371
    node.uuid: 00324054-405e-4fae-a3bf-d8509d511ded
    node.uuid_source: cloud-init
    topology.kubernetes.io/region: regionOne
    topology.kubernetes.io/zone: nova
  name: master-0-bug
  resourceVersion: "85697"
  selfLink: /api/v1/nodes/master-0-bug
  uid: 629b6ef3-3c76-455b-8b6b-196c4754fb0e
spec:
  podCIDR: 192.168.0.0/24
  podCIDRs:
  - 192.168.0.0/24
  providerID: openstack:///00324054-405e-4fae-a3bf-d8509d511ded
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
status:
  addresses:
  - address: 10.0.10.14
    type: InternalIP
  - address: master-0-bug
    type: Hostname
  allocatable:
    cpu: "2"
    ephemeral-storage: "19290208634"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 2962332Ki
    pods: "110"
  capacity:
    cpu: "2"
    ephemeral-storage: 20931216Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 3064732Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2020-07-16T10:02:20Z"
    lastTransitionTime: "2020-07-16T10:02:20Z"
    message: Calico is running on this node
    reason: CalicoIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2020-07-16T15:46:11Z"
    lastTransitionTime: "2020-07-16T09:59:43Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2020-07-16T15:46:11Z"
    lastTransitionTime: "2020-07-16T09:59:43Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2020-07-16T15:46:11Z"
    lastTransitionTime: "2020-07-16T09:59:43Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2020-07-16T15:46:11Z"
    lastTransitionTime: "2020-07-16T10:19:44Z"
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  nodeInfo:
    architecture: amd64
    bootID: fe410ed3-2825-4f94-a9f9-08dc5e6a955e
    containerRuntimeVersion: docker://19.3.11
    kernelVersion: 4.12.14-197.45-default
    kubeProxyVersion: v1.18.5
    kubeletVersion: v1.18.5
    machineID: 00324054405e4faea3bfd8509d511ded
    operatingSystem: linux
    systemUUID: 00324054-405e-4fae-a3bf-d8509d511ded

And the pod:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-07-16T10:13:35Z"
  generateName: pm-node-exporter-
  labels:
    controller-revision-hash: 6466d9c7b
    pod-template-generation: "1"
  name: pm-node-exporter-mn9vj
  namespace: monitoring
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: pm-node-exporter
    uid: 5855a26f-a57e-4b0e-93f2-461c19c477e1
  resourceVersion: "5239"
  selfLink: /api/v1/namespaces/monitoring/pods/pm-node-exporter-mn9vj
  uid: 0db09c9c-1618-4454-94fa-138e55e5ebd7
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - master-0-bug
  containers:
  - args:
    - --path.procfs=/host/proc
    - --path.sysfs=/host/sys
    image: ***
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: 9100
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    name: pm-node-exporter
    ports:
    - containerPort: 9100
      hostPort: 9100
      name: metrics
      protocol: TCP
    resources:
      limits:
        cpu: 200m
        memory: 150Mi
      requests:
        cpu: 100m
        memory: 100Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host/proc
      name: proc
      readOnly: true
    - mountPath: /host/sys
      name: sys
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: pm-node-exporter-token-csllf
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  hostPID: true
  nodeSelector:
    node-role.kubernetes.io/master: ""
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: pm-node-exporter
  serviceAccountName: pm-node-exporter
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /proc
      type: ""
    name: proc
  - hostPath:
      path: /sys
      type: ""
    name: sys
  - name: pm-node-exporter-token-csllf
    secret:
      defaultMode: 420
      secretName: pm-node-exporter-token-csllf
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-07-16T10:13:35Z"
    message: '0/6 nodes are available: 2 node(s) didn''t have free ports for the requested
      pod ports, 3 node(s) didn''t match node selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

Thanks a lot for all the information. @nodo can you take it?

/help

@maelk feel free to take this and submit a PR if you find the bug. The log lines you added are likely to be helpful. Otherwise, I'm opening it up to contributors.

@alculquicondor:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


/assign

@maelk Is there anything specific to timing when this issue occurs first time? For example, does it happen right after node starts?

No, there are quite a few pods that get scheduled there and run fine. But once the issue happens, none can be scheduled there anymore.

Lowering priority until we have a reproducible case.

We were able to reproduce the bug with a scheduler that had the additional log entries. What we see is that one of the masters completely disappears from the list of nodes that are iterated through. We can see that the process starts with the 6 nodes (from the snapshot):

I0720 13:58:28.246507       1 generic_scheduler.go:441] Looking for a node for kube-system/coredns-cd64c8d7c-tcxbq, going through []*nodeinfo.NodeInfo{(*nodeinfo.NodeInfo)(0xc000326a90), (*nodeinfo.NodeInfo)(0xc000952000), (*nodeinfo.NodeInfo)(0xc0007d08f0), (*nodeinfo.NodeInfo)(0xc0004f35f0), (*nodeinfo.NodeInfo)(0xc000607040), (*nodeinfo.NodeInfo)(0xc000952000)}

but after that, we can see that it iterates over only 5 nodes, and then we get:

I0720 13:58:28.247420       1 generic_scheduler.go:505] pod kube-system/coredns-cd64c8d7c-tcxbq : processed 5 nodes, 0 fit

So one of the nodes is removed from the list of potential nodes. Unfortunately we did not have enough logging at the start of the process, but we'll try to get more.

Code references by log line:

  1. https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R441
  2. https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R505

@maelk
Did you see any lines for %v/%v on node %v, too many nodes fit?

Otherwise, @pancernik could you check for bugs on workqueue.ParallelizeUntil(ctx, 16, len(allNodes), checkNode)?

No, that log did not appear. I would also think it could be that we either have an issue with the parallelization or that the node is filtered out earlier. If it was failing with an error here: https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R464 it would be visible in the logs afaik, so I'll try to add more debug entries specifically around that function and the parallelization.
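For what it's worth, ParallelizeUntil just fans out indices into whatever slice it is handed; it has no notion of node identity. A minimal Go sketch (a simplified stand-in, not the real client-go implementation, with made-up node names) of that index-based fan-out:

package main

import (
	"fmt"
	"sync"
)

// parallelizeUntil is a simplified stand-in for client-go's
// workqueue.ParallelizeUntil: it hands out each index in [0, pieces)
// to a pool of workers exactly once.
func parallelizeUntil(workers, pieces int, doWorkPiece func(int)) {
	toProcess := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		toProcess <- i
	}
	close(toProcess)

	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for piece := range toProcess {
				doWorkPiece(piece)
			}
		}()
	}
	wg.Wait()
}

func main() {
	// Hypothetical snapshot list: "node-b" appears twice and "node-f" is
	// missing, mirroring the duplicated NodeInfo pointer seen in the log.
	nodes := []string{"node-a", "node-b", "node-c", "node-d", "node-e", "node-b"}

	var mu sync.Mutex
	seen := map[string]int{}
	parallelizeUntil(16, len(nodes), func(i int) {
		// The real checkNode would run the filter plugins here.
		mu.Lock()
		seen[nodes[i]]++
		mu.Unlock()
	})
	fmt.Println(seen) // node-b is filtered twice; node-f is never looked at
}

If the snapshot's node list already contains one NodeInfo twice and is missing another, every index is still processed exactly once, so a duplicated filter run would point at the list itself rather than at the parallelization.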

I just realized that one node is going through the filtering twice!

The logs are:

I0720 13:58:28.246507       1 generic_scheduler.go:441] Looking for a node for kube-system/coredns-cd64c8d7c-tcxbq, going through []*nodeinfo.NodeInfo{(*nodeinfo.NodeInfo)(0xc000326a90), (*nodeinfo.NodeInfo)(0xc000952000), (*nodeinfo.NodeInfo)(0xc0007d08f0), (*nodeinfo.NodeInfo)(0xc0004f35f0), (*nodeinfo.NodeInfo)(0xc000607040), (*nodeinfo.NodeInfo)(0xc000952000)}
I0720 13:58:28.246793       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-60846k0y-scheduler, fits: false, status: &v1alpha1.Status{code:3, reasons:[]string{"node(s) didn't match node selector"}}
I0720 13:58:28.246970       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-60846k0y-scheduler : status is not success
I0720 13:58:28.246819       1 taint_toleration.go:71] Checking taints for pod kube-system/coredns-cd64c8d7c-tcxbq for node master-0-scheduler : taints : []v1.Taint{v1.Taint{Key:"node-role.kubernetes.io/master", Value:"", Effect:"NoSchedule", TimeAdded:(*v1.Time)(nil)}} and tolerations: []v1.Toleration{v1.Toleration{Key:"node-role.kubernetes.io/master", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"CriticalAddonsOnly", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node-role.kubernetes.io/master", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node-role.kubernetes.io/not-ready", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node.kubernetes.io/not-ready", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc000d40d90)}, v1.Toleration{Key:"node.kubernetes.io/unreachable", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc000d40db0)}}
I0720 13:58:28.247019       1 taint_toleration.go:71] Checking taints for pod kube-system/coredns-cd64c8d7c-tcxbq for node master-2-scheduler : taints : []v1.Taint{v1.Taint{Key:"node-role.kubernetes.io/master", Value:"", Effect:"NoSchedule", TimeAdded:(*v1.Time)(nil)}} and tolerations: []v1.Toleration{v1.Toleration{Key:"node-role.kubernetes.io/master", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"CriticalAddonsOnly", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node-role.kubernetes.io/master", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node-role.kubernetes.io/not-ready", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node.kubernetes.io/not-ready", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc000d40d90)}, v1.Toleration{Key:"node.kubernetes.io/unreachable", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc000d40db0)}}
I0720 13:58:28.247144       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node master-2-scheduler, fits: false, status: &v1alpha1.Status{code:2, reasons:[]string{"node(s) didn't match pod affinity/anti-affinity", "node(s) didn't satisfy existing pods anti-affinity rules"}}
I0720 13:58:28.247172       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node master-2-scheduler : status is not success
I0720 13:58:28.247210       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-7dt1xd4k-scheduler, fits: false, status: &v1alpha1.Status{code:3, reasons:[]string{"node(s) didn't match node selector"}}
I0720 13:58:28.247231       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-7dt1xd4k-scheduler : status is not success
I0720 13:58:28.247206       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-60846k0y-scheduler, fits: false, status: &v1alpha1.Status{code:3, reasons:[]string{"node(s) didn't match node selector"}}
I0720 13:58:28.247297       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-60846k0y-scheduler : status is not success
I0720 13:58:28.247246       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-hyk0hg7r-scheduler, fits: false, status: &v1alpha1.Status{code:3, reasons:[]string{"node(s) didn't match node selector"}}
I0720 13:58:28.247340       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-hyk0hg7r-scheduler : status is not success
I0720 13:58:28.247147       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node master-0-scheduler, fits: false, status: &v1alpha1.Status{code:2, reasons:[]string{"node(s) didn't match pod affinity/anti-affinity", "node(s) didn't satisfy existing pods anti-affinity rules"}}
I0720 13:58:28.247375       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node master-0-scheduler : status is not success
I0720 13:58:28.247420       1 generic_scheduler.go:505] pod kube-system/coredns-cd64c8d7c-tcxbq : processed 5 nodes, 0 fit
I0720 13:58:28.247461       1 generic_scheduler.go:430] pod kube-system/coredns-cd64c8d7c-tcxbq After scheduling, filtered: []*v1.Node{}, filtered nodes: v1alpha1.NodeToStatusMap{"master-0-scheduler":(*v1alpha1.Status)(0xc000d824a0), "master-2-scheduler":(*v1alpha1.Status)(0xc000b736c0), "worker-pool1-60846k0y-scheduler":(*v1alpha1.Status)(0xc000d825a0), "worker-pool1-7dt1xd4k-scheduler":(*v1alpha1.Status)(0xc000b737e0), "worker-pool1-hyk0hg7r-scheduler":(*v1alpha1.Status)(0xc000b738c0)}
I0720 13:58:28.247527       1 generic_scheduler.go:185] Pod kube-system/coredns-cd64c8d7c-tcxbq failed scheduling:
  nodes snapshot: &cache.Snapshot{nodeInfoMap:map[string]*nodeinfo.NodeInfo{"master-0-scheduler":(*nodeinfo.NodeInfo)(0xc000607040), "master-1-scheduler":(*nodeinfo.NodeInfo)(0xc0001071e0), "master-2-scheduler":(*nodeinfo.NodeInfo)(0xc000326a90), "worker-pool1-60846k0y-scheduler":(*nodeinfo.NodeInfo)(0xc000952000), "worker-pool1-7dt1xd4k-scheduler":(*nodeinfo.NodeInfo)(0xc0007d08f0), "worker-pool1-hyk0hg7r-scheduler":(*nodeinfo.NodeInfo)(0xc0004f35f0)}, nodeInfoList:[]*nodeinfo.NodeInfo{(*nodeinfo.NodeInfo)(0xc000326a90), (*nodeinfo.NodeInfo)(0xc000952000), (*nodeinfo.NodeInfo)(0xc0007d08f0), (*nodeinfo.NodeInfo)(0xc0004f35f0), (*nodeinfo.NodeInfo)(0xc000607040), (*nodeinfo.NodeInfo)(0xc000952000)}, havePodsWithAffinityNodeInfoList:[]*nodeinfo.NodeInfo{(*nodeinfo.NodeInfo)(0xc000326a90), (*nodeinfo.NodeInfo)(0xc000607040)}, generation:857} 
  statuses: v1alpha1.NodeToStatusMap{"master-0-scheduler":(*v1alpha1.Status)(0xc000d824a0), "master-2-scheduler":(*v1alpha1.Status)(0xc000b736c0), "worker-pool1-60846k0y-scheduler":(*v1alpha1.Status)(0xc000d825a0), "worker-pool1-7dt1xd4k-scheduler":(*v1alpha1.Status)(0xc000b737e0), "worker-pool1-hyk0hg7r-scheduler":(*v1alpha1.Status)(0xc000b738c0)} 

As you can see, the node worker-pool1-60846k0y-scheduler goes through filtering twice.

No, that log did not appear. I would also think it could be that we either have an issue with the parallelization or that the node is filtered out earlier. If it was failing with an error here: Nordix@5c00cdf#diff-c237cdd9e4cb201118ca380732d7f361R464 it would be visible in the logs afaik, so I'll try to add more debug entries specifically around that function and the parallelization.

Yeah, an error there would manifest as a Scheduling Error in the pod events.

I just realized that one node is going through the filtering twice!

I honestly wouldn't think that the parallelization has bugs (still worth checking), but this could be a sign that we failed to build the snapshot from the cache by adding a node twice (as we saw from the cache dump, the cache itself is correct). Since statuses is a map, it makes sense that we only "see" 5 nodes in the last log line.
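To spell out the "statuses is a map" point, here is a tiny Go sketch (illustrative names only) showing how iterating 6 list entries that contain a duplicate still yields only 5 recorded statuses:

package main

import "fmt"

func main() {
	// Hypothetical snapshot list: 6 entries, but "worker-a" is in it twice
	// and one real node is missing entirely.
	snapshotList := []string{"master-0", "master-2", "worker-a", "worker-b", "worker-c", "worker-a"}

	statuses := map[string]string{}
	for _, node := range snapshotList {
		statuses[node] = "Unschedulable" // one entry per unique node name
	}
	fmt.Println(len(snapshotList), "entries iterated, but only", len(statuses), "statuses recorded")
	// Output: 6 entries iterated, but only 5 statuses recorded
}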

This is the code (tip of 1.18) https://github.com/kubernetes/kubernetes/blob/ec73e191f47b7992c2f40fadf1389446d6661d6d/pkg/scheduler/internal/cache/cache.go#L203

cc @ahg-g

I will try to add a lot of logs around the cache part of the scheduler, specifically around node addition and update, and around the snapshot. However, from the last line of the logs, you can see that the snapshot is actually correct and contains all the nodes, so whatever happens seems to happen later on, when working over that snapshot.

cache != snapshot

Cache is the living thing that gets updated from events. The snapshot is updated (from the cache) before each scheduling cycle to "lock" the state. We added optimizations to make this last process as fast as possible. It's possible that the bug is there.

Thanks @maelk! This is very useful. Your logs indicate that (*nodeinfo.NodeInfo)(0xc000952000) is duplicated in the list already at https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R441 before any parallel code gets executed. That would indeed mean that it's duplicated before the snapshot is updated.

Actually, that comes from the snapshot, which is taken before this log message: https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R161. So it rather looks like the content of the snapshot has the duplication, since it comes from https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R436

That's right. I meant it's already duplicated before the update of the snapshot finishes.

That's right. I meant it's already duplicated before the update of the snapshot finishes.

No, the snapshot is updated at the start of the scheduling cycle. The bug is either during the snapshot update or before that. But the cache is correct, according to the dump in https://github.com/kubernetes/kubernetes/issues/91601#issuecomment-659465008

EDIT: I read it wrong, I didn't see the word "finishes" :)

I wonder if the node tree also has duplicate records

@maelk could you show a dump of the full list of nodes in the cache?

We don't add or remove items from NodeInfoList; we either create the full list from the tree or not. So if there are duplicates, they are likely coming from the tree, I think.

Just to clarify:
1) the cluster has 6 nodes (including the masters)
2) the node that is supposed to host the pod wasn't examined at all (no log line indicating that), which may mean it is not in NodeInfoList at all
3) NodeInfoList has 6 nodes, but one of them is a duplicate

I wonder if the node tree also has duplicate records

@maelk could you show a dump of the full list of nodes in the cache?

a dump of each of the node tree, list, and map would be great.

I'll work on getting those. In the meantime, a small update. We can see in the logs :

I0720 13:37:30.530980       1 node_tree.go:100] Removed node "worker-pool1-60846k0y-scheduler" in group "" from NodeTree
I0720 13:37:30.531136       1 node_tree.go:86] Added node "worker-pool1-60846k0y-scheduler" in group "regionOne:\x00:nova" to NodeTree

And that is the exact point when the missing node disappears. The last occurrence in the logs is at 13:37:24. In the next scheduling cycle, the missing node is gone. So it looks like the bug is in, or follows, the update of the node_tree. All nodes go through that update; it's just that this worker-pool1-60846k0y node is the last one to go through it.

When dumping the cache (with SIGUSR2) all six nodes are listed there, with the pods running on the nodes, without duplication or missing nodes.

We'll give it a new try with added debug around the snapshot functionality : https://github.com/Nordix/kubernetes/commit/53279fb06536558f9a91836c771b182791153791

Removed node "worker-pool1-60846k0y-scheduler" in group "" from NodeTree

Interesting. I think the remove/add are triggered by an updateNode call. The zone key is missing on the remove but exists on the add, so the update was basically adding the zone and region labels?

Do you have other scheduler logs related to this node?

We're trying to reproduce the bug with the added logging. I'll come back when I have more info

I'll work on getting those. In the meantime, a small update. We can see in the logs :

I0720 13:37:30.530980       1 node_tree.go:100] Removed node "worker-pool1-60846k0y-scheduler" in group "" from NodeTree
I0720 13:37:30.531136       1 node_tree.go:86] Added node "worker-pool1-60846k0y-scheduler" in group "regionOne:\x00:nova" to NodeTree

I'll point out that this is the node that gets repeated. @maelk, did you see similar messages for other nodes, or not at all? As @ahg-g said, this should be expected when a node receives its topology labels for the first time.

Yes, it happened for all nodes, and it is expected. The coincidence is that this node specifically is the last one to be updated, and it is at that exact time that the other node goes missing.
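
For reference on the log format above: the "group" string is a region/zone key built from the node's topology labels, which is why it is empty ("") before the labels are applied and becomes "regionOne:\x00:nova" afterwards. Below is a rough sketch of how such a key can be derived; it is illustrative only, and the exact label keys and helper used upstream may differ from the assumptions made here.

package main

import "fmt"

// zoneKey is an illustrative stand-in for the helper that groups nodes in the
// node tree. The label keys below are assumptions made for this demo.
func zoneKey(labels map[string]string) string {
    region := labels["failure-domain.beta.kubernetes.io/region"]
    zone := labels["failure-domain.beta.kubernetes.io/zone"]
    if region == "" && zone == "" {
        return "" // a node without topology labels lands in the "" group
    }
    return region + ":\x00:" + zone // same shape as the "regionOne:\x00:nova" seen in the logs
}

func main() {
    fmt.Printf("%q\n", zoneKey(map[string]string{})) // ""
    fmt.Printf("%q\n", zoneKey(map[string]string{
        "failure-domain.beta.kubernetes.io/region": "regionOne",
        "failure-domain.beta.kubernetes.io/zone":   "nova",
    })) // "regionOne:\x00:nova"
}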

Did you get update logs for the missing node?

Did you get update logs for the missing node?

lol, was just typing this question.

Perhaps the bug is that the whole zone is deleted from the tree before all of its nodes are removed.

Just to clarify, I'm not personally looking at the code, I'm just trying to make sure we have all the information. And I think that, with what we have now, we should be able to spot the bug. Feel free to submit PRs, and even better if you can provide a unit test that fails.

Did you get update logs for the missing node?

yes, it shows that the zone is updated for that missing node. There is a log entry for all nodes

To be honest, I still have no clue about the reason for the bug, but if we can get close to finding it out, I'll submit a PR or unit tests.

yes, it shows that the zone is updated for that missing node. There is a log entry for all nodes

If so, then I think the assumption that this "is the exact point when the missing node disappears" may just be a coincidence. Let's wait for the new logs. It would be great if you can share all the scheduler logs you get in a file.

I'll do that when we reproduce with the new logging. From the existing logs, we can actually see that the pod scheduling immediately after that update is the first one to fail. But it does not give enough information to know what happened in between, so stay tuned...

@maelk Have you seen a message starting with "snapshot state is not consistent" in the scheduler logs?

Would it be possible for you to provide full scheduler logs?

No, that message is not present. I could give a stripped-down log file (to avoid repetition), but let's first wait until we have the output with more logs around the snapshot.

I have found the bug. The issue is with the nodeTree next() function, which does not return all of the nodes in some cases. https://github.com/kubernetes/kubernetes/blob/release-1.18/pkg/scheduler/internal/cache/node_tree.go#L147

It is visible if you add the following test case here: https://github.com/kubernetes/kubernetes/blob/release-1.18/pkg/scheduler/internal/cache/node_tree_test.go#L443

{
    name:           "add nodes to a new and to an exhausted zone",
    nodesToAdd:     append(allNodes[5:9], allNodes[3]),
    nodesToRemove:  nil,
    operations:     []string{"add", "add", "next", "next", "add", "add", "add", "next", "next", "next", "next"},
    expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
},

The main problem is that when you add a node, the indexes are not at 0 for some of the zones. For this to happen, you must have at least two zones, one shorter than the other, and the longer one must have an index that is not at 0 when the next function is called for the first time.

The fix I went with is to reset the index before calling next() the first time. I opened a PR to show my fix. Of course it is against the 1.18 release as this is what I have been working on, but it is mostly for discussing how to fix it (or maybe fix the next() function itself). I can open a proper PR towards master and do the backports if needed afterwards.
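
To make the failure mode concrete without running the scheduler, here is a minimal, self-contained model of this kind of per-zone round-robin iteration with persistent cursors (heavily simplified; this is not the actual node_tree code). With two uneven zones and a stale cursor in the longer zone, pulling exactly numNodes names produces a list that skips one node and repeats another, which matches the duplicated entry and the missing node observed in the snapshot.

package main

import "fmt"

// Simplified model of per-zone round-robin iteration with persistent cursors.
// This is NOT the real node_tree implementation; it only mimics the failure mode.
type zone struct {
    nodes []string
    idx   int // cursor carried over from previous iterations -- the stale state
}

type tree struct {
    zones     []*zone
    zoneIndex int
}

func (t *tree) next() string {
    if len(t.zones) == 0 {
        return ""
    }
    exhausted := 0
    for {
        z := t.zones[t.zoneIndex%len(t.zones)]
        t.zoneIndex++
        if z.idx >= len(z.nodes) {
            exhausted++
            if exhausted >= len(t.zones) {
                // Every zone looks exhausted, so reset the cursors and keep going.
                // This wrap-around is where the duplicates come from.
                for _, zz := range t.zones {
                    zz.idx = 0
                }
                exhausted = 0
            }
            continue
        }
        name := z.nodes[z.idx]
        z.idx++
        return name
    }
}

func main() {
    // Two uneven zones; the longer zone's cursor is not at 0 when the
    // snapshot starts asking for node names.
    t := &tree{zones: []*zone{
        {nodes: []string{"z1-n1", "z1-n2", "z1-n3"}, idx: 2},
        {nodes: []string{"z2-n1"}},
    }}
    for i := 0; i < 4; i++ { // the snapshot asks for exactly numNodes (4) names
        fmt.Println(t.next())
    }
    // Output: z1-n3, z2-n1, z1-n1, z2-n1 -- z1-n2 never appears, z2-n1 appears twice.
}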

I noticed the same problem with iteration. But I failed to link that to a duplicate in the snapshot. Have you managed to create a scenario where that would happen, @maelk?

Yes, you can reproduce it in the unit tests by adding the small test case I posted above.

I am now working on adding a test case for the snapshot, to make sure this is properly tested.

big thumbs up to @igraecao for the help in reproducing the issue and running the tests in his setup

Thanks all for debugging this notorious issue. Resetting the index before creating the list is safe, so I think we should go with that for 1.18 and 1.19 patches, and have a proper fix in the master branch.

The purpose of the next function changed with the introduction of the NodeInfoList, so we can certainly simplify it and perhaps change it to toList, a function that creates a list from the tree and simply starts from the beginning every time.

I understand the issue now: The calculation of whether or not a zone is exhausted is wrong because it doesn't consider where in each zone we started this "UpdateSnapshot" process. And yeah, it would only be visible with uneven zones.

Great job spotting this @maelk!

I would think we have the same issue in older versions. However, it was hidden by the fact that we did a tree pass every time, whereas in 1.18 we snapshot the result until there are changes in the tree.

Now that the round-robin strategy is implemented in generic_scheduler.go, we might be fine with simply resetting all counters before UpdateSnapshot, as your PR is doing.

https://github.com/kubernetes/kubernetes/blob/02cf58102a61b6d1e021e256381ff750573ce55d/pkg/scheduler/core/generic_scheduler.go#L357

Just to double check @ahg-g, this should be fine even in a cluster where new nodes are added/removed all the time, right?

Thanks @maelk for spotting the root cause!

The purpose of the next function changed with the introduction of the NodeInfoList, so we can certainly simplify it and perhaps change it to toList, a function that creates a list from the tree and simply starts from the beginning every time.

Given that cache.nodeTree.next() is only called in building the snapshot nodeInfoList, I think it's also safe to remove the indexes (both zoneIndex and nodeIndex) from nodeTree struct. Instead, come up with a simple nodeIterator() function to iterate through its zone/node in a round-robin manner.

BTW: there is a typo in https://github.com/kubernetes/kubernetes/issues/91601#issuecomment-662663090, the case should be:

{
    name:           "add nodes to a new and to an exhausted zone",
    nodesToAdd:     append(allNodes[6:9], allNodes[3]),
    nodesToRemove:  nil,
    operations:     []string{"add", "add", "next", "next", "add", "add", "next", "next", "next", "next"},
    expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
    // with the codebase on master and 1.18, its output is [node-6 node-7 node-3 node-8 node-6 node-3]
},

Just to double check @ahg-g, this should be fine even in a cluster where new nodes are added/removed all the time, right?

I am assuming you are talking about the logic in generic_scheduler.go. If so, yes: it doesn't matter much whether nodes were added or removed. The main thing we need to avoid is iterating over the nodes in the same order every time we schedule a pod; we just need a good approximation of iterating over the nodes across pods.
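
A rough sketch of that idea, illustrative only and not the actual generic_scheduler implementation: keep an offset that advances between scheduling attempts, so consecutive pods start filtering at different nodes even though each pod sees the snapshot in a fixed order.

package main

import "fmt"

// Illustrative only -- not the real generic_scheduler logic. The offset rotates
// between scheduling attempts so that filtering, which may stop early once it
// has found enough feasible nodes, does not always start from the same node.
type nodeRotator struct {
    startIndex int
}

func (r *nodeRotator) order(nodes []string) []string {
    if len(nodes) == 0 {
        return nil
    }
    out := make([]string, 0, len(nodes))
    for i := range nodes {
        out = append(out, nodes[(r.startIndex+i)%len(nodes)])
    }
    r.startIndex = (r.startIndex + 1) % len(nodes) // advance for the next pod
    return out
}

func main() {
    r := &nodeRotator{}
    nodes := []string{"n1", "n2", "n3", "n4"}
    fmt.Println(r.order(nodes)) // [n1 n2 n3 n4]
    fmt.Println(r.order(nodes)) // [n2 n3 n4 n1]
}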

Given that cache.nodeTree.next() is only called in building the snapshot nodeInfoList, I think it's also safe to remove the indexes (both zoneIndex and nodeIndex) from nodeTree struct. Instead, come up with a simple nodeIterator() function to iterate through its zone/node in a round-robin manner.

yes, we just need to iterate over all zones/nodes in the same order every time.

I have updated the PR with a unit test for the function that updates the snapshot list, for that bug specifically. I can also take care of refactoring the next() function to iterate over the zones and nodes without round-robin, hence removing the issue.

Thanks, sounds good, but we should still iterate between zones the same way we do now, that is by design.

I don't really get what you mean here. Is it that the order of the nodes matters and we must still go round-robin between zones, or can we list all nodes of a zone, one zone after the other? Let's say you have two zones of two nodes each; in which order do you expect them, or does it even matter at all?

The order matters; we need to alternate between zones while creating the list. If you have two zones of two nodes each, z1: {n11, n12} and z2: {n21, n22}, then the list should be {n11, n21, n12, n22}.
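
A sketch of a stateless list builder along those lines (the name toList and the input shape are assumptions, not the eventual implementation): it rebuilds the list from scratch on every call and interleaves the zones level by level, so no cursors survive between snapshots and uneven zones cannot cause skipped or duplicated nodes. For the example above it yields {n11, n21, n12, n22}.

package main

import "fmt"

// toList is a hypothetical stateless replacement for next(): it walks the zones
// level by level, alternating between them, and never keeps cursors around.
func toList(zones [][]string) []string {
    var out []string
    for i := 0; ; i++ {
        progressed := false
        for _, z := range zones {
            if i < len(z) {
                out = append(out, z[i])
                progressed = true
            }
        }
        if !progressed {
            return out
        }
    }
}

func main() {
    zones := [][]string{
        {"n11", "n12"}, // z1
        {"n21", "n22"}, // z2
    }
    fmt.Println(toList(zones)) // [n11 n21 n12 n22]
}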

OK, thanks, I'll give it a thought. Can we proceed with the quick fix in the meantime? BTW, some tests are failing on it, but I am not sure how that relates to my PR.

Those are flakes. Please send a patch to 1.18 as well.

Ok, will do. Thanks

{
  name:           "add nodes to a new and to an exhausted zone",
  nodesToAdd:     append(allNodes[5:9], allNodes[3]),
  nodesToRemove:  nil,
  operations:     []string{"add", "add", "next", "next", "add", "add", "add", "next", "next", "next", "next"},
  expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
},

@maelk, do you mean this test ignores node-5?

I found that after fixing the append in https://github.com/kubernetes/kubernetes/pull/93516, the test shows that all the nodes can be iterated:

{
            name:           "add nodes to a new and to an exhausted zone",
            nodesToAdd:     append(append(make([]*v1.Node, 0), allNodes[5:9]...), allNodes[3]),
            nodesToRemove:  nil,
            operations:     []string{"add", "add", "next", "next", "add", "add", "add", "next", "next", "next", "next"},
            expectedOutput: []string{"node-5", "node-6", "node-3", "node-7", "node-8", "node-5"},
},

Nodes 5, 6, 7, 8, and 3 can all be iterated.

Forgive me if I misunderstand something here.

Yes, it was on purpose, based on what was already there, but I can see how it can be cryptic, so it's better to make the append behave in a clearer way. Thanks for the patch.
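
For anyone skimming past why the original append was worth changing: in general, appending directly to a sub-slice can reuse the backing array of the original slice when there is spare capacity, silently overwriting a later element, whereas copying into a fresh slice first cannot. Whether or not that bit this particular test, the rewritten form is unambiguous. A small demonstration using strings rather than *v1.Node:

package main

import "fmt"

// General Go gotcha: appending to a sub-slice may reuse the backing array of the
// original slice when there is spare capacity, mutating elements past the sub-slice.
func main() {
    all := []string{"n0", "n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9"}

    a := append(all[5:9], all[3]) // cap(all[5:9]) == 5, so this writes into all[9]
    fmt.Println(a)                // [n5 n6 n7 n8 n3]
    fmt.Println(all[9])           // n3 -- the original slice was mutated

    all[9] = "n9" // restore before the comparison

    b := append(append(make([]string, 0, 5), all[5:9]...), all[3])
    fmt.Println(b)      // [n5 n6 n7 n8 n3]
    fmt.Println(all[9]) // n9 -- untouched this time
}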

How far back do you believe this bug was present? 1.17? 1.16? I've just seen the exact same problem in 1.17 on AWS and restarting the unscheduled node fixed the problem.

@judgeaxl could you provide more details? Log lines, cache dumps, etc. So we can determine whether the issue is the same.

As I noted in https://github.com/kubernetes/kubernetes/issues/91601#issuecomment-662746695, I believe this bug was present in older versions, but my thinking is that it's transient.

@maelk would you be able to investigate?

Please also share the distribution of nodes in the zones.

@alculquicondor unfortunately I can't at this point. Sorry.

@alculquicondor sorry, I already rebuilt the cluster for other reasons, but it may have been a network configuration problem related to multi-az deployments and the subnet in which the faulty node got launched, so I wouldn't worry about it for now in the context of this issue. If I notice it again I'll report back with better details. Thanks!

/retitle Some nodes are not considered in scheduling when there is zone imbalance
