Kubernetes: Some nodes are not considered in scheduling when there is zone imbalance

Created on 30 May 2020  ·  129 Comments  ·  Source: kubernetes/kubernetes

What happened: We upgraded 15 Kubernetes clusters from 1.17.5 to 1.18.2/1.18.3 and started to see that daemonsets no longer work properly.

The problem is that not all daemonset pods are provisioned. The pending pod gets the following error message in its events:

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  9s (x5 over 71s)  default-scheduler  0/13 nodes are available: 12 node(s) didn't match node selector.

However, all nodes are available and the daemonset does not have a node selector. The nodes do not have taints either.

daemonset https://gist.github.com/zetaab/4a605cb3e15e349934cb7db29ec72bd8

% kubectl get nodes
NAME                                   STATUS   ROLES    AGE   VERSION
e2etest-1-kaasprod-k8s-local           Ready    node     46h   v1.18.3
e2etest-2-kaasprod-k8s-local           Ready    node     46h   v1.18.3
e2etest-3-kaasprod-k8s-local           Ready    node     44h   v1.18.3
e2etest-4-kaasprod-k8s-local           Ready    node     44h   v1.18.3
master-zone-1-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
master-zone-2-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
master-zone-3-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
nodes-z1-1-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z1-2-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z2-1-kaasprod-k8s-local          Ready    node     46h   v1.18.3
nodes-z2-2-kaasprod-k8s-local          Ready    node     46h   v1.18.3
nodes-z3-1-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z3-2-kaasprod-k8s-local          Ready    node     46h   v1.18.3

% kubectl get pods -n weave -l weave-scope-component=agent -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP           NODE                                   NOMINATED NODE   READINESS GATES
weave-scope-agent-2drzw   1/1     Running   0          26h     10.1.32.23   e2etest-1-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-4kpxc   1/1     Running   3          26h     10.1.32.12   nodes-z1-2-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-78n7r   1/1     Running   0          26h     10.1.32.7    e2etest-4-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-9m4n8   1/1     Running   0          26h     10.1.96.4    master-zone-1-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-b2gnk   1/1     Running   1          26h     10.1.96.12   master-zone-3-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-blwtx   1/1     Running   2          26h     10.1.32.20   nodes-z1-1-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-cbhjg   1/1     Running   0          26h     10.1.64.15   e2etest-2-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-csp49   1/1     Running   0          26h     10.1.96.14   e2etest-3-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-g4k2x   1/1     Running   1          26h     10.1.64.10   nodes-z2-2-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-kx85h   1/1     Running   2          26h     10.1.96.6    nodes-z3-1-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-lllqc   0/1     Pending   0          5m56s   <none>       <none>                                 <none>           <none>
weave-scope-agent-nls2h   1/1     Running   0          26h     10.1.96.17   master-zone-2-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-p8njs   1/1     Running   2          26h     10.1.96.19   nodes-z3-2-kaasprod-k8s-local          <none>           <none>

I have tried restarting the apiservers, schedulers, and controller-managers, but it does not help. I have also tried restarting the single node that is stuck (nodes-z2-1-kaasprod-k8s-local), but that does not help either. Only deleting the node and recreating it helps.
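(To spell out the delete/recreate workaround, it is roughly the following; the assumption, not stated above, is that the kops instance group then provisions a replacement machine:)

kubectl delete node nodes-z2-1-kaasprod-k8s-local
openstack server delete <instance-id-of-that-node>   # illustrative; any way of terminating the instance works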

% kubectl describe node nodes-z2-1-kaasprod-k8s-local
Name:               nodes-z2-1-kaasprod-k8s-local
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=59cf4871-de1b-4294-9e9f-2ea7ca4b771f
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=zone-2
                    kops.k8s.io/instancegroup=nodes-z2
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=nodes-z2-1-kaasprod-k8s-local
                    kubernetes.io/os=linux
                    kubernetes.io/role=node
                    node-role.kubernetes.io/node=
                    node.kubernetes.io/instance-type=59cf4871-de1b-4294-9e9f-2ea7ca4b771f
                    topology.cinder.csi.openstack.org/zone=zone-2
                    topology.kubernetes.io/region=regionOne
                    topology.kubernetes.io/zone=zone-2
Annotations:        csi.volume.kubernetes.io/nodeid: {"cinder.csi.openstack.org":"faf14d22-010f-494a-9b34-888bdad1d2df"}
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.1.64.32/19
                    projectcalico.org/IPv4IPIPTunnelAddr: 100.98.136.0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 28 May 2020 13:28:24 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  nodes-z2-1-kaasprod-k8s-local
  AcquireTime:     <unset>
  RenewTime:       Sat, 30 May 2020 12:02:13 +0300
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 29 May 2020 09:40:51 +0300   Fri, 29 May 2020 09:40:51 +0300   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.1.64.32
  Hostname:    nodes-z2-1-kaasprod-k8s-local
Capacity:
  cpu:                4
  ephemeral-storage:  10287360Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8172420Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  9480830961
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8070020Ki
  pods:               110
System Info:
  Machine ID:                 c94284656ff04cf090852c1ddee7bcc2
  System UUID:                faf14d22-010f-494a-9b34-888bdad1d2df
  Boot ID:                    295dc3d9-0a90-49ee-92f3-9be45f2f8e3d
  Kernel Version:             4.19.0-8-cloud-amd64
  OS Image:                   Debian GNU/Linux 10 (buster)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.8
  Kubelet Version:            v1.18.3
  Kube-Proxy Version:         v1.18.3
PodCIDR:                      100.96.12.0/24
PodCIDRs:                     100.96.12.0/24
ProviderID:                   openstack:///faf14d22-010f-494a-9b34-888bdad1d2df
Non-terminated Pods:          (3 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-77pqs                           100m (2%)     200m (5%)   100Mi (1%)       100Mi (1%)     46h
  kube-system                 kube-proxy-nodes-z2-1-kaasprod-k8s-local    100m (2%)     200m (5%)   100Mi (1%)       100Mi (1%)     46h
  volume                      csi-cinder-nodeplugin-5jbvl                 100m (2%)     400m (10%)  200Mi (2%)       200Mi (2%)     46h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                300m (7%)   800m (20%)
  memory             400Mi (5%)  400Mi (5%)
  ephemeral-storage  0 (0%)      0 (0%)
Events:
  Type    Reason                   Age    From                                    Message
  ----    ------                   ----   ----                                    -------
  Normal  Starting                 7m27s  kubelet, nodes-z2-1-kaasprod-k8s-local  Starting kubelet.
  Normal  NodeHasSufficientMemory  7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Updated Node Allocatable limit across pods

We are seeing this randomly in all of our clusters.

What you expected to happen: I expect the daemonset to provision a pod on every node.

How to reproduce it (as minimally and precisely as possible): No idea really; install Kubernetes 1.18.x, deploy a daemonset, and then wait for days(?)

Anything else we need to know?: When this happens, we cannot provision any other daemonsets to that node either. As you can see, the logging fluent-bit pod is also missing. I cannot see any errors in that node's kubelet logs and, as said, restarting does not help.
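(For reference, the kubelet logs on these Debian/systemd nodes would typically be checked with something like the following; the exact unit name and time window are assumptions:)

journalctl -u kubelet --since "2 days ago" | grep -i error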

% kubectl get ds --all-namespaces
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
falco         falco-daemonset            13        13        12      13           12          <none>                            337d
kube-system   audit-webhook-deployment   3         3         3       3            3           node-role.kubernetes.io/master=   174d
kube-system   calico-node                13        13        13      13           13          kubernetes.io/os=linux            36d
kube-system   kops-controller            3         3         3       3            3           node-role.kubernetes.io/master=   193d
kube-system   metricbeat                 6         6         5       6            5           <none>                            35d
kube-system   openstack-cloud-provider   3         3         3       3            3           node-role.kubernetes.io/master=   337d
logging       fluent-bit                 13        13        12      13           12          <none>                            337d
monitoring    node-exporter              13        13        12      13           12          kubernetes.io/os=linux            58d
volume        csi-cinder-nodeplugin      6         6         6       6            6           <none>                            239d
weave         weave-scope-agent          13        13        12      13           12          <none>                            193d
weave         weavescope-iowait-plugin   6         6         5       6            5           <none>                            193d

As you can see, most of the daemonsets are missing one pod.

Environment:

  • Kubernetes version (use kubectl version): 1.18.3
  • Cloud provider or hardware configuration: openstack
  • OS (e.g: cat /etc/os-release): debian buster
  • Kernel (e.g. uname -a): Linux nodes-z2-1-kaasprod-k8s-local 4.19.0-8-cloud-amd64 #1 SMP Debian 4.19.98-1+deb10u1 (2020-04-27) x86_64 GNU/Linux
  • Install tools: kops
  • Network plugin and version (if this is a network-related bug): calico
  • Others:
help wanted  kind/bug  priority/important-soon  sig/scheduling

Most helpful comment

I am now working on adding a test case for the snapshot, to make sure this is properly tested.

All 129 comments

/sig scheduling

Can you provide the full yaml of the node, daemonset, an example pod, and the containing namespace as retrieved from the server?

DaemonSet pods schedule with a nodeAffinity selector that only matches a single node, so the "12 out of 13 didn't match" message is expected.
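For reference, the nodeAffinity the DaemonSet controller injects into each pod looks roughly like this (a sketch; the actual node name is per-pod, e.g. the stuck node from this report):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - nodes-z2-1-kaasprod-k8s-local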

I don't see a reason why the scheduler would be unhappy with the pod/node combo… there are no ports that could conflict in the pod spec, the node is not unschedulable or tainted, and it has sufficient resources.

Okay, I restarted all 3 schedulers (changed the log level to 4 to see if something interesting shows up there). However, it fixed the issue:

% kubectl get ds --all-namespaces
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
falco         falco-daemonset            13        13        13      13           13          <none>                            338d
kube-system   audit-webhook-deployment   3         3         3       3            3           node-role.kubernetes.io/master=   175d
kube-system   calico-node                13        13        13      13           13          kubernetes.io/os=linux            36d
kube-system   kops-controller            3         3         3       3            3           node-role.kubernetes.io/master=   194d
kube-system   metricbeat                 6         6         6       6            6           <none>                            36d
kube-system   openstack-cloud-provider   3         3         3       3            3           node-role.kubernetes.io/master=   338d
logging       fluent-bit                 13        13        13      13           13          <none>                            338d
monitoring    node-exporter              13        13        13      13           13          kubernetes.io/os=linux            59d
volume        csi-cinder-nodeplugin      6         6         6       6            6           <none>                            239d
weave         weave-scope-agent          13        13        13      13           13          <none>                            194d
weave         weavescope-iowait-plugin   6         6         6       6            6           <none>                            194d

Now all daemonsets are provisioned correctly. Weird; anyway, it seems something is wrong with the scheduler.

cc @kubernetes/sig-scheduling-bugs @ahg-g

We see the same issue on v1.18.3: one node cannot get daemonset pods scheduled onto it.
Restarting the scheduler helps.

[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get pod -A|grep Pending
kube-system   coredns-vc5ws                                                 0/1     Pending   0          2d16h
kube-system   local-volume-provisioner-mwk88                                0/1     Pending   0          2d16h
kube-system   svcwatcher-ltqb6                                              0/1     Pending   0          2d16h
ncms          bcmt-api-hfzl6                                                0/1     Pending   0          2d16h
ncms          bcmt-yum-repo-589d8bb756-5zbvh                                0/1     Pending   0          2d16h
[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get ds -A
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-system   coredns                    3         3         2       3            2           is_control=true                 2d16h
kube-system   danmep-cleaner             0         0         0       0            0           cbcs.nokia.com/danm_node=true   2d16h
kube-system   kube-proxy                 8         8         8       8            8           <none>                          2d16h
kube-system   local-volume-provisioner   8         8         7       8            7           <none>                          2d16h
kube-system   netwatcher                 0         0         0       0            0           cbcs.nokia.com/danm_node=true   2d16h
kube-system   sriov-device-plugin        0         0         0       0            0           sriov=enabled                   2d16h
kube-system   svcwatcher                 3         3         2       3            2           is_control=true                 2d16h
ncms          bcmt-api                   3         3         0       3            0           is_control=true                 2d16h
[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get node
NAME                                  STATUS   ROLES    AGE     VERSION
tesla-cb0434-csfp1-csfp1-control-01   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-control-02   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-control-03   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-edge-01      Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-edge-02      Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-01    Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-02    Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-03    Ready    <none>   2d16h   v1.18.3

Hard to debug without knowing how to reproduce. Do you have the scheduler logs, by any chance, for the pod that failed to schedule?
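(If the scheduler runs as a static pod in kube-system, something along these lines should pull its logs; the pod name and grep target below are illustrative:)

kubectl -n kube-system get pods | grep kube-scheduler
kubectl -n kube-system logs kube-scheduler-tesla-cb0434-csfp1-csfp1-control-01 | grep coredns-vc5ws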

Okay, I restarted all 3 schedulers

I assume only one of them is named default-scheduler, correct?

changed the log level to 4 to see if something interesting shows up there

Can you share what you noticed?

I set the log level to 9, but it seems there is nothing more interesting; the logs below are just looping.

I0601 01:45:05.039373       1 generic_scheduler.go:290] Preemption will not help schedule pod kube-system/coredns-vc5ws on any node.
I0601 01:45:05.039437       1 factory.go:462] Unable to schedule kube-system/coredns-vc5ws: no fit: 0/8 nodes are available: 7 node(s) didn't match node selector.; waiting
I0601 01:45:05.039494       1 scheduler.go:776] Updating pod condition for kube-system/coredns-vc5ws to (PodScheduled==False, Reason=Unschedulable)

Yeah, I could not see anything more than the same line:

no fit: 0/8 nodes are available: 7 node(s) didn't match node selector.; waiting

The strange thing is that the log message shows the result for only 7 nodes, like the issue reported in https://github.com/kubernetes/kubernetes/issues/91340

/cc @damemi

@ahg-g this does look like the same issue I reported there. If I had to guess, it seems like we either have a filter plugin that doesn't always report its error, or some other condition that is failing silently.

Note that in my issue, restarting the scheduler also fixed it (as mentioned in this thread too https://github.com/kubernetes/kubernetes/issues/91601#issuecomment-636360092)

Mine was also about a daemonset, so I think this is a duplicate. If that's the case we can close this and continue discussion in https://github.com/kubernetes/kubernetes/issues/91340

Anyway, the scheduler needs a more verbose logging option; it's impossible to debug these issues if there are no logs about what it is doing.

@zetaab +1, the scheduler could use significant improvements to its current logging abilities. That's an upgrade I've been meaning to tackle for a while and I've finally opened an issue for it here: https://github.com/kubernetes/kubernetes/issues/91633

/assign

I'm looking into this. A few questions to help me narrow down the case. I haven't been able to reproduce it yet.

  • What was created first: the daemonset or the node?
  • Are you using the default profile?
  • Do you have extenders?

The nodes were created before the daemonset.
I suppose we used the default profile; which profile do you mean and how do I check?
No extenders.

    command:
    - /usr/local/bin/kube-scheduler
    - --address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig
    - --profiling=false
    - --v=1

Another thing that may have an impact: the disk performance is not very good for etcd, and etcd complains about slow operations.

Yes, those flags make the scheduler run with the default profile. I'll continue looking; I still couldn't reproduce it.
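(For context: a non-default profile would only come from a KubeSchedulerConfiguration file passed via --config, which is absent here. A minimal sketch of such a file for the 1.18 v1alpha2 config API, with illustrative paths, would be:)

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/kube-scheduler.kubeconfig
profiles:
- schedulerName: default-scheduler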

Still nothing... Is there anything else you are using that you think could have an impact? Taints, ports, other resources?

I made some tests related to this. When the issue is present, pods can still be scheduled to the node (with no placement constraints, or with "nodeName").

If trying to use affinity/anti-affinity, the pods do not get scheduled to the node.

Working while the issue is present:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  nodeName: master-zone-3-1-1-test-cluster-k8s-local
  containers:
    - image: nginx
      name: nginx
      resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always

Not working at the same time:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - master-zone-3-1-1-test-cluster-k8s-local
  containers:
    - image: nginx
      name: nginx
      resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always

Also, when I checked the latter's events, they were quite interesting:

Warning  FailedScheduling  4m37s (x17 over 26m)  default-scheduler  0/9 nodes are available: 8 node(s) didn't match node selector.
Warning  FailedScheduling  97s (x6 over 3m39s)   default-scheduler  0/8 nodes are available: 8 node(s) didn't match node selector.
Warning  FailedScheduling  53s                   default-scheduler  0/8 nodes are available: 8 node(s) didn't match node selector.
Warning  FailedScheduling  7s (x5 over 32s)      default-scheduler  0/9 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 7 node(s) didn't match node selector.
  • The first event is from when the manifest had just been applied (nothing done to the non-schedulable node).
  • The second and third were from when the node was removed with kubectl and then restarted.
  • The fourth came when the node came back up. The node that had the issue was a master, so the pod was not going there anyway (but it shows that the node was not found in the 3 earlier events). The interesting thing about the fourth event is that information from one node is still missing: the event says 0/9 nodes are available, but reasons are given for only 8.

"nodeName" is not a selector. Using nodeName would bypass scheduling.

The fourth came when the node came back up. The node that had the issue was a master, so the pod was not going there anyway (but it shows that the node was not found in the 3 earlier events). The interesting thing about the fourth event is that information from one node is still missing: the event says 0/9 nodes are available, but reasons are given for only 8.

You are saying that the reason why the pod shouldn't have been scheduled in the missing node is because it was a master?

We are seeing 8 node(s) didn't match node selector going to 7. I assume no nodes were removed at this point, correct?

"nodeName" is not a selector. Using nodeName would bypass scheduling.

"NodeName" try was to highligh, that node is usable and pod gets there if wanted. So thing is not node's unability to start pods.

The fourth came when the node came back up. The node that had the issue was a master, so the pod was not going there anyway (but it shows that the node was not found in the 3 earlier events). The interesting thing about the fourth event is that information from one node is still missing: the event says 0/9 nodes are available, but reasons are given for only 8.

You are saying that the reason why the pod shouldn't have been scheduled in the missing node is because it was a master?

We are seeing 8 node(s) didn't match node selector going to 7. I assume no nodes were removed at this point, correct?

The test cluster has 9 nodes: 3 masters and 6 workers. Before the non-working node was successfully started, the events reported on all available nodes: "0/8 nodes are available: 8 node(s) didn't match node selector." But when the node that would match the node selector came up, the event said "0/9 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 7 node(s) didn't match node selector." The explanation says 8 nodes are not matching, but says nothing about the ninth (which was acknowledged in the previous event).

So the event timeline:

  • 1st event: 9 nodes available, the error noticed with the daemonset
  • 2nd and 3rd events: 8 nodes available; the one that was not receiving the pod was restarting
  • 4th event: 9 nodes available (so the node that was restarted came back up)

In the end the test pod wasn't started on the matching node because of taints, but that's another story (and should have been the case already at the 1st event).
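(For completeness, the test pod would have needed a toleration like the following sketch to land on a master; this was not part of the test:)

tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists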

"NodeName" try was to highligh, that node is usable and pod gets there if wanted. So thing is not node's unability to start pods.

Note that nothing but the scheduler guards against over-committing a node, so this doesn't really show much.

In the end the test pod wasn't started on the matching node because of taints, but that's another story (and should have been the case already at the 1st event).

My question is: was the 9th node tainted from the beginning? I'm trying to look for (1) reproducible steps to reach the state or (2) where the bug could be.

My question is: was the 9th node tainted from the beginning? I'm trying to look for (1) reproducible steps to reach the state or (2) where the bug could be.

Yes, the taint was there all the time in this case, as the non-receiving node was a master. But we have seen the same issue on both masters and workers.

Still no idea where the issue comes from, just that at least recreating the node and restarting the node seem to fix it. But those are rather "hard" ways to fix things.

Long shot, but if you run into it again... could you check whether any pods are nominated to the node that doesn't show up?
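(One way to check, using custom-columns output; pods without a nomination show <none>:)

kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,NOMINATED:.status.nominatedNodeName | grep -v '<none>'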

I'm posting questions as I think of possible scenarios:

  • Do you have other master nodes in your cluster?
  • Do you have extenders?
* Do you have other master nodes in your cluster?

All clusters have 3 masters (so restarting those is easy)

* Do you have extenders?

No.

One interesting thing I noticed today: I had a cluster where one master was not receiving its pod from a DaemonSet. We have ChaosMonkey in use, and it terminated one of the worker nodes. Interestingly, this made the pod go to the master that was not receiving it earlier. So somehow the removal of a node other than the problematic one seemed to fix the issue at that point.

Because of that "fix" I have to wait problem to reoccur to be able to answer about the nominated pods.

I'm confused now... Does your daemonset tolerate the taint for master nodes? In other words... is the bug for you just the scheduling event or also the fact that the pods should have been scheduled?

The issue is that the node is not found by the scheduler even though there is at least one matching affinity (or anti-affinity) setting.

That's why I said that the taint error is expected and should have been there already at the first event (as the taint is not part of the affinity criteria).

Understood. I was trying to confirm your setup to make sure I'm not missing something.

I don't think the node is "unseen" by the scheduler. Given that we see 0/9 nodes are available, we can conclude that the node is indeed in the cache. It's more like the unschedulable reason is lost somewhere, so we don't include it in the event.

True, the total count always matches the actual node count. It's just that the more descriptive event text is not given for all nodes, but that can be a separate issue as you mentioned.

Are you able to look at your kube-scheduler logs? Anything that seems relevant?

I think @zetaab tried to look for that without success. I can try when the issue occurs again (as well as the nominated pods thing asked about earlier).

If possible, also run 1.18.5, in case we inadvertently fixed the issue.

I am able to reproduce this reliably on my test cluster if you need any more logs

@dilyevsky Please share repro steps. Can you somehow identify what is the filter that is failing?

It appears to be just the metadata.name of the node for the ds pod... weird. Here's the pod yaml:

Pod yaml:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: "2020-07-09T23:17:53Z"
  generateName: cilium-
  labels:
    controller-revision-hash: 6c94db8bb8
    k8s-app: cilium
    pod-template-generation: "1"
  managedFields:
    # managed fields crap
  name: cilium-d5n4f
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: cilium
    uid: 0f00e8af-eb19-4985-a940-a02fa84fcbc5
  resourceVersion: "2840"
  selfLink: /api/v1/namespaces/kube-system/pods/cilium-d5n4f
  uid: e3f7d566-ee5b-4557-8d1b-f0964cde2f22
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - us-central1-dilyevsky-master-qmwnl
  containers:
  - args:
    - --config-dir=/tmp/cilium/config-map
    command:
    - cilium-agent
    env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: CILIUM_K8S_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: CILIUM_FLANNEL_MASTER_DEVICE
      valueFrom:
        configMapKeyRef:
          key: flannel-master-device
          name: cilium-config
          optional: true
    - name: CILIUM_FLANNEL_UNINSTALL_ON_EXIT
      valueFrom:
        configMapKeyRef:
          key: flannel-uninstall-on-exit
          name: cilium-config
          optional: true
    - name: CILIUM_CLUSTERMESH_CONFIG
      value: /var/lib/cilium/clustermesh/
    - name: CILIUM_CNI_CHAINING_MODE
      valueFrom:
        configMapKeyRef:
          key: cni-chaining-mode
          name: cilium-config
          optional: true
    - name: CILIUM_CUSTOM_CNI_CONF
      valueFrom:
        configMapKeyRef:
          key: custom-cni-conf
          name: cilium-config
          optional: true
    image: docker.io/cilium/cilium:v1.7.6
    imagePullPolicy: IfNotPresent
    lifecycle:
      postStart:
        exec:
          command:
          - /cni-install.sh
          - --enable-debug=false
      preStop:
        exec:
          command:
          - /cni-uninstall.sh
    livenessProbe:
      exec:
        command:
        - cilium
        - status
        - --brief
      failureThreshold: 10
      initialDelaySeconds: 120
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    name: cilium-agent
    readinessProbe:
      exec:
        command:
        - cilium
        - status
        - --brief
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - SYS_MODULE
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/cilium
      name: cilium-run
    - mountPath: /host/opt/cni/bin
      name: cni-path
    - mountPath: /host/etc/cni/net.d
      name: etc-cni-netd
    - mountPath: /var/lib/cilium/clustermesh
      name: clustermesh-secrets
      readOnly: true
    - mountPath: /tmp/cilium/config-map
      name: cilium-config-path
      readOnly: true
    - mountPath: /lib/modules
      name: lib-modules
      readOnly: true
    - mountPath: /run/xtables.lock
      name: xtables-lock
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: cilium-token-j74lr
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  initContainers:
  - command:
    - /init-container.sh
    env:
    - name: CILIUM_ALL_STATE
      valueFrom:
        configMapKeyRef:
          key: clean-cilium-state
          name: cilium-config
          optional: true
    - name: CILIUM_BPF_STATE
      valueFrom:
        configMapKeyRef:
          key: clean-cilium-bpf-state
          name: cilium-config
          optional: true
    - name: CILIUM_WAIT_BPF_MOUNT
      valueFrom:
        configMapKeyRef:
          key: wait-bpf-mount
          name: cilium-config
          optional: true
    image: docker.io/cilium/cilium:v1.7.6
    imagePullPolicy: IfNotPresent
    name: clean-cilium-state
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/cilium
      name: cilium-run
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: cilium-token-j74lr
      readOnly: true
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: cilium
  serviceAccountName: cilium
  terminationGracePeriodSeconds: 1
  tolerations:
  - operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/run/cilium
      type: DirectoryOrCreate
    name: cilium-run
  - hostPath:
      path: /opt/cni/bin
      type: DirectoryOrCreate
    name: cni-path
  - hostPath:
      path: /etc/cni/net.d
      type: DirectoryOrCreate
    name: etc-cni-netd
  - hostPath:
      path: /lib/modules
      type: ""
    name: lib-modules
  - hostPath:
      path: /run/xtables.lock
      type: FileOrCreate
    name: xtables-lock
  - name: clustermesh-secrets
    secret:
      defaultMode: 420
      optional: true
      secretName: cilium-clustermesh
  - configMap:
      defaultMode: 420
      name: cilium-config
    name: cilium-config-path
  - name: cilium-token-j74lr
    secret:
      defaultMode: 420
      secretName: cilium-token-j74lr
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-07-09T23:17:53Z"
    message: '0/6 nodes are available: 5 node(s) didn''t match node selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort

The way I reproduce this is by spinning up a new cluster with 3 masters and 3 worker nodes (using Cluster API) and applying Cilium 1.7.6:

Cilium yaml:

---
# Source: cilium/charts/agent/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cilium
  namespace: kube-system
---
# Source: cilium/charts/operator/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cilium-operator
  namespace: kube-system
---
# Source: cilium/charts/config/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:

  # Identity allocation mode selects how identities are shared between cilium
  # nodes by setting how they are stored. The options are "crd" or "kvstore".
  # - "crd" stores identities in kubernetes as CRDs (custom resource definition).
  #   These can be queried with:
  #     kubectl get ciliumid
  # - "kvstore" stores identities in a kvstore, etcd or consul, that is
  #   configured below. Cilium versions before 1.6 supported only the kvstore
  #   backend. Upgrades from these older cilium versions should continue using
  #   the kvstore by commenting out the identity-allocation-mode below, or
  #   setting it to "kvstore".
  identity-allocation-mode: crd

  # If you want to run cilium in debug mode change this value to true
  debug: "false"

  # Enable IPv4 addressing. If enabled, all endpoints are allocated an IPv4
  # address.
  enable-ipv4: "true"

  # Enable IPv6 addressing. If enabled, all endpoints are allocated an IPv6
  # address.
  enable-ipv6: "false"

  # If you want cilium monitor to aggregate tracing for packets, set this level
  # to "low", "medium", or "maximum". The higher the level, the less packets
  # that will be seen in monitor output.
  monitor-aggregation: medium

  # The monitor aggregation interval governs the typical time between monitor
  # notification events for each allowed connection.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-interval: 5s

  # The monitor aggregation flags determine which TCP flags which, upon the
  # first observation, cause monitor notifications to be generated.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-flags: all

  # ct-global-max-entries-* specifies the maximum number of connections
  # supported across all endpoints, split by protocol: tcp or other. One pair
  # of maps uses these values for IPv4 connections, and another pair of maps
  # use these values for IPv6 connections.
  #
  # If these values are modified, then during the next Cilium startup the
  # tracking of ongoing connections may be disrupted. This may lead to brief
  # policy drops or a change in loadbalancing decisions for a connection.
  #
  # For users upgrading from Cilium 1.2 or earlier, to minimize disruption
  # during the upgrade process, comment out these options.
  bpf-ct-global-tcp-max: "524288"
  bpf-ct-global-any-max: "262144"

  # bpf-policy-map-max specified the maximum number of entries in endpoint
  # policy map (per endpoint)
  bpf-policy-map-max: "16384"

  # Pre-allocation of map entries allows per-packet latency to be reduced, at
  # the expense of up-front memory allocation for the entries in the maps. The
  # default value below will minimize memory usage in the default installation;
  # users who are sensitive to latency may consider setting this to "true".
  #
  # This option was introduced in Cilium 1.4. Cilium 1.3 and earlier ignore
  # this option and behave as though it is set to "true".
  #
  # If this value is modified, then during the next Cilium startup the restore
  # of existing endpoints and tracking of ongoing connections may be disrupted.
  # This may lead to policy drops or a change in loadbalancing decisions for a
  # connection for some time. Endpoints may need to be recreated to restore
  # connectivity.
  #
  # If this option is set to "false" during an upgrade from 1.3 or earlier to
  # 1.4 or later, then it may cause one-time disruptions during the upgrade.
  preallocate-bpf-maps: "false"

  # Regular expression matching compatible Istio sidecar istio-proxy
  # container image names
  sidecar-istio-proxy-image: "cilium/istio_proxy"

  # Encapsulation mode for communication between nodes
  # Possible values:
  #   - disabled
  #   - vxlan (default)
  #   - geneve
  tunnel: vxlan

  # Name of the cluster. Only relevant when building a mesh of clusters.
  cluster-name: default

  # DNS Polling periodically issues a DNS lookup for each `matchName` from
  # cilium-agent. The result is used to regenerate endpoint policy.
  # DNS lookups are repeated with an interval of 5 seconds, and are made for
  # A(IPv4) and AAAA(IPv6) addresses. Should a lookup fail, the most recent IP
  # data is used instead. An IP change will trigger a regeneration of the Cilium
  # policy for each endpoint and increment the per cilium-agent policy
  # repository revision.
  #
  # This option is disabled by default starting from version 1.4.x in favor
  # of a more powerful DNS proxy-based implementation, see [0] for details.
  # Enable this option if you want to use FQDN policies but do not want to use
  # the DNS proxy.
  #
  # To ease upgrade, users may opt to set this option to "true".
  # Otherwise please refer to the Upgrade Guide [1] which explains how to
  # prepare policy rules for upgrade.
  #
  # [0] http://docs.cilium.io/en/stable/policy/language/#dns-based
  # [1] http://docs.cilium.io/en/stable/install/upgrade/#changes-that-may-require-action
  tofqdns-enable-poller: "false"

  # wait-bpf-mount makes init container wait until bpf filesystem is mounted
  wait-bpf-mount: "false"

  masquerade: "true"
  enable-xt-socket-fallback: "true"
  install-iptables-rules: "true"
  auto-direct-node-routes: "false"
  kube-proxy-replacement:  "probe"
  enable-host-reachable-services: "false"
  enable-external-ips: "false"
  enable-node-port: "false"
  node-port-bind-protection: "true"
  enable-auto-protect-node-port-range: "true"
  enable-endpoint-health-checking: "true"
  enable-well-known-identities: "false"
  enable-remote-node-identity: "true"
---
# Source: cilium/charts/agent/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium
rules:
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  - services
  - nodes
  - endpoints
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
  - get
  - list
  - watch
  - update
- apiGroups:
  - cilium.io
  resources:
  - ciliumnetworkpolicies
  - ciliumnetworkpolicies/status
  - ciliumclusterwidenetworkpolicies
  - ciliumclusterwidenetworkpolicies/status
  - ciliumendpoints
  - ciliumendpoints/status
  - ciliumnodes
  - ciliumnodes/status
  - ciliumidentities
  - ciliumidentities/status
  verbs:
  - '*'
---
# Source: cilium/charts/operator/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium-operator
rules:
- apiGroups:
  - ""
  resources:
  # to automatically delete [core|kube]dns pods so that are starting to being
  # managed by Cilium
  - pods
  verbs:
  - get
  - list
  - watch
  - delete
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  # to automatically read from k8s and import the node's pod CIDR to cilium's
  # etcd so all nodes know how to reach another pod running in in a different
  # node.
  - nodes
  # to perform the translation of a CNP that contains `ToGroup` to its endpoints
  - services
  - endpoints
  # to check apiserver connectivity
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - cilium.io
  resources:
  - ciliumnetworkpolicies
  - ciliumnetworkpolicies/status
  - ciliumclusterwidenetworkpolicies
  - ciliumclusterwidenetworkpolicies/status
  - ciliumendpoints
  - ciliumendpoints/status
  - ciliumnodes
  - ciliumnodes/status
  - ciliumidentities
  - ciliumidentities/status
  verbs:
  - '*'
---
# Source: cilium/charts/agent/templates/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cilium
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cilium
subjects:
- kind: ServiceAccount
  name: cilium
  namespace: kube-system
---
# Source: cilium/charts/operator/templates/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cilium-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cilium-operator
subjects:
- kind: ServiceAccount
  name: cilium-operator
  namespace: kube-system
---
# Source: cilium/charts/agent/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: cilium
  name: cilium
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: cilium
  template:
    metadata:
      annotations:
        # This annotation plus the CriticalAddonsOnly toleration makes
        # cilium to be a critical pod in the cluster, which ensures cilium
        # gets priority scheduling.
        # https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        k8s-app: cilium
    spec:
      containers:
      - args:
        - --config-dir=/tmp/cilium/config-map
        command:
        - cilium-agent
        livenessProbe:
          exec:
            command:
            - cilium
            - status
            - --brief
          failureThreshold: 10
          # The initial delay for the liveness probe is intentionally large to
          # avoid an endless kill & restart cycle if in the event that the initial
          # bootstrapping takes longer than expected.
          initialDelaySeconds: 120
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - cilium
            - status
            - --brief
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        env:
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CILIUM_K8S_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: CILIUM_FLANNEL_MASTER_DEVICE
          valueFrom:
            configMapKeyRef:
              key: flannel-master-device
              name: cilium-config
              optional: true
        - name: CILIUM_FLANNEL_UNINSTALL_ON_EXIT
          valueFrom:
            configMapKeyRef:
              key: flannel-uninstall-on-exit
              name: cilium-config
              optional: true
        - name: CILIUM_CLUSTERMESH_CONFIG
          value: /var/lib/cilium/clustermesh/
        - name: CILIUM_CNI_CHAINING_MODE
          valueFrom:
            configMapKeyRef:
              key: cni-chaining-mode
              name: cilium-config
              optional: true
        - name: CILIUM_CUSTOM_CNI_CONF
          valueFrom:
            configMapKeyRef:
              key: custom-cni-conf
              name: cilium-config
              optional: true
        image: "docker.io/cilium/cilium:v1.7.6"
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
              - "/cni-install.sh"
              - "--enable-debug=false"
          preStop:
            exec:
              command:
              - /cni-uninstall.sh
        name: cilium-agent
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - SYS_MODULE
          privileged: true
        volumeMounts:
        - mountPath: /var/run/cilium
          name: cilium-run
        - mountPath: /host/opt/cni/bin
          name: cni-path
        - mountPath: /host/etc/cni/net.d
          name: etc-cni-netd
        - mountPath: /var/lib/cilium/clustermesh
          name: clustermesh-secrets
          readOnly: true
        - mountPath: /tmp/cilium/config-map
          name: cilium-config-path
          readOnly: true
          # Needed to be able to load kernel modules
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
      hostNetwork: true
      initContainers:
      - command:
        - /init-container.sh
        env:
        - name: CILIUM_ALL_STATE
          valueFrom:
            configMapKeyRef:
              key: clean-cilium-state
              name: cilium-config
              optional: true
        - name: CILIUM_BPF_STATE
          valueFrom:
            configMapKeyRef:
              key: clean-cilium-bpf-state
              name: cilium-config
              optional: true
        - name: CILIUM_WAIT_BPF_MOUNT
          valueFrom:
            configMapKeyRef:
              key: wait-bpf-mount
              name: cilium-config
              optional: true
        image: "docker.io/cilium/cilium:v1.7.6"
        imagePullPolicy: IfNotPresent
        name: clean-cilium-state
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
          privileged: true
        volumeMounts:
        - mountPath: /var/run/cilium
          name: cilium-run
      restartPolicy: Always
      priorityClassName: system-node-critical
      serviceAccount: cilium
      serviceAccountName: cilium
      terminationGracePeriodSeconds: 1
      tolerations:
      - operator: Exists
      volumes:
        # To keep state between restarts / upgrades
      - hostPath:
          path: /var/run/cilium
          type: DirectoryOrCreate
        name: cilium-run
      # To install cilium cni plugin in the host
      - hostPath:
          path:  /opt/cni/bin
          type: DirectoryOrCreate
        name: cni-path
        # To install cilium cni configuration in the host
      - hostPath:
          path: /etc/cni/net.d
          type: DirectoryOrCreate
        name: etc-cni-netd
        # To be able to load kernel modules
      - hostPath:
          path: /lib/modules
        name: lib-modules
        # To access iptables concurrently with other processes (e.g. kube-proxy)
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
        # To read the clustermesh configuration
      - name: clustermesh-secrets
        secret:
          defaultMode: 420
          optional: true
          secretName: cilium-clustermesh
        # To read the configuration from the config map
      - configMap:
          name: cilium-config
        name: cilium-config-path
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2
    type: RollingUpdate
---
# Source: cilium/charts/operator/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    io.cilium/app: operator
    name: cilium-operator
  name: cilium-operator
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      io.cilium/app: operator
      name: cilium-operator
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
      labels:
        io.cilium/app: operator
        name: cilium-operator
    spec:
      containers:
      - args:
        - --debug=$(CILIUM_DEBUG)
        - --identity-allocation-mode=$(CILIUM_IDENTITY_ALLOCATION_MODE)
        - --synchronize-k8s-nodes=true
        command:
        - cilium-operator
        env:
        - name: CILIUM_K8S_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CILIUM_DEBUG
          valueFrom:
            configMapKeyRef:
              key: debug
              name: cilium-config
              optional: true
        - name: CILIUM_CLUSTER_NAME
          valueFrom:
            configMapKeyRef:
              key: cluster-name
              name: cilium-config
              optional: true
        - name: CILIUM_CLUSTER_ID
          valueFrom:
            configMapKeyRef:
              key: cluster-id
              name: cilium-config
              optional: true
        - name: CILIUM_IPAM
          valueFrom:
            configMapKeyRef:
              key: ipam
              name: cilium-config
              optional: true
        - name: CILIUM_DISABLE_ENDPOINT_CRD
          valueFrom:
            configMapKeyRef:
              key: disable-endpoint-crd
              name: cilium-config
              optional: true
        - name: CILIUM_KVSTORE
          valueFrom:
            configMapKeyRef:
              key: kvstore
              name: cilium-config
              optional: true
        - name: CILIUM_KVSTORE_OPT
          valueFrom:
            configMapKeyRef:
              key: kvstore-opt
              name: cilium-config
              optional: true
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              key: AWS_ACCESS_KEY_ID
              name: cilium-aws
              optional: true
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: AWS_SECRET_ACCESS_KEY
              name: cilium-aws
              optional: true
        - name: AWS_DEFAULT_REGION
          valueFrom:
            secretKeyRef:
              key: AWS_DEFAULT_REGION
              name: cilium-aws
              optional: true
        - name: CILIUM_IDENTITY_ALLOCATION_MODE
          valueFrom:
            configMapKeyRef:
              key: identity-allocation-mode
              name: cilium-config
              optional: true
        image: "docker.io/cilium/operator:v1.7.6"
        imagePullPolicy: IfNotPresent
        name: cilium-operator
        livenessProbe:
          httpGet:
            host: '127.0.0.1'
            path: /healthz
            port: 9234
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 3
      hostNetwork: true
      restartPolicy: Always
      serviceAccount: cilium-operator
      serviceAccountName: cilium-operator

Here's the scheduler log:
I0709 23:08:22.055830       1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0709 23:08:22.056081       1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0709 23:08:23.137451       1 serving.go:313] Generated self-signed cert in-memory
W0709 23:08:33.843509       1 authentication.go:297] Error looking up in-cluster authentication configuration: etcdserver: request timed out
W0709 23:08:33.843671       1 authentication.go:298] Continuing without authentication configuration. This may treat all requests as anonymous.
W0709 23:08:33.843710       1 authentication.go:299] To require authentication configuration lookup to succeed, set --authentication-tolerate-lookup-failure=false
I0709 23:08:33.911805       1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0709 23:08:33.911989       1 registry.go:150] Registering EvenPodsSpread predicate and priority function
W0709 23:08:33.917999       1 authorization.go:47] Authorization is disabled
W0709 23:08:33.918162       1 authentication.go:40] Authentication is disabled
I0709 23:08:33.918238       1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I0709 23:08:33.925860       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0709 23:08:33.926013       1 shared_informer.go:223] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0709 23:08:33.930685       1 secure_serving.go:178] Serving securely on 127.0.0.1:10259
I0709 23:08:33.936198       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0709 23:08:34.026382       1 shared_informer.go:230] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0709 23:08:34.036998       1 leaderelection.go:242] attempting to acquire leader lease  kube-system/kube-scheduler...
I0709 23:08:50.597201       1 leaderelection.go:252] successfully acquired lease kube-system/kube-scheduler
E0709 23:08:50.658551       1 factory.go:503] pod: kube-system/coredns-66bff467f8-9rjvd is already present in the active queue
E0709 23:12:27.673854       1 factory.go:503] pod kube-system/cilium-vv466 is already present in the backoff queue
E0709 23:12:58.099432       1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: leader changed

After restarting scheduler pods, the pending pod immediately schedules.

What pod events do you get? Do you know if there are taints on the node where it doesn't get scheduled? Does it only fail for master nodes or for any node? Is there enough space on the node?

On Thu., Jul. 9, 2020, 7:49 p.m., dilyevsky wrote:

It appears to be just the metadata.name of the node for the ds pod... weird. Here's the pod yaml:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: "2020-07-09T23:17:53Z"
  generateName: cilium-
  labels:
    controller-revision-hash: 6c94db8bb8
    k8s-app: cilium
    pod-template-generation: "1"
  managedFields:
  # managed fields crap
  name: cilium-d5n4f
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: cilium
    uid: 0f00e8af-eb19-4985-a940-a02fa84fcbc5
  resourceVersion: "2840"
  selfLink: /api/v1/namespaces/kube-system/pods/cilium-d5n4f
  uid: e3f7d566-ee5b-4557-8d1b-f0964cde2f22
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - us-central1-dilyevsky-master-qmwnl
  containers:
  - args:
    - --config-dir=/tmp/cilium/config-map
    command:
    - cilium-agent
    env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: CILIUM_K8S_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: CILIUM_FLANNEL_MASTER_DEVICE
      valueFrom:
        configMapKeyRef:
          key: flannel-master-device
          name: cilium-config
          optional: true
    - name: CILIUM_FLANNEL_UNINSTALL_ON_EXIT
      valueFrom:
        configMapKeyRef:
          key: flannel-uninstall-on-exit
          name: cilium-config
          optional: true
    - name: CILIUM_CLUSTERMESH_CONFIG
      value: /var/lib/cilium/clustermesh/
    - name: CILIUM_CNI_CHAINING_MODE
      valueFrom:
        configMapKeyRef:
          key: cni-chaining-mode
          name: cilium-config
          optional: true
    - name: CILIUM_CUSTOM_CNI_CONF
      valueFrom:
        configMapKeyRef:
          key: custom-cni-conf
          name: cilium-config
          optional: true
    image: docker.io/cilium/cilium:v1.7.6
    imagePullPolicy: IfNotPresent
    lifecycle:
      postStart:
        exec:
          command:
          - /cni-install.sh
          - --enable-debug=false
      preStop:
        exec:
          command:
          - /cni-uninstall.sh
    livenessProbe:
      exec:
        command:
        - cilium
        - status
        - --brief
      failureThreshold: 10
      initialDelaySeconds: 120
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    name: cilium-agent
    readinessProbe:
      exec:
        command:
        - cilium
        - status
        - --brief
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - SYS_MODULE
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/cilium
      name: cilium-run
    - mountPath: /host/opt/cni/bin
      name: cni-path
    - mountPath: /host/etc/cni/net.d
      name: etc-cni-netd
    - mountPath: /var/lib/cilium/clustermesh
      name: clustermesh-secrets
      readOnly: true
    - mountPath: /tmp/cilium/config-map
      name: cilium-config-path
      readOnly: true
    - mountPath: /lib/modules
      name: lib-modules
      readOnly: true
    - mountPath: /run/xtables.lock
      name: xtables-lock
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: cilium-token-j74lr
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  initContainers:
  - command:
    - /init-container.sh
    env:
    - name: CILIUM_ALL_STATE
      valueFrom:
        configMapKeyRef:
          key: clean-cilium-state
          name: cilium-config
          optional: true
    - name: CILIUM_BPF_STATE
      valueFrom:
        configMapKeyRef:
          key: clean-cilium-bpf-state
          name: cilium-config
          optional: true
    - name: CILIUM_WAIT_BPF_MOUNT
      valueFrom:
        configMapKeyRef:
          key: wait-bpf-mount
          name: cilium-config
          optional: true
    image: docker.io/cilium/cilium:v1.7.6
    imagePullPolicy: IfNotPresent
    name: clean-cilium-state
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/cilium
      name: cilium-run
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: cilium-token-j74lr
      readOnly: true
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: cilium
  serviceAccountName: cilium
  terminationGracePeriodSeconds: 1
  tolerations:
  - operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/run/cilium
      type: DirectoryOrCreate
    name: cilium-run
  - hostPath:
      path: /opt/cni/bin
      type: DirectoryOrCreate
    name: cni-path
  - hostPath:
      path: /etc/cni/net.d
      type: DirectoryOrCreate
    name: etc-cni-netd
  - hostPath:
      path: /lib/modules
      type: ""
    name: lib-modules
  - hostPath:
      path: /run/xtables.lock
      type: FileOrCreate
    name: xtables-lock
  - name: clustermesh-secrets
    secret:
      defaultMode: 420
      optional: true
      secretName: cilium-clustermesh
  - configMap:
      defaultMode: 420
      name: cilium-config
    name: cilium-config-path
  - name: cilium-token-j74lr
    secret:
      defaultMode: 420
      secretName: cilium-token-j74lr
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-07-09T23:17:53Z"
    message: '0/6 nodes are available: 5 node(s) didn''t match node selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort

The way I reproduce this is by spinning up a new cluster with 2 masters and 3 worker nodes (using Cluster API) and applying the Cilium 1.7.6 manifest shown above.



Could you try increasing the loglevel and using grep to filter for the node
or the pod?


These are events:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling        default-scheduler  0/6 nodes are available: 5 node(s) didn't match node selector.
  Warning  FailedScheduling        default-scheduler  0/6 nodes are available: 5 node(s) didn't match node selector.

The node only has two taints, but the pod tolerates all existing taints, and yeah, it seems to only happen on masters:

Taints: node-role.kubernetes.io/master:NoSchedule
        node.kubernetes.io/network-unavailable:NoSchedule

There is enough space and the pod is best effort with no reservation anyway:

  Resource                   Requests    Limits
  --------                   --------    ------
  cpu                        650m (32%)  0 (0%)
  memory                     70Mi (0%)   170Mi (2%)
  ephemeral-storage          0 (0%)      0 (0%)
  hugepages-1Gi              0 (0%)      0 (0%)
  hugepages-2Mi              0 (0%)      0 (0%)
  attachable-volumes-gce-pd  0           0

I'll try increasing the scheduler log level now...

Your pod yaml doesn't actually have a node-role.kubernetes.io/master toleration, so it shouldn't have been scheduled on the master.

Hi! We are hitting the same issue. However, we also see the problem with deployments, where we use anti-affinity to make sure a pod gets scheduled on each node, or a node selector targeting a specific node.
Simply creating a pod with a node selector set to match the hostname of the failing node was sufficient to make scheduling fail. It said that 5 nodes did not match the selector, but nothing about the sixth. Restarting the scheduler solved the issue. It looks like something gets cached about that node and prevents scheduling on it.
As other people said before, there is nothing in the logs about the failure.

We stripped the failing deployment down to the bare minimum (we had removed the taint on the failing master):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
      restartPolicy: Always
      schedulerName: default-scheduler
      nodeSelector:
        kubernetes.io/hostname: master-2

We were having the same issue when the master had a taint and the deployment had a toleration for it, so it does not seem related to daemonsets, tolerations, or affinity / anti-affinity specifically. Once the failure starts happening, nothing that targets the specific node can be scheduled. We see the issue from 1.18.2 up to 1.18.5 (we did not try 1.18.0 or 1.18.1).

Simply creating a pod with a node selector set to match the hostname of the failing node was sufficient to cause the scheduling to fail

Could you clarify whether it started failing after you created such a pod, or before? I assume the node didn't have a taint that the pod didn't tolerate.

@nodo is going to help reproduce this. Could you look at the code for NodeSelector? You might need to add extra log lines while testing. You can also print the cache.

  • Get PID of kube-scheduler: $ pidof kube-scheduler
  • Trigger queue dump: $ sudo kill -SIGUSR2 <pid>. Note this won't kill the scheduler process.
  • Then in scheduler log, search for strings "Dump of cached NodeInfo", "Dump of scheduling queue" and "cache comparer started".

/priority critical-urgent

/unassign

We were already seeing some daemonsets and deployments stuck in "Pending" before we tried deploying this test deployment, so it was already failing, and the taints had been removed from the node.
Right now we have lost the environment where this was happening because we had to reboot the nodes, so the issue is not visible anymore. As soon as we reproduce it, we will come back with more info.

Please do so. I have tried to reproduce this in the past without success. I'm more interested in the first instance of failure. It might still be related to taints.

We have reproduced the issue. I ran the command you asked for; here is the info:

I0716 14:47:52.768362       1 factory.go:462] Unable to schedule default/test-deployment-558f47bbbb-4rt5t: no fit: 0/6 nodes are available: 5 node(s) didn't match node selector.; waiting
I0716 14:47:52.768683       1 scheduler.go:776] Updating pod condition for default/test-deployment-558f47bbbb-4rt5t to (PodScheduled==False, Reason=Unschedulable)
I0716 14:47:53.018781       1 httplog.go:90] verb="GET" URI="/healthz" latency=299.172µs resp=200 UserAgent="kube-probe/1.18" srcIP="127.0.0.1:57258": 
I0716 14:47:59.469828       1 comparer.go:42] cache comparer started
I0716 14:47:59.470936       1 comparer.go:67] cache comparer finished
I0716 14:47:59.471038       1 dumper.go:47] Dump of cached NodeInfo
I0716 14:47:59.471484       1 dumper.go:49] 
Node name: master-0-bug
Requested Resources: {MilliCPU:1100 Memory:52428800 EphemeralStorage:0 AllowedPodNumber:0 ScalarResources:map[]}
Allocatable Resources:{MilliCPU:2000 Memory:3033427968 EphemeralStorage:19290208634 AllowedPodNumber:110 ScalarResources:map[hugepages-1Gi:0 hugepages-2Mi:0]}
Scheduled Pods(number: 9):
...
I0716 14:47:59.472623       1 dumper.go:60] Dump of scheduling queue:
name: coredns-cd64c8d7c-29zjq, namespace: kube-system, uid: 938e8827-5d17-4db9-ac04-d229baf4534a, phase: Pending, nominated node: 
name: test-deployment-558f47bbbb-4rt5t, namespace: default, uid: fa19fda9-c8d6-4ffe-b248-8ddd24ed5310, phase: Pending, nominated node: 

Unfortunately that does not seem to help

Dumping the cache is for debugging, it won't change anything. Could you please include the dump?

Also, assuming this was the first error, could you include the pod yaml and node?

That's pretty much everything that was dumped; I just removed the other nodes. This was not the first error, but you can see the coredns pod in the dump, which was the first one. I am not sure what else you are asking for in the dump.
I'll fetch the yamls.

Thanks, I didn't realize that you had trimmed the relevant node and pod.

Could you include the scheduled pods for that node though? Just in case there is a bug in resource usage calculations.

Requested Resources: {MilliCPU:1100 Memory:52428800 EphemeralStorage:0 AllowedPodNumber:0 ScalarResources:map[]}

That AllowedPodNumber: 0 seems odd.

Here are the other pods on that node:

name: kube-controller-manager-master-0-bug, namespace: kube-system, uid: 095eebb0-4752-419b-aac7-245e5bc436b8, phase: Running, nominated node:
name: kube-proxy-xwf6h, namespace: kube-system, uid: 16552eaf-9eb8-4584-ba3c-7dff6ce92592, phase: Running, nominated node:
name: kube-apiserver-master-0-bug, namespace: kube-system, uid: 1d338e26-b0bc-4cef-9bad-86b7dd2b2385, phase: Running, nominated node:
name: kube-multus-ds-amd64-tpkm8, namespace: kube-system, uid: d50c0c7f-599c-41d5-a029-b43352a4f5b8, phase: Running, nominated node:
name: openstack-cloud-controller-manager-wrb8n, namespace: kube-system, uid: 17aeb589-84a1-4416-a701-db6d8ef60591, phase: Running, nominated node:
name: kube-scheduler-master-0-bug, namespace: kube-system, uid: 52469084-3122-4e99-92f6-453e512b640f, phase: Running, nominated node:
name: subport-controller-28j9v, namespace: kube-system, uid: a5a07ac8-763a-4ff2-bdae-91c6e9e95698, phase: Running, nominated node:
name: csi-cinder-controllerplugin-0, namespace: kube-system, uid: 8b16d6c8-a871-454e-98a3-0aa545f9c9d0, phase: Running, nominated node:
name: calico-node-d899t, namespace: kube-system, uid: e3672030-53b1-4356-a5df-0f4afd6b9237, phase: Running, nominated node:

All the nodes have allowedPodNumber set to 0 in the requested resources in the dump, but the other nodes are schedulable

The node yaml:

apiVersion: v1
kind: Node
metadata:
  annotations:
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-07-16T09:59:48Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: 54019dbc-10d7-409c-8338-5556f61a9371
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: regionOne
    failure-domain.beta.kubernetes.io/zone: nova
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: master-0-bug
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
    node.kubernetes.io/instance-type: 54019dbc-10d7-409c-8338-5556f61a9371
    node.uuid: 00324054-405e-4fae-a3bf-d8509d511ded
    node.uuid_source: cloud-init
    topology.kubernetes.io/region: regionOne
    topology.kubernetes.io/zone: nova
  name: master-0-bug
  resourceVersion: "85697"
  selfLink: /api/v1/nodes/master-0-bug
  uid: 629b6ef3-3c76-455b-8b6b-196c4754fb0e
spec:
  podCIDR: 192.168.0.0/24
  podCIDRs:
  - 192.168.0.0/24
  providerID: openstack:///00324054-405e-4fae-a3bf-d8509d511ded
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
status:
  addresses:
  - address: 10.0.10.14
    type: InternalIP
  - address: master-0-bug
    type: Hostname
  allocatable:
    cpu: "2"
    ephemeral-storage: "19290208634"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 2962332Ki
    pods: "110"
  capacity:
    cpu: "2"
    ephemeral-storage: 20931216Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 3064732Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2020-07-16T10:02:20Z"
    lastTransitionTime: "2020-07-16T10:02:20Z"
    message: Calico is running on this node
    reason: CalicoIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2020-07-16T15:46:11Z"
    lastTransitionTime: "2020-07-16T09:59:43Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2020-07-16T15:46:11Z"
    lastTransitionTime: "2020-07-16T09:59:43Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2020-07-16T15:46:11Z"
    lastTransitionTime: "2020-07-16T09:59:43Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2020-07-16T15:46:11Z"
    lastTransitionTime: "2020-07-16T10:19:44Z"
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  nodeInfo:
    architecture: amd64
    bootID: fe410ed3-2825-4f94-a9f9-08dc5e6a955e
    containerRuntimeVersion: docker://19.3.11
    kernelVersion: 4.12.14-197.45-default
    kubeProxyVersion: v1.18.5
    kubeletVersion: v1.18.5
    machineID: 00324054405e4faea3bfd8509d511ded
    operatingSystem: linux
    systemUUID: 00324054-405e-4fae-a3bf-d8509d511ded

And the pod:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-07-16T10:13:35Z"
  generateName: pm-node-exporter-
  labels:
    controller-revision-hash: 6466d9c7b
    pod-template-generation: "1"
  name: pm-node-exporter-mn9vj
  namespace: monitoring
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: pm-node-exporter
    uid: 5855a26f-a57e-4b0e-93f2-461c19c477e1
  resourceVersion: "5239"
  selfLink: /api/v1/namespaces/monitoring/pods/pm-node-exporter-mn9vj
  uid: 0db09c9c-1618-4454-94fa-138e55e5ebd7
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - master-0-bug
  containers:
  - args:
    - --path.procfs=/host/proc
    - --path.sysfs=/host/sys
    image: ***
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: 9100
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    name: pm-node-exporter
    ports:
    - containerPort: 9100
      hostPort: 9100
      name: metrics
      protocol: TCP
    resources:
      limits:
        cpu: 200m
        memory: 150Mi
      requests:
        cpu: 100m
        memory: 100Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host/proc
      name: proc
      readOnly: true
    - mountPath: /host/sys
      name: sys
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: pm-node-exporter-token-csllf
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  hostPID: true
  nodeSelector:
    node-role.kubernetes.io/master: ""
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: pm-node-exporter
  serviceAccountName: pm-node-exporter
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /proc
      type: ""
    name: proc
  - hostPath:
      path: /sys
      type: ""
    name: sys
  - name: pm-node-exporter-token-csllf
    secret:
      defaultMode: 420
      secretName: pm-node-exporter-token-csllf
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-07-16T10:13:35Z"
    message: '0/6 nodes are available: 2 node(s) didn''t have free ports for the requested
      pod ports, 3 node(s) didn''t match node selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

Thanks a lot for all the information. @nodo can you take it?

/help

@maelk feel free to take this and submit a PR if you find the bug. The log lines you added are likely to be helpful. Otherwise, I'm opening it up to contributors.

@alculquicondor:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


/assign

@maelk Is there anything specific to timing when this issue occurs first time? For example, does it happen right after node starts?

No, there are quite a few pods that get scheduled there and run fine. But once the issue happens, none can be scheduled there anymore.

Lowering priority until we have a reproducible case.

We were able to reproduce the bug with a scheduler that had the additional log entries. What we see is that one of the masters completely disappears from the list of nodes that are iterated through. We can see that the process starts with the 6 nodes (from the snapshot):

I0720 13:58:28.246507       1 generic_scheduler.go:441] Looking for a node for kube-system/coredns-cd64c8d7c-tcxbq, going through []*nodeinfo.NodeInfo{(*nodeinfo.NodeInfo)(0xc000326a90), (*nodeinfo.NodeInfo)(0xc000952000), (*nodeinfo.NodeInfo)(0xc0007d08f0), (*nodeinfo.NodeInfo)(0xc0004f35f0), (*nodeinfo.NodeInfo)(0xc000607040), (*nodeinfo.NodeInfo)(0xc000952000)}

but after that, we can see that it iterates over only 5 nodes, and then we get:

I0720 13:58:28.247420       1 generic_scheduler.go:505] pod kube-system/coredns-cd64c8d7c-tcxbq : processed 5 nodes, 0 fit

So one of the nodes is removed from the list of potential nodes. Unfortunately we did not have enough logging at the start of the process, but we'll try to get more.

Code references by log line:

  1. https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R441
  2. https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R505

@maelk
Did you see any lines for %v/%v on node %v, too many nodes fit?

Otherwise, @pancernik could you check for bugs on workqueue.ParallelizeUntil(ctx, 16, len(allNodes), checkNode)?

No, that log did not appear. I would also think it could be that we either have an issue with the parallelization or that the node is filtered out earlier. If it was failing with an error here: https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R464 it would be visible in the logs afaik, so I'll try to add more debug entries specifically around that function and the parallelization.
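For what it's worth, ParallelizeUntil just fans out indices into whatever slice it is handed; it has no notion of node identity. A minimal Go sketch (a simplified stand-in, not the real client-go implementation, with made-up node names) of that index-based fan-out:

package main

import (
	"fmt"
	"sync"
)

// parallelizeUntil is a simplified stand-in for client-go's
// workqueue.ParallelizeUntil: it hands out each index in [0, pieces)
// to a pool of workers exactly once.
func parallelizeUntil(workers, pieces int, doWorkPiece func(int)) {
	toProcess := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		toProcess <- i
	}
	close(toProcess)

	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for piece := range toProcess {
				doWorkPiece(piece)
			}
		}()
	}
	wg.Wait()
}

func main() {
	// Hypothetical snapshot list: "node-b" appears twice and "node-f" is
	// missing, mirroring the duplicated NodeInfo pointer seen in the log.
	nodes := []string{"node-a", "node-b", "node-c", "node-d", "node-e", "node-b"}

	var mu sync.Mutex
	seen := map[string]int{}
	parallelizeUntil(16, len(nodes), func(i int) {
		// The real checkNode would run the filter plugins here.
		mu.Lock()
		seen[nodes[i]]++
		mu.Unlock()
	})
	fmt.Println(seen) // node-b is filtered twice; node-f is never looked at
}

If the snapshot's node list already contains one NodeInfo twice and is missing another, every index is still processed exactly once, so a duplicated filter run would point at the list itself rather than at the parallelization.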

I just realized that one node is going through the filtering twice!

The logs are:

I0720 13:58:28.246507       1 generic_scheduler.go:441] Looking for a node for kube-system/coredns-cd64c8d7c-tcxbq, going through []*nodeinfo.NodeInfo{(*nodeinfo.NodeInfo)(0xc000326a90), (*nodeinfo.NodeInfo)(0xc000952000), (*nodeinfo.NodeInfo)(0xc0007d08f0), (*nodeinfo.NodeInfo)(0xc0004f35f0), (*nodeinfo.NodeInfo)(0xc000607040), (*nodeinfo.NodeInfo)(0xc000952000)}
I0720 13:58:28.246793       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-60846k0y-scheduler, fits: false, status: &v1alpha1.Status{code:3, reasons:[]string{"node(s) didn't match node selector"}}
I0720 13:58:28.246970       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-60846k0y-scheduler : status is not success
I0720 13:58:28.246819       1 taint_toleration.go:71] Checking taints for pod kube-system/coredns-cd64c8d7c-tcxbq for node master-0-scheduler : taints : []v1.Taint{v1.Taint{Key:"node-role.kubernetes.io/master", Value:"", Effect:"NoSchedule", TimeAdded:(*v1.Time)(nil)}} and tolerations: []v1.Toleration{v1.Toleration{Key:"node-role.kubernetes.io/master", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"CriticalAddonsOnly", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node-role.kubernetes.io/master", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node-role.kubernetes.io/not-ready", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node.kubernetes.io/not-ready", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc000d40d90)}, v1.Toleration{Key:"node.kubernetes.io/unreachable", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc000d40db0)}}
I0720 13:58:28.247019       1 taint_toleration.go:71] Checking taints for pod kube-system/coredns-cd64c8d7c-tcxbq for node master-2-scheduler : taints : []v1.Taint{v1.Taint{Key:"node-role.kubernetes.io/master", Value:"", Effect:"NoSchedule", TimeAdded:(*v1.Time)(nil)}} and tolerations: []v1.Toleration{v1.Toleration{Key:"node-role.kubernetes.io/master", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"CriticalAddonsOnly", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node-role.kubernetes.io/master", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node-role.kubernetes.io/not-ready", Operator:"Exists", Value:"", Effect:"NoSchedule", TolerationSeconds:(*int64)(nil)}, v1.Toleration{Key:"node.kubernetes.io/not-ready", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc000d40d90)}, v1.Toleration{Key:"node.kubernetes.io/unreachable", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc000d40db0)}}
I0720 13:58:28.247144       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node master-2-scheduler, fits: false, status: &v1alpha1.Status{code:2, reasons:[]string{"node(s) didn't match pod affinity/anti-affinity", "node(s) didn't satisfy existing pods anti-affinity rules"}}
I0720 13:58:28.247172       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node master-2-scheduler : status is not success
I0720 13:58:28.247210       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-7dt1xd4k-scheduler, fits: false, status: &v1alpha1.Status{code:3, reasons:[]string{"node(s) didn't match node selector"}}
I0720 13:58:28.247231       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-7dt1xd4k-scheduler : status is not success
I0720 13:58:28.247206       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-60846k0y-scheduler, fits: false, status: &v1alpha1.Status{code:3, reasons:[]string{"node(s) didn't match node selector"}}
I0720 13:58:28.247297       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-60846k0y-scheduler : status is not success
I0720 13:58:28.247246       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-hyk0hg7r-scheduler, fits: false, status: &v1alpha1.Status{code:3, reasons:[]string{"node(s) didn't match node selector"}}
I0720 13:58:28.247340       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node worker-pool1-hyk0hg7r-scheduler : status is not success
I0720 13:58:28.247147       1 generic_scheduler.go:469] pod kube-system/coredns-cd64c8d7c-tcxbq on node master-0-scheduler, fits: false, status: &v1alpha1.Status{code:2, reasons:[]string{"node(s) didn't match pod affinity/anti-affinity", "node(s) didn't satisfy existing pods anti-affinity rules"}}
I0720 13:58:28.247375       1 generic_scheduler.go:483] pod kube-system/coredns-cd64c8d7c-tcxbq on node master-0-scheduler : status is not success
I0720 13:58:28.247420       1 generic_scheduler.go:505] pod kube-system/coredns-cd64c8d7c-tcxbq : processed 5 nodes, 0 fit
I0720 13:58:28.247461       1 generic_scheduler.go:430] pod kube-system/coredns-cd64c8d7c-tcxbq After scheduling, filtered: []*v1.Node{}, filtered nodes: v1alpha1.NodeToStatusMap{"master-0-scheduler":(*v1alpha1.Status)(0xc000d824a0), "master-2-scheduler":(*v1alpha1.Status)(0xc000b736c0), "worker-pool1-60846k0y-scheduler":(*v1alpha1.Status)(0xc000d825a0), "worker-pool1-7dt1xd4k-scheduler":(*v1alpha1.Status)(0xc000b737e0), "worker-pool1-hyk0hg7r-scheduler":(*v1alpha1.Status)(0xc000b738c0)}
I0720 13:58:28.247527       1 generic_scheduler.go:185] Pod kube-system/coredns-cd64c8d7c-tcxbq failed scheduling:
  nodes snapshot: &cache.Snapshot{nodeInfoMap:map[string]*nodeinfo.NodeInfo{"master-0-scheduler":(*nodeinfo.NodeInfo)(0xc000607040), "master-1-scheduler":(*nodeinfo.NodeInfo)(0xc0001071e0), "master-2-scheduler":(*nodeinfo.NodeInfo)(0xc000326a90), "worker-pool1-60846k0y-scheduler":(*nodeinfo.NodeInfo)(0xc000952000), "worker-pool1-7dt1xd4k-scheduler":(*nodeinfo.NodeInfo)(0xc0007d08f0), "worker-pool1-hyk0hg7r-scheduler":(*nodeinfo.NodeInfo)(0xc0004f35f0)}, nodeInfoList:[]*nodeinfo.NodeInfo{(*nodeinfo.NodeInfo)(0xc000326a90), (*nodeinfo.NodeInfo)(0xc000952000), (*nodeinfo.NodeInfo)(0xc0007d08f0), (*nodeinfo.NodeInfo)(0xc0004f35f0), (*nodeinfo.NodeInfo)(0xc000607040), (*nodeinfo.NodeInfo)(0xc000952000)}, havePodsWithAffinityNodeInfoList:[]*nodeinfo.NodeInfo{(*nodeinfo.NodeInfo)(0xc000326a90), (*nodeinfo.NodeInfo)(0xc000607040)}, generation:857} 
  statuses: v1alpha1.NodeToStatusMap{"master-0-scheduler":(*v1alpha1.Status)(0xc000d824a0), "master-2-scheduler":(*v1alpha1.Status)(0xc000b736c0), "worker-pool1-60846k0y-scheduler":(*v1alpha1.Status)(0xc000d825a0), "worker-pool1-7dt1xd4k-scheduler":(*v1alpha1.Status)(0xc000b737e0), "worker-pool1-hyk0hg7r-scheduler":(*v1alpha1.Status)(0xc000b738c0)} 

As you can see, the node worker-pool1-60846k0y-scheduler goes through filtering twice.

No, that log did not appear. I would also think it could be that we either have an issue with the parallelization or that the node is filtered out earlier. If it was failing with an error here: Nordix@5c00cdf#diff-c237cdd9e4cb201118ca380732d7f361R464 it would be visible in the logs afaik, so I'll try to add more debug entries specifically around that function and the parallelization.

Yeah, an error there would manifest as a Scheduling Error in the pod events.

I just realized that one node is going through the filtering twice!

I honestly wouldn't think that the parallelization has bugs (still worth checking), but this could be a sign that we failed to build the snapshot from the cache by adding a node twice (as we saw from the cache dump, the cache itself is correct). Since statuses is a map, it makes sense that we only "see" 5 nodes in the last log line.
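To spell out the "statuses is a map" point, here is a tiny Go sketch (illustrative names only) showing how iterating 6 list entries that contain a duplicate still yields only 5 recorded statuses:

package main

import "fmt"

func main() {
	// Hypothetical snapshot list: 6 entries, but "worker-a" is in it twice
	// and one real node is missing entirely.
	snapshotList := []string{"master-0", "master-2", "worker-a", "worker-b", "worker-c", "worker-a"}

	statuses := map[string]string{}
	for _, node := range snapshotList {
		statuses[node] = "Unschedulable" // one entry per unique node name
	}
	fmt.Println(len(snapshotList), "entries iterated, but only", len(statuses), "statuses recorded")
	// Output: 6 entries iterated, but only 5 statuses recorded
}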

This is the code (tip of 1.18) https://github.com/kubernetes/kubernetes/blob/ec73e191f47b7992c2f40fadf1389446d6661d6d/pkg/scheduler/internal/cache/cache.go#L203

cc @ahg-g

I will try to add a lot of logs around the cache part of the scheduler, specifically around node addition and update, and around the snapshot. However, from the last line of the logs, you can see that the snapshot is actually correct and contains all the nodes, so whatever happens seems to happen later on, when working over that snapshot.

cache != snapshot

Cache is the living thing that gets updated from events. The snapshot is updated (from the cache) before each scheduling cycle to "lock" the state. We added optimizations to make this last process as fast as possible. It's possible that the bug is there.

Thanks @maelk! This is very useful. Your logs indicate that (*nodeinfo.NodeInfo)(0xc000952000) is duplicated in the list already at https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R441 before any parallel code gets executed. That would indeed mean that it's duplicated before the snapshot is updated.

Actually, that comes from the snapshot, which is taken before this log message: https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R161. So it rather looks like the content of the snapshot has the duplication, since it comes from https://github.com/Nordix/kubernetes/commit/5c00cdf195fa61316f963f59e73c6cafc2ad9bdc#diff-c237cdd9e4cb201118ca380732d7f361R436

That's right. I meant it's already duplicated before the update of the snapshot finishes.

That's right. I meant it's already duplicated before the update of the snapshot finishes.

No, the snapshot is updated at the start of the scheduling cycle. The bug is either during the snapshot update or before that. But the cache is correct, according to the dump in https://github.com/kubernetes/kubernetes/issues/91601#issuecomment-659465008

EDIT: I read it wrong, I didn't see the word "finishes" :)

I wonder if the node tree also has duplicate records

@maelk could you show a dump of the full list of nodes in the cache?

We don't add or remove items from NodeInfoList; we either create the full list from the tree or not. So if there are duplicates, they are likely coming from the tree, I think.

Just to clarify:
1) the cluster has 6 nodes (including the masters)
2) the node that is supposed to host the pod wasn't examined at all (no log line indicating that), which may mean it is not in NodeInfoList at all
3) NodeInfoList has 6 nodes, but one of them is a duplicate

I wonder if the node tree also has duplicate records

@maelk could you show a dump of the full list of nodes in the cache?

a dump of each of the node tree, list, and map would be great.

I'll work on getting those. In the meantime, a small update. We can see in the logs :

I0720 13:37:30.530980       1 node_tree.go:100] Removed node "worker-pool1-60846k0y-scheduler" in group "" from NodeTree
I0720 13:37:30.531136       1 node_tree.go:86] Added node "worker-pool1-60846k0y-scheduler" in group "regionOne:\x00:nova" to NodeTree

And that is the exact point when the missing node disappears. The last occurrence in the logs is at 13:37:24. In the next scheduling cycle, the missing node is gone. So it looks like the bug is in, or follows, the update of the node_tree. All nodes go through that update; it's just that this worker-pool1-60846k0y node is the last one to go through it.

When dumping the cache (with SIGUSR2) all six nodes are listed there, with the pods running on the nodes, without duplication or missing nodes.

We'll give it a new try with added debug around the snapshot functionality : https://github.com/Nordix/kubernetes/commit/53279fb06536558f9a91836c771b182791153791

Removed node "worker-pool1-60846k0y-scheduler" in group "" from NodeTree

Interesting. I think the remove/add are triggered by an updateNode call. The zone key is missing on the remove but exists on the add, so the update was basically adding the zone and region labels?

Do you have other scheduler logs related to this node?

We're trying to reproduce the bug with the added logging. I'll come back when I have more info

I'll work on getting those. In the meantime, a small update. We can see in the logs :

I0720 13:37:30.530980       1 node_tree.go:100] Removed node "worker-pool1-60846k0y-scheduler" in group "" from NodeTree
I0720 13:37:30.531136       1 node_tree.go:86] Added node "worker-pool1-60846k0y-scheduler" in group "regionOne:\x00:nova" to NodeTree

I'll point out that this is the node that gets repeated. @maelk, did you see similar messages for other nodes, or not at all? As @ahg-g said, this should be expected when a node receives its topology labels for the first time.

Yes, it happened for all nodes, and it is expected. The coincidence is that this node specifically is the last one to be updated, and it is at that exact time that the other node goes missing.
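
For reference on the log format above: the "group" string is a region/zone key built from the node's topology labels, which is why it is empty ("") before the labels are applied and becomes "regionOne:\x00:nova" afterwards. Below is a rough sketch of how such a key can be derived; it is illustrative only, and the exact label keys and helper used upstream may differ from the assumptions made here.

package main

import "fmt"

// zoneKey is an illustrative stand-in for the helper that groups nodes in the
// node tree. The label keys below are assumptions made for this demo.
func zoneKey(labels map[string]string) string {
    region := labels["failure-domain.beta.kubernetes.io/region"]
    zone := labels["failure-domain.beta.kubernetes.io/zone"]
    if region == "" && zone == "" {
        return "" // a node without topology labels lands in the "" group
    }
    return region + ":\x00:" + zone // same shape as the "regionOne:\x00:nova" seen in the logs
}

func main() {
    fmt.Printf("%q\n", zoneKey(map[string]string{})) // ""
    fmt.Printf("%q\n", zoneKey(map[string]string{
        "failure-domain.beta.kubernetes.io/region": "regionOne",
        "failure-domain.beta.kubernetes.io/zone":   "nova",
    })) // "regionOne:\x00:nova"
}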

Did you get update logs for the missing node?

Did you get update logs for the missing node?

lol, was just typing this question.

Perhaps the bug is that the whole zone is deleted from the tree before all of its nodes are removed.

Just to clarify, I'm not personally looking at the code, I'm just trying to make sure we have all the information. And I think that, with what we have now, we should be able to spot the bug. Feel free to submit PRs, and even better if you can provide a unit test that fails.

Did you get update logs for the missing node?

yes, it shows that the zone is updated for that missing node. There is a log entry for all nodes

To be honest, I still have no clue about the reason for the bug, but if we can get close to finding it out, I'll submit a PR or unit tests.

yes, it shows that the zone is updated for that missing node. There is a log entry for all nodes

If so, then I think the assumption that this "is the exact point when the missing node disappears" may just be a coincidence. Let's wait for the new logs. It would be great if you can share all the scheduler logs you get in a file.

I'll do that when we reproduce with the new logging. From the existing logs, we can actually see that the pod scheduling immediately after that update is the first one to fail. But it does not give enough information to know what happened in between, so stay tuned...

@maelk Have you seen a message starting with "snapshot state is not consistent" in the scheduler logs?

Would it be possible for you to provide full scheduler logs?

No, that message is not present. I could give a stripped-down log file (to avoid repetition), but let's first wait until we have the output with more logs around the snapshot.

I have found the bug. The issue is with the nodeTree next() function, which does not return all of the nodes in some cases. https://github.com/kubernetes/kubernetes/blob/release-1.18/pkg/scheduler/internal/cache/node_tree.go#L147

It is visible if you add the following test case here: https://github.com/kubernetes/kubernetes/blob/release-1.18/pkg/scheduler/internal/cache/node_tree_test.go#L443

{
    name:           "add nodes to a new and to an exhausted zone",
    nodesToAdd:     append(allNodes[5:9], allNodes[3]),
    nodesToRemove:  nil,
    operations:     []string{"add", "add", "next", "next", "add", "add", "add", "next", "next", "next", "next"},
    expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
},

The main problem is that when you add a node, the indexes are not at 0 for some of the zones. For this to happen, you must have at least two zones, one shorter than the other, and the longer one must have an index that is not at 0 when the next function is called for the first time.

The fix I went with is to reset the index before calling next() the first time. I opened a PR to show my fix. Of course it is against the 1.18 release as this is what I have been working on, but it is mostly for discussing how to fix it (or maybe fix the next() function itself). I can open a proper PR towards master and do the backports if needed afterwards.
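
To make the failure mode concrete without running the scheduler, here is a minimal, self-contained model of this kind of per-zone round-robin iteration with persistent cursors (heavily simplified; this is not the actual node_tree code). With two uneven zones and a stale cursor in the longer zone, pulling exactly numNodes names produces a list that skips one node and repeats another, which matches the duplicated entry and the missing node observed in the snapshot.

package main

import "fmt"

// Simplified model of per-zone round-robin iteration with persistent cursors.
// This is NOT the real node_tree implementation; it only mimics the failure mode.
type zone struct {
    nodes []string
    idx   int // cursor carried over from previous iterations -- the stale state
}

type tree struct {
    zones     []*zone
    zoneIndex int
}

func (t *tree) next() string {
    if len(t.zones) == 0 {
        return ""
    }
    exhausted := 0
    for {
        z := t.zones[t.zoneIndex%len(t.zones)]
        t.zoneIndex++
        if z.idx >= len(z.nodes) {
            exhausted++
            if exhausted >= len(t.zones) {
                // Every zone looks exhausted, so reset the cursors and keep going.
                // This wrap-around is where the duplicates come from.
                for _, zz := range t.zones {
                    zz.idx = 0
                }
                exhausted = 0
            }
            continue
        }
        name := z.nodes[z.idx]
        z.idx++
        return name
    }
}

func main() {
    // Two uneven zones; the longer zone's cursor is not at 0 when the
    // snapshot starts asking for node names.
    t := &tree{zones: []*zone{
        {nodes: []string{"z1-n1", "z1-n2", "z1-n3"}, idx: 2},
        {nodes: []string{"z2-n1"}},
    }}
    for i := 0; i < 4; i++ { // the snapshot asks for exactly numNodes (4) names
        fmt.Println(t.next())
    }
    // Output: z1-n3, z2-n1, z1-n1, z2-n1 -- z1-n2 never appears, z2-n1 appears twice.
}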

I noticed the same problem with iteration. But I failed to link that to a duplicate in the snapshot. Have you managed to create a scenario where that would happen, @maelk?

Yes, you can reproduce it in the unit tests by adding the small test case I posted above.

I am now working on adding a test case for the snapshot, to make sure this is properly tested.

big thumbs up to @igraecao for the help in reproducing the issue and running the tests in his setup

Thanks all for debugging this notorious issue. Resetting the index before creating the list is safe, so I think we should go with that for 1.18 and 1.19 patches, and have a proper fix in the master branch.

The purpose of the next function changed with the introduction of the NodeInfoList, so we can certainly simplify it and perhaps change it to toList, a function that creates a list from the tree and simply starts from the beginning every time.

I understand the issue now: The calculation of whether or not a zone is exhausted is wrong because it doesn't consider where in each zone we started this "UpdateSnapshot" process. And yeah, it would only be visible with uneven zones.

Great job spotting this @maelk!

I would think we have the same issue in older versions. However, it was hidden by the fact that we did a tree pass every time, whereas in 1.18 we snapshot the result until there are changes in the tree.

Now that the round-robin strategy is implemented in generic_scheduler.go, we might be fine with simply resetting all counters before UpdateSnapshot, as your PR is doing.

https://github.com/kubernetes/kubernetes/blob/02cf58102a61b6d1e021e256381ff750573ce55d/pkg/scheduler/core/generic_scheduler.go#L357

Just to double check @ahg-g, this should be fine even in a cluster where new nodes are added/removed all the time, right?

Thanks @maelk for spotting the root cause!

The purpose of the next function changed with the introduction of the NodeInfoList, so we can certainly simplify it and perhaps change it to toList, a function that creates a list from the tree and simply starts from the beginning every time.

Given that cache.nodeTree.next() is only called in building the snapshot nodeInfoList, I think it's also safe to remove the indexes (both zoneIndex and nodeIndex) from nodeTree struct. Instead, come up with a simple nodeIterator() function to iterate through its zone/node in a round-robin manner.

BTW: there is a typo in https://github.com/kubernetes/kubernetes/issues/91601#issuecomment-662663090, the case should be:

{
    name:           "add nodes to a new and to an exhausted zone",
    nodesToAdd:     append(allNodes[6:9], allNodes[3]),
    nodesToRemove:  nil,
    operations:     []string{"add", "add", "next", "next", "add", "add", "next", "next", "next", "next"},
    expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
    // with the codebase on master and 1.18, its output is [node-6 node-7 node-3 node-8 node-6 node-3]
},

Just to double check @ahg-g, this should be fine even in a cluster where new nodes are added/removed all the time, right?

I am assuming you are talking about the logic in generic_scheduler.go. If so, yes: it doesn't matter much whether nodes were added or removed. The main thing we need to avoid is iterating over the nodes in the same order every time we schedule a pod; we just need a good approximation of iterating over the nodes across pods.
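
A rough sketch of that idea, illustrative only and not the actual generic_scheduler implementation: keep an offset that advances between scheduling attempts, so consecutive pods start filtering at different nodes even though each pod sees the snapshot in a fixed order.

package main

import "fmt"

// Illustrative only -- not the real generic_scheduler logic. The offset rotates
// between scheduling attempts so that filtering, which may stop early once it
// has found enough feasible nodes, does not always start from the same node.
type nodeRotator struct {
    startIndex int
}

func (r *nodeRotator) order(nodes []string) []string {
    if len(nodes) == 0 {
        return nil
    }
    out := make([]string, 0, len(nodes))
    for i := range nodes {
        out = append(out, nodes[(r.startIndex+i)%len(nodes)])
    }
    r.startIndex = (r.startIndex + 1) % len(nodes) // advance for the next pod
    return out
}

func main() {
    r := &nodeRotator{}
    nodes := []string{"n1", "n2", "n3", "n4"}
    fmt.Println(r.order(nodes)) // [n1 n2 n3 n4]
    fmt.Println(r.order(nodes)) // [n2 n3 n4 n1]
}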

Given that cache.nodeTree.next() is only called in building the snapshot nodeInfoList, I think it's also safe to remove the indexes (both zoneIndex and nodeIndex) from nodeTree struct. Instead, come up with a simple nodeIterator() function to iterate through its zone/node in a round-robin manner.

yes, we just need to iterate over all zones/nodes in the same order every time.

I have updated the PR with a unit test for the function that updates the snapshot list, for that bug specifically. I can also take care of refactoring the next() function to iterate over the zones and nodes without round-robin, hence removing the issue.

Thanks, sounds good, but we should still iterate between zones the same way we do now, that is by design.

I don't really get what you mean here. Is it that the order of the nodes matters and we must still go round-robin between zones, or can we list all nodes of a zone, one zone after the other? Let's say you have two zones of two nodes each; in which order do you expect them, or does it even matter at all?

The order matters; we need to alternate between zones while creating the list. If you have two zones of two nodes each, z1: {n11, n12} and z2: {n21, n22}, then the list should be {n11, n21, n12, n22}.
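
A sketch of a stateless list builder along those lines (the name toList and the input shape are assumptions, not the eventual implementation): it rebuilds the list from scratch on every call and interleaves the zones level by level, so no cursors survive between snapshots and uneven zones cannot cause skipped or duplicated nodes. For the example above it yields {n11, n21, n12, n22}.

package main

import "fmt"

// toList is a hypothetical stateless replacement for next(): it walks the zones
// level by level, alternating between them, and never keeps cursors around.
func toList(zones [][]string) []string {
    var out []string
    for i := 0; ; i++ {
        progressed := false
        for _, z := range zones {
            if i < len(z) {
                out = append(out, z[i])
                progressed = true
            }
        }
        if !progressed {
            return out
        }
    }
}

func main() {
    zones := [][]string{
        {"n11", "n12"}, // z1
        {"n21", "n22"}, // z2
    }
    fmt.Println(toList(zones)) // [n11 n21 n12 n22]
}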

OK, thanks, I'll give it a thought. Can we proceed with the quick fix in the meantime? BTW, some tests are failing on it, but I am not sure how that relates to my PR.

Those are flakes. Please send a patch to 1.18 as well.

Ok, will do. Thanks

{
  name:           "add nodes to a new and to an exhausted zone",
  nodesToAdd:     append(allNodes[5:9], allNodes[3]),
  nodesToRemove:  nil,
  operations:     []string{"add", "add", "next", "next", "add", "add", "add", "next", "next", "next", "next"},
  expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
},

@maelk, do you mean this test ignores node-5?

I found that after fixing the append in https://github.com/kubernetes/kubernetes/pull/93516, the test shows that all the nodes can be iterated:

{
            name:           "add nodes to a new and to an exhausted zone",
            nodesToAdd:     append(append(make([]*v1.Node, 0), allNodes[5:9]...), allNodes[3]),
            nodesToRemove:  nil,
            operations:     []string{"add", "add", "next", "next", "add", "add", "add", "next", "next", "next", "next"},
            expectedOutput: []string{"node-5", "node-6", "node-3", "node-7", "node-8", "node-5"},
},

Nodes 5, 6, 7, 8, and 3 can all be iterated.

Forgive me if I misunderstand something here.

Yes, it was on purpose, based on what was already there, but I can see how it can be cryptic, so it's better to make the append behave in a clearer way. Thanks for the patch.
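
For anyone skimming past why the original append was worth changing: in general, appending directly to a sub-slice can reuse the backing array of the original slice when there is spare capacity, silently overwriting a later element, whereas copying into a fresh slice first cannot. Whether or not that bit this particular test, the rewritten form is unambiguous. A small demonstration using strings rather than *v1.Node:

package main

import "fmt"

// General Go gotcha: appending to a sub-slice may reuse the backing array of the
// original slice when there is spare capacity, mutating elements past the sub-slice.
func main() {
    all := []string{"n0", "n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9"}

    a := append(all[5:9], all[3]) // cap(all[5:9]) == 5, so this writes into all[9]
    fmt.Println(a)                // [n5 n6 n7 n8 n3]
    fmt.Println(all[9])           // n3 -- the original slice was mutated

    all[9] = "n9" // restore before the comparison

    b := append(append(make([]string, 0, 5), all[5:9]...), all[3])
    fmt.Println(b)      // [n5 n6 n7 n8 n3]
    fmt.Println(all[9]) // n9 -- untouched this time
}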

How far back do you believe this bug was present? 1.17? 1.16? I've just seen the exact same problem in 1.17 on AWS and restarting the unscheduled node fixed the problem.

@judgeaxl could you provide more details? Log lines, cache dumps, etc. So we can determine whether the issue is the same.

As I noted in https://github.com/kubernetes/kubernetes/issues/91601#issuecomment-662746695, I believe this bug was present in older versions, but my thinking is that it's transient.

@maelk would you be able to investigate?

Please also share the distribution of nodes in the zones.

@alculquicondor unfortunately I can't at this point. Sorry.

@alculquicondor sorry, I already rebuilt the cluster for other reasons, but it may have been a network configuration problem related to multi-az deployments and the subnet in which the faulty node got launched, so I wouldn't worry about it for now in the context of this issue. If I notice it again I'll report back with better details. Thanks!

/retitle Some nodes are not considered in scheduling when there is zone imbalance
