Kubernetes: ignored pod-eviction-timeout settings

Created on 27 Feb 2019 · 15 Comments · Source: kubernetes/kubernetes

What happened: I modified the pod-eviction-timeout setting of kube-controller-manager on the master node (in order to decrease the amount of time before k8s re-creates a pod in case of node failure). The default value is 5 minutes; I configured 30 seconds. Using the command sudo docker ps --no-trunc | grep "kube-controller-manager" I checked that the modification was applied successfully:

kubeadmin@nodetest21:~$ sudo docker ps --no-trunc | grep "kube-controller-manager"
387261c61ee9cebce50de2540e90b89e2bc710b4126a0c066ef41f0a1fb7cf38   sha256:0482f640093306a4de7073fde478cf3ca877b6fcc2c4957624dddb2d304daef5                         "kube-controller-manager --address=127.0.0.1 --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf --client-ca-file=/etc/kubernetes/pki/ca.crt --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt --cluster-signing-key-file=/etc/kubernetes/pki/ca.key --controllers=*,bootstrapsigner,tokencleaner --kubeconfig=/etc/kubernetes/controller-manager.conf --leader-elect=true --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --root-ca-file=/etc/kubernetes/pki/ca.crt --service-account-private-key-file=/etc/kubernetes/pki/sa.key --use-service-account-credentials=true --pod-eviction-timeout=30s" 

I applied a basic deployment with two replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - image: busybox
        command:
        - sleep
        - "3600"
        imagePullPolicy: IfNotPresent
        name: busybox
      restartPolicy: Always
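
For reference, the manifest above can be applied and the pod placement checked with standard kubectl commands (the file name is only illustrative):

kubectl apply -f busybox-deployment.yaml
kubectl get pods --all-namespaces -o wide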

The first pod was created on the first worker node and the second pod on the second worker node:

NAME         STATUS   ROLES    AGE   VERSION
nodetest21   Ready    master   34m   v1.13.3
nodetest22   Ready    <none>   31m   v1.13.3
nodetest23   Ready    <none>   30m   v1.13.3

NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE   IP          NODE         NOMINATED NODE   READINESS GATES
default       busybox-74b487c57b-5s6g7             1/1     Running   0          13s   10.44.0.2   nodetest22   <none>           <none>
default       busybox-74b487c57b-6zdvv             1/1     Running   0          13s   10.36.0.1   nodetest23   <none>           <none>
kube-system   coredns-86c58d9df4-gmcjd             1/1     Running   0          34m   10.32.0.2   nodetest21   <none>           <none>
kube-system   coredns-86c58d9df4-wpffr             1/1     Running   0          34m   10.32.0.3   nodetest21   <none>           <none>
kube-system   etcd-nodetest21                      1/1     Running   0          33m   10.0.1.4    nodetest21   <none>           <none>
kube-system   kube-apiserver-nodetest21            1/1     Running   0          33m   10.0.1.4    nodetest21   <none>           <none>
kube-system   kube-controller-manager-nodetest21   1/1     Running   0          20m   10.0.1.4    nodetest21   <none>           <none>
kube-system   kube-proxy-6mcn8                     1/1     Running   1          31m   10.0.1.5    nodetest22   <none>           <none>
kube-system   kube-proxy-dhdqj                     1/1     Running   0          30m   10.0.1.6    nodetest23   <none>           <none>
kube-system   kube-proxy-vqjg8                     1/1     Running   0          34m   10.0.1.4    nodetest21   <none>           <none>
kube-system   kube-scheduler-nodetest21            1/1     Running   1          33m   10.0.1.4    nodetest21   <none>           <none>
kube-system   weave-net-9qls7                      2/2     Running   3          31m   10.0.1.5    nodetest22   <none>           <none>
kube-system   weave-net-h2cb6                      2/2     Running   0          33m   10.0.1.4    nodetest21   <none>           <none>
kube-system   weave-net-vkb62                      2/2     Running   0          30m   10.0.1.6    nodetest23   <none>           <none>

To test that pod eviction works correctly, I shut down the first worker node. After ~1 minute the status of the first worker node changed to "NotReady", but then
I still had to wait more than 5 minutes (the default pod eviction timeout) for the pod on the powered-off node to be re-created on the other node.
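
One way to observe the timing is to watch the node and pod state while the test runs (standard kubectl commands, shown here only for illustration):

kubectl get nodes --watch
kubectl get pods -o wide --watch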

What you expected to happen:
After the node status reports "NotReady", the pod should be re-created on the other node after 30 seconds instead of the default 5 minutes!

How to reproduce it (as minimally and precisely as possible):
Create three nodes. Initialize Kubernetes on the first node (sudo kubeadm init), apply the network plugin (kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"), then join the other two nodes (e.g. kubeadm join 10.0.1.4:6443 --token xdx9y1.z7jc0j7c8g8lpjog --discovery-token-ca-cert-hash sha256:04ae8388f607755c14eed702a23fd47802d5512e092b08add57040a2ae0736ac).
Add the pod-eviction-timeout parameter to the Kube Controller Manager manifest on the master node: sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --address=127.0.0.1
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --use-service-account-credentials=true
    - --pod-eviction-timeout=30s

(the YAML is truncated; only the relevant first part is shown here).

Check that the setting is applied:
sudo docker ps --no-trunc | grep "kube-controller-manager"

Apply a deployment with two replicas and check that one pod is created on the first worker node and the second on the second worker node.
Shut down one of the nodes and measure the elapsed time between the node reporting "NotReady" and the pod being re-created.

Anything else we need to know?:
I experience the same issue in a multi-master environment as well.

Environment:

  • Kubernetes version (use kubectl version): v1.13.3
    Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:08:12Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Azure VM
  • OS (e.g: cat /etc/os-release): NAME="Ubuntu" VERSION="16.04.5 LTS (Xenial Xerus)"
  • Kernel (e.g. uname -a): Linux nodetest21 4.15.0-1037-azure #39~16.04.1-Ubuntu SMP Tue Jan 15 17:20:47 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others: Docker v18.06.1-ce
kind/bug sig/apps sig/node


All 15 comments

@kubernetes/sig-node-bugs
@kubernetes/sig-apps-bugs

@danielloczi: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs, @kubernetes/sig-apps-bugs

In response to this:

@kubernetes/sig-node-bugs
@kubernetes/sig-apps-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I also ran into this issue while testing a lower eviction timeout. After poking around at this for some time, I figured out that the cause is the new TaintBasedEvictions feature.

In version 1.13, the TaintBasedEvictions feature is promoted to beta and enabled by default, hence the taints are automatically added by the NodeController (or kubelet) and the normal logic for evicting pods from nodes based on the Ready NodeCondition is disabled.

Setting the feature flag to false causes pods to be evicted as expected. I have not taken the time to search through the taint-based eviction code, but I would guess that the pod-eviction-timeout flag is not used within it.
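
As a sketch (not verified in this thread), the gate can be turned off in the kube-controller-manager static pod manifest so that the pre-1.13 eviction logic, and with it pod-eviction-timeout, applies again:

spec:
  containers:
  - command:
    - kube-controller-manager
    # ...existing flags unchanged...
    - --feature-gates=TaintBasedEvictions=false
    - --pod-eviction-timeout=30s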

Looking into this more: with TaintBasedEvictions set to true, you can set a pod's eviction time within its spec under tolerations:
https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions
The default values of these are getting set by an admission controller: https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/defaulttolerationseconds/admission.go#L34
Those two flags can be set via the kube-apiserver and should achieve the same effect.

// Controller will not proactively sync node health, but will monitor node
// health signal updated from kubelet. There are 2 kinds of node healthiness
// signals: NodeStatus and NodeLease. NodeLease signal is generated only when
// NodeLease feature is enabled. If it doesn't receive update for this amount
// of time, it will start posting "NodeReady==ConditionUnknown". The amount of
// time before which Controller start evicting pods is controlled via flag
// 'pod-eviction-timeout'.
// Note: be cautious when changing the constant, it must work with
// nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
// controller. The node health signal update frequency is the minimal of the
// two.
// There are several constraints:
// 1. nodeMonitorGracePeriod must be N times more than  the node health signal
//    update frequency, where N means number of retries allowed for kubelet to
//    post node status/lease. It is pointless to make nodeMonitorGracePeriod
//    be less than the node health signal update frequency, since there will
//    only be fresh values from Kubelet at an interval of node health signal
//    update frequency. The constant must be less than podEvictionTimeout.
// 2. nodeMonitorGracePeriod can't be too large for user experience - larger
//    value takes longer for user to see up-to-date node health.

Thanks for your feedback, ChiefAlexander!
That is exactly the situation you described. I checked the pods, and indeed the default toleration values are assigned to the pod:

kubectl describe pod busybox-74b487c57b-95b6n | grep -i toleration -A 2
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

So I simply added my own values to the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2
      containers:
      - image: busybox
        command:
        - sleep
        - "3600"
        imagePullPolicy: IfNotPresent
        name: busybox
      restartPolicy: Always

After applying this deployment, in case of node failure the node status changes to "NotReady" and the pods are re-created after 2 seconds.

So we don't have to deal with pod-eviction-timeout anymore; the timeout can be set on a per-pod basis! Cool!

Thanks again for your help!

@danielloczi Hi danielloczi, how did you fix this issue? I am also hitting it.

@323929 I think @danielloczi didn't fix the pod-eviction-timeout parameter in kube-controller-manager, but solved it by using taint-based evictions instead. I tested taint-based evictions and it worked for me.

That is right: I simply started to use taint-based evictions.

Is it possible to make it global? I don't want to set that in every pod config, especially since I use a lot of prepackaged charts from Helm.

+1 for having the possibility to configure it for the whole cluster. Tuning per pod or per deployment is rarely useful: in most cases a sane global value is way more convenient, and the current default of 5m is way too long for many cases.

Please, please reopen this issue.

I am facing this same problem. Is there a way to disable taint-based evictions so that pod-eviction-timeout works globally?

I am facing this same problem. Is there a way to disable taint-based evictions so that pod-eviction-timeout works globally?

I think that you can configure global pod eviction via the kube-apiserver: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
I didn't try this, but as far as I can see there are the options --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds.
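
A minimal sketch of what that might look like in /etc/kubernetes/manifests/kube-apiserver.yaml (untested here; the 30-second values are only an example):

spec:
  containers:
  - command:
    - kube-apiserver
    # ...existing flags unchanged...
    - --default-not-ready-toleration-seconds=30
    - --default-unreachable-toleration-seconds=30

These defaults are applied to new pods by the DefaultTolerationSeconds admission plugin linked earlier in this thread.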

Why has this bug been marked as closed? It looks like the original issue is not solved, only worked around.
It is not clear to me why the pod-eviction-timeout flag is not working.

Same issue here.
