Kubernetes: `failed to garbage collect required amount of images. Wanted to free 473842483 bytes, but freed 0 bytes`

Created on 8 Dec 2018  ·  30 Comments  ·  Source: kubernetes/kubernetes

What happened: I've been seeing a number of evictions recently that appear to be due to disk pressure:

$ kubectl get pod kumo-go-api-d46f56779-jl6s2 --namespace=kumo-main -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2018-12-06T10:05:25Z
  generateName: kumo-go-api-d46f56779-
  labels:
    io.kompose.service: kumo-go-api
    pod-template-hash: "802912335"
  name: kumo-go-api-d46f56779-jl6s2
  namespace: kumo-main
  ownerReferences:
  - apiVersion: extensions/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: kumo-go-api-d46f56779
    uid: c0a9355e-f780-11e8-b336-42010aa80057
  resourceVersion: "11617978"
  selfLink: /api/v1/namespaces/kumo-main/pods/kumo-go-api-d46f56779-jl6s2
  uid: 7337e854-f93e-11e8-b336-42010aa80057
spec:
  containers:
  - env:
    - redacted...
    image: gcr.io/<redacted>/kumo-go-api@sha256:c6a94fc1ffeb09ea6d967f9ab14b9a26304fa4d71c5798acbfba5e98125b81da
    imagePullPolicy: Always
    name: kumo-go-api
    ports:
    - containerPort: 5000
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-t6jkx
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-t6jkx
    secret:
      defaultMode: 420
      secretName: default-token-t6jkx
status:
  message: 'The node was low on resource: nodefs.'
  phase: Failed
  reason: Evicted
  startTime: 2018-12-06T10:05:25Z

Taking a look at kubectl get events, I see these warnings:

$ kubectl get events
LAST SEEN   FIRST SEEN   COUNT     NAME                                                                   KIND      SUBOBJECT   TYPE      REASON          SOURCE                                                         MESSAGE
2m          13h          152       gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91   Node                  Warning   ImageGCFailed   kubelet, gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s   (combined from similar events): failed to garbage collect required amount of images. Wanted to free 473948979 bytes, but freed 0 bytes
37m         37m          1         gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e3127ebc715c3   Node                  Warning   ImageGCFailed   kubelet, gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s   failed to garbage collect required amount of images. Wanted to free 473674547 bytes, but freed 0 bytes

Digging a bit deeper:

$ kubectl get event gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91 -o yaml
apiVersion: v1
count: 153
eventTime: null
firstTimestamp: 2018-12-07T11:01:06Z
involvedObject:
  kind: Node
  name: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
  uid: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
kind: Event
lastTimestamp: 2018-12-08T00:16:09Z
message: '(combined from similar events): failed to garbage collect required amount
  of images. Wanted to free 474006323 bytes, but freed 0 bytes'
metadata:
  creationTimestamp: 2018-12-07T11:01:07Z
  name: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91
  namespace: default
  resourceVersion: "381976"
  selfLink: /api/v1/namespaces/default/events/gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91
  uid: 65916e4b-fa0f-11e8-ae9a-42010aa80058
reason: ImageGCFailed
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
type: Warning

There's actually remarkably little here. This message doesn't say anything about why image GC was initiated or why it was unable to recover more space.

What you expected to happen: Image GC to work correctly, or at least for pods not to be scheduled onto nodes that do not have sufficient disk space.

How to reproduce it (as minimally and precisely as possible): Run and stop as many pods as possible on a node in order to encourage disk pressure. Then observe these errors.
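
For anyone trying to reproduce this deliberately, a minimal sketch (my own, not from the original report) is to pull several distinct large images on a node with a small boot disk until the image filesystem crosses the kubelet's high threshold; the images below are only examples of large public images:

# Pull a handful of distinct large images to fill the image filesystem
# (image names/tags are examples only; anything large enough works):
for tag in 3.6 3.7 3.8; do
  docker pull python:"$tag"
done

# Then watch the node's events for ImageGCFailed warnings:
kubectl describe node <node-name> | grep -i imagegc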

Anything else we need to know?: n/a

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-gke.11", GitCommit:"fa90543563c9cfafca69128ce8cd9ecd5941940f", GitTreeState:"clean", BuildDate:"2018-11-08T20:22:21Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): I'm running macOS 10.14, nodes are running Container-Optimized OS (cos).
  • Kernel (e.g. uname -a): Darwin D-10-19-169-80.dhcp4.washington.edu 18.0.0 Darwin Kernel Version 18.0.0: Wed Aug 22 20:13:40 PDT 2018; root:xnu-4903.201.2~1/RELEASE_X86_64 x86_64
  • Install tools: n/a
  • Others: n/a


/kind bug

area/provider/gcp kind/bug sig/node

Most helpful comment

Faced the same problem.

kubectl drain --delete-local-data --ignore-daemonsets $NODE_IP && kubectl uncordon $NODE_IP was enough to clean the disk storage.

All 30 comments

/sig gcp

I just upgraded my master version and nodes to 1.11.3-gke.18 to see if that would be any help, but I'm still seeing the exact same thing.

FWIW "Boot disk size in GB (per node)" was set to the minimum, 10 Gb.

@samuela any update on the issue? I see the same problem.

@hgokavarapuz No update as far as I've heard. Def seems like a serious issue for GKE.

@samuela I saw this issue on AWS but was able to work around it by using a different AMI. I have to check what difference in the AMI causes this to happen, though.

@hgokavarapuz Interesting... maybe this has something to do with the node OS/setup then.

I have to debug more to figure out what exactly causes this issue, though.


@hgokavarapuz check the kubelet logs for clues

I was able to fix mine. It was an issue with the AMI I was using, which had the /var folder mounted on an EBS volume of restricted size, causing the problem with Docker container creation. It was not directly obvious from the logs, but checking the space and other things made it clear.

@hgokavarapuz Are you sure that this actually fixes the problem, and doesn't just mean that more image downloads are needed before the bug occurs?

In my case this was happening within the GKE allowed disk sizes, so I'd say there's definitely still some sort of bug in GKE here at least.

It would also be good to have some sort of official position on the minimum disk size required to run Kubernetes on a node without hitting this error. Otherwise it's not clear exactly how large the volumes must be in order to be within spec for running Kubernetes.
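
For reference, the kubelet exposes two flags that drive image GC, and lowering them gives GC more headroom on small boot disks. A hedged sketch follows; the values are illustrative, and on managed offerings like GKE these flags may not be directly configurable:

# --image-gc-high-threshold: disk usage percentage above which image GC always runs (default 85)
# --image-gc-low-threshold:  disk usage percentage image GC tries to free down to (default 80)
# (KubeletConfiguration equivalents: imageGCHighThresholdPercent / imageGCLowThresholdPercent)
kubelet --image-gc-high-threshold=75 --image-gc-low-threshold=60  # ...plus your existing kubelet flags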

@samuela I haven't tried on GKE, but that was the issue on AWS with some of the AMIs. Maybe there is an issue with GKE.

We're hitting something similar on GKE v1.11.5-gke.4. There seems to be some issue with GC not keeping up, as seen by the following events:

Events:
  Type     Reason                 Age                 From                                               Message
  ----     ------                 ----                ----                                               -------
  Warning  FreeDiskSpaceFailed    47m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 758374400 bytes, but freed 375372075 bytes
  Warning  FreeDiskSpaceFailed    42m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 898760704 bytes, but freed 0 bytes
  Warning  ImageGCFailed          42m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 898760704 bytes, but freed 0 bytes
  Normal   NodeHasDiskPressure    37m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  Node gke-v11-service-graph-pool-c6e93d11-k6h6 status is now: NodeHasDiskPressure
  Warning  FreeDiskSpaceFailed    37m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 1430749184 bytes, but freed 0 bytes
  Warning  ImageGCFailed          37m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 1430749184 bytes, but freed 0 bytes
  Warning  EvictionThresholdMet   36m (x21 over 37m)  kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  Attempting to reclaim ephemeral-storage
  Warning  ImageGCFailed          32m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 1109360640 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed    27m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 1367126016 bytes, but freed 0 bytes
  Warning  ImageGCFailed          22m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 1885589504 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed    17m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 2438008832 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed    12m                 kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 2223022080 bytes, but freed 0 bytes
  Warning  ImageGCFailed          7m                  kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  failed to garbage collect required amount of images. Wanted to free 2358378496 bytes, but freed 0 bytes
  Normal   NodeHasNoDiskPressure  2m (x4 over 4h)     kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6  Node gke-v11-service-graph-pool-c6e93d11-k6h6 status is now: NodeHasNoDiskPressure

Scanning the kubelet logs, I see the following entries:

Feb 07 21:15:31 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:31.447179    1594 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 99% which is over the high threshold (85%). Trying to free 2358378496 byte
Feb 07 21:15:31 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: E0207 21:15:31.452366    1594 kubelet.go:1253] Image garbage collection failed multiple times in a row: failed to garbage collect required amount of images. Wanted to free 2358378496 b
Feb 07 21:15:31 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:31.711566    1594 kuberuntime_manager.go:513] Container {Name:metadata-agent Image:gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.21-1 Command:[] Args:[-o Kub
Feb 07 21:15:32 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:32.004882    1594 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "gke-v11-service-graph-pool-c6e93d11-k6h6"
Feb 07 21:15:32 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:32.008529    1594 cloud_request_manager.go:108] Node addresses from cloud provider for node "gke-v11-service-graph-pool-c6e93d11-k6h6" collected
Feb 07 21:15:34 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:34.817530    1594 kube_docker_client.go:348] Stop pulling image "gcr.io/stackdriver-agents/stackdriver-logging-agent:0.8-1.6.2-1": "e807eb07af89: Extracting [==============
Feb 07 21:15:34 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: E0207 21:15:34.817616    1594 remote_image.go:108] PullImage "gcr.io/stackdriver-agents/stackdriver-logging-agent:0.8-1.6.2-1" from image service failed: rpc error: code = Unknown desc
Feb 07 21:15:34 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: E0207 21:15:34.817823    1594 kuberuntime_manager.go:733] container start failed: ErrImagePull: rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exi
Feb 07 21:15:35 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: W0207 21:15:35.057924    1594 kubelet_getters.go:264] Path "/var/lib/kubelet/pods/652e958e-2b1d-11e9-827c-42010a800fdc/volumes" does not exist
Feb 07 21:15:35 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:35.058035    1594 eviction_manager.go:400] eviction manager: pods fluentd-gcp-v3.1.1-spdfd_kube-system(652e958e-2b1d-11e9-827c-42010a800fdc) successfully cleaned up
Feb 07 21:15:35 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: E0207 21:15:35.091740    1594 pod_workers.go:186] Error syncing pod 7e06145a-2b1d-11e9-827c-42010a800fdc ("fluentd-gcp-v3.1.1-bgdg6_kube-system(7e06145a-2b1d-11e9-827c-42010a800fdc)"),
Feb 07 21:15:35 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: W0207 21:15:35.179545    1594 eviction_manager.go:329] eviction manager: attempting to reclaim ephemeral-storage

It seems like something is preventing GC from reclaiming storage fast enough. The node eventually recovers, but some pods get evicted in the process.
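
When GC seems to be falling behind like this, it can help to compare what Docker considers reclaimable with what the kubelet's threshold check sees on disk. A quick sketch, assuming SSH access to the node and Docker as the runtime:

# Reclaimable space per Docker's own accounting:
docker system df

# The filesystem usage the kubelet's 85% threshold is computed against:
df -h /var/lib/docker

# Dangling images (one class of what image GC can remove):
docker images --filter dangling=true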

I am encountering the same issue. I deployed the stack with kops on AWS and my k8s version is 1.11.6. The problem is that I get application downtime once per week when disk pressure happens.

Same issue here. I extended the EBS volumes thinking that would fix it.
Using AMI k8s-1.10-debian-jessie-amd64-hvm-ebs-2018-08-17 (ami-009b9699070ffc46f)

I faced a similar issue, but on AKS. When we scale the cluster down with the az CLI and then back up, I would expect the new nodes to be clean, i.e. without any garbage, but:

$ kubectl get no
NAME                       STATUS   ROLES   AGE   VERSION
aks-agentpool-11344223-0   Ready    agent   77d   v1.12.4
aks-agentpool-11344223-1   Ready    agent   9h    v1.12.4
aks-agentpool-11344223-2   Ready    agent   9h    v1.12.4
aks-agentpool-11344223-3   Ready    agent   9h    v1.12.4
aks-agentpool-11344223-4   Ready    agent   9h    v1.12.4
aks-agentpool-11344223-5   Ready    agent   9h    v1.12.4

and when I SSH into one of them I can see plenty of old images, like

$ docker images | grep addon-resizer
k8s.gcr.io/addon-resizer                               1.8.4               5ec630648120        6 months ago        38.3MB
k8s.gcr.io/addon-resizer                               1.8.1               6c0dbeaa8d20        17 months ago       33MB
k8s.gcr.io/addon-resizer                               1.7                 9b0815c87118        2 years ago         39MB

or

$ docker images | grep k8s.gcr.io/cluster-autoscaler
k8s.gcr.io/cluster-autoscaler                          v1.14.0             ef6c40006faf        7 weeks ago         142MB
k8s.gcr.io/cluster-autoscaler                          v1.13.2             0f47d27d8e0d        2 months ago        137MB
k8s.gcr.io/cluster-autoscaler                          v1.12.3             9119261ec106        2 months ago        232MB
k8s.gcr.io/cluster-autoscaler                          v1.3.7              c711df426ac6        2 months ago        217MB
k8s.gcr.io/cluster-autoscaler                          v1.12.2             d67faca6c0aa        3 months ago        232MB
k8s.gcr.io/cluster-autoscaler                          v1.13.1             39c073d73c1e        5 months ago        137MB
k8s.gcr.io/cluster-autoscaler                          v1.3.4              6168be341178        6 months ago        217MB
k8s.gcr.io/cluster-autoscaler                          v1.3.3              bd9362bb17a5        7 months ago        217MB
k8s.gcr.io/cluster-autoscaler                          v1.2.2              2378f4474aa3        11 months ago       209MB
k8s.gcr.io/cluster-autoscaler                          v1.1.2              e137f4b4d451        14 months ago       198MB

which is crazy, as I also see plenty of the errors below

  Type     Reason               Age    From                               Message
  ----     ------               ----   ----                               -------
  Warning  FreeDiskSpaceFailed  15m    kubelet, aks-agentpool-11344223-5  failed to garbage collect required amount of images. Wanted to free 1297139302 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed  10m    kubelet, aks-agentpool-11344223-5  failed to garbage collect required amount of images. Wanted to free 1447237222 bytes, but freed 0 bytes
  Warning  ImageGCFailed        10m    kubelet, aks-agentpool-11344223-5  failed to garbage collect required amount of images. Wanted to free 1447237222 bytes, but freed 0 bytes

@samuela: There are no sig labels on this issue. Please add a sig label by either:

  1. mentioning a sig: @kubernetes/sig-<group-name>-<group-suffix>
    e.g., @kubernetes/sig-contributor-experience-<group-suffix> to notify the contributor experience sig, OR

  2. specifying the label manually: /sig <group-name>
    e.g., /sig scalability to apply the sig/scalability label

Note: Method 1 will trigger an email to the group. See the group list.
The <group-suffix> in method 1 has to be replaced with one of these: bugs, feature-requests, pr-reviews, test-failures, proposals.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I am hitting this on Openstack using v1.11.10

Node is completely out of disk space and kubelet logs are now a loop of:

E1029 06:41:37.397348    8907 remote_runtime.go:278] ContainerStatus "redacted" from runtime service failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:redacted" while inspecting docker container "redacted": no such image: "sha256:redacted"
Oct 29 06:41:37 node-name bash[8907]: E1029 06:41:37.397378    8907 kuberuntime_container.go:391] ContainerStatus for redacted error: rpc error: code = Unknown desc = unable to inspect docker image "sha256:redacted" while inspecting docker container "redacted": no such image: "sha256:redacted"
Oct 29 06:41:37 node-name bash[8907]: E1029 06:41:37.397388    8907 kuberuntime_manager.go:873] getPodContainerStatuses for pod "coredns-49t6c_kube-system(redacted)" failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:redacted" while inspecting docker container "redacted": no such image: "sha256:redacted"
Oct 29 06:41:37 node-name bash[8907]: E1029 06:41:37.397404    8907 generic.go:241] PLEG: Ignoring events for pod coredns-49t6c/kube-system: rpc error: code = Unknown desc = unable to inspect docker image "sha256:redacted" while inspecting docker container "redacted": no such image: "sha256:redacted"

The issue for me was caused by a container taking a lot of disk space in a short amount of time. This happened on multiple nodes. The container was evicted (every pod on the node was), but the disk space was not reclaimed by the kubelet.

I had to run du /var/lib/docker/overlay -h | sort -h in order to find which containers were doing this and delete them manually. This brought the nodes out of disk pressure and they recovered (one of them needed a reboot -f).
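
A sketch of that manual cleanup, assuming Docker with the overlay2 storage driver (the inspect format string below is specific to overlay2; adjust paths for plain overlay):

# Largest per-layer directories on the image/container filesystem:
sudo du -h --max-depth=1 /var/lib/docker/overlay2 | sort -h | tail -20

# Map a layer directory back to its container (overlay2 driver):
docker ps -aq | xargs docker inspect --format '{{.Name}} {{.GraphDriver.Data.UpperDir}}' | grep <layer-id>

# Remove the offending, already-evicted container:
docker rm <container-id>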

This is happening for me too. I have 8 nodes in an EKS cluster, and for some reason only one node is having this GC issue. This has happened twice, and the steps below are what I've done to fix it (a rough command-level sketch follows the list). Does anyone know of a better / supported method for doing this? https://kubernetes.io/docs/tasks/administer-cluster/cluster-management/#maintenance-on-a-node

  1. Increase autoscaling group for EKS by +1 (replacement node for the bad one)
  2. Cordon off the bad node (kubectl cordon)
  3. Drain the bad node (kubectl drain), to kick the pods off this node and onto one of the other nodes
  4. Add downscaling protection all nodes except the bad node
  5. Decrease the autoscaling group for EKS by -1 (this deletes the bad node, since it's the only one not protected)
  6. Remove downscaling protection from all nodes
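
A rough command-level sketch of those steps; instance IDs, node names and the auto scaling group name are placeholders, and this assumes the EKS node group is backed by a plain EC2 auto scaling group:

# 1. Add a replacement node
aws autoscaling set-desired-capacity --auto-scaling-group-name my-eks-nodes --desired-capacity 9

# 2-3. Cordon and drain the bad node
kubectl cordon ip-10-0-0-1.ec2.internal
kubectl drain ip-10-0-0-1.ec2.internal --ignore-daemonsets --delete-local-data

# 4. Protect every healthy instance from scale-in
aws autoscaling set-instance-protection --auto-scaling-group-name my-eks-nodes \
  --instance-ids <healthy-instance-ids> --protected-from-scale-in

# 5. Shrink the group; only the unprotected (bad) instance is eligible for termination
aws autoscaling set-desired-capacity --auto-scaling-group-name my-eks-nodes --desired-capacity 8

# 6. Remove the scale-in protection again
aws autoscaling set-instance-protection --auto-scaling-group-name my-eks-nodes \
  --instance-ids <healthy-instance-ids> --no-protected-from-scale-in

A possibly simpler alternative (untested here) is to drain the bad node and then run aws autoscaling terminate-instance-in-auto-scaling-group --instance-id <bad-instance-id> --no-should-decrement-desired-capacity, which terminates it and lets the group bring up a replacement.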

Faced the same problem.

kubectl drain --delete-local-data --ignore-daemonsets $NODE_IP && kubectl uncordon $NODE_IP was enough to clean the disk storage.

FWIW "Boot disk size in GB (per node)" was set to the minimum, 10 Gb.

Thank you very much. It worked for me.

/sig node

@HayTran94 @samuela @KIVagant @dattim
realImageGCManager#freeSpace logs at verbosity level 5 when an image is not eligible for GC,
e.g.

        if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
            klog.V(5).Infof("Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection", image.id, image.lastUsed, freeTime)
            continue

Can you set the log level to 5 and see if realImageGCManager#freeSpace gives some clue?

Thanks
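
For anyone trying that, a sketch of bumping kubelet verbosity on a systemd-managed node; the drop-in path is illustrative and assumes the unit honours KUBELET_EXTRA_ARGS, as kubeadm-style installs do (managed platforms configure the kubelet differently):

sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/20-verbosity.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--v=5"
EOF
sudo systemctl daemon-reload && sudo systemctl restart kubelet

# Then watch for the "not eligible for garbage collection" lines from freeSpace:
journalctl -u kubelet -f | grep -i "eligible for garbage collection"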

@rubencabrera
In the log you posted:

no such image: "sha256:redacted"

Did you have a chance to verify whether the underlying image existed or not?

Thanks

Please keep me out of this loop.
Not sure why I am copied in this email

Thanks & Regards,
Ashutosh Singh


@rubencabrera
In the log you posted:

no such image: "sha256:redacted"

Did you have a chance to verify whether the underlying image existed or not?

Thanks

Hi, @tedyu

Yes, I checked that. We use some private repositories, and images not being available is a frequent problem, so it was my first thought when I saw that error. The image was available and running on other nodes of the same cluster.

Has anyone figured out a way to persuade k8s garbage collection to fire on a disk which isn't the root filesystem? We have to use a secondary (SSD) disk for /var/lib/docker to address EKS performance issues (see https://github.com/awslabs/amazon-eks-ami/issues/454), but garbage collection doesn't fire and we sometimes overflow that secondary disk.
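
Not a confirmed fix, but the kubelet tracks a separate "imagefs" when the runtime's image storage is on its own filesystem, so it may be worth checking that the image GC thresholds and imagefs eviction signals are actually in effect for that disk, and that the kubelet is reporting the secondary disk at all. A sketch with illustrative values:

# Image GC thresholds apply to whichever filesystem backs the runtime's images
# (here the secondary disk mounted at /var/lib/docker):
kubelet \
  --image-gc-high-threshold=80 \
  --image-gc-low-threshold=70 \
  --eviction-hard=imagefs.available<10%,nodefs.available<10%  # ...plus your existing kubelet flags

# Sanity-check which filesystem the kubelet reports as imageFs:
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | grep -A 5 imageFs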

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

We've started suffering this issue in the last week. Kubernetes 1.17.9, built with Kops 1.17.1, self-hosted in AWS, using the k8s-1.17-debian-stretch-amd64-hvm-ebs-2020-01-17 AMI, docker 19.03.11.

This has happened on two separate Nodes within the last week, both of which presented with this:

Events:
  Type     Reason               Age                  From                                                Message
  ----     ------               ----                 ----                                                -------
  Warning  FreeDiskSpaceFailed  10m (x204 over 17h)  kubelet, ip-10-224-54-0.us-west-2.compute.internal  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 5877565849 bytes, but freed 101485977 bytes
  Warning  ImageGCFailed        18s (x205 over 17h)  kubelet, ip-10-224-54-0.us-west-2.compute.internal  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 5886654873 bytes, but freed 0 bytes

du and df on the Node don't agree on how much space is used:

admin@ip-10-224-54-0:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2   57G   48G  5.8G  90% /

admin@ip-10-224-54-0:~$ sudo du -sh /
du: cannot access '/proc/9856/task/9856/fd/3': No such file or directory
du: cannot access '/proc/9856/task/9856/fdinfo/3': No such file or directory
du: cannot access '/proc/9856/fd/4': No such file or directory
du: cannot access '/proc/9856/fdinfo/4': No such file or directory
11G     /

admin@ip-10-224-54-0:~$ sudo du -sh --one-file-system /
6.6G    /

Mounting the root device to another mountpoint to get rid of the other mounted filesystems gets du to agree consistently on the used space, but df still disagrees:

admin@ip-10-224-54-0:~$ mkdir tmproot
admin@ip-10-224-54-0:~$ sudo mount /dev/nvme0n1p2 /home/admin/tmproot
admin@ip-10-224-54-0:~$ df -h tmproot/
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2   57G   48G  5.8G  90% /home/admin/tmproot
admin@ip-10-224-54-0:~$ sudo du -sh tmproot/
6.6G    tmproot/
admin@ip-10-224-54-0:~$ sudo du -sh --one-file-system tmproot/
6.6G    tmproot/

I think this can be due to processes holding deleted files open. But restarting the kubelet doesn't free up this space, and that would be the process I'd suspect of causing this. Restarting docker also didn't free up the space.

The first time this happened, I ended up terminating the Node after several hours of fruitless investigation, but now that it's happening again, I can't make that the permanent solution to the issue.

Interesting datapoint: containerd has deleted files open:

admin@ip-10-224-54-0:~$ sudo lsof 2>&1| grep -v "no pwd entry" |  grep deleted
container 12469           root  cwd       DIR               0,19        40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469           root    4u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469           root    6u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469           root    7u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469           root    8u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12470     root  cwd       DIR               0,19        40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12470     root    4u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12470     root    6u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12470     root    7u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12470     root    8u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12471     root  cwd       DIR               0,19        40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12471     root    4u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12471     root    6u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12471     root    7u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12471     root    8u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12472     root  cwd       DIR               0,19        40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12472     root    4u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12472     root    6u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12472     root    7u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12472     root    8u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12473     root  cwd       DIR               0,19        40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12473     root    4u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12473     root    6u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12473     root    7u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12473     root    8u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12474     root  cwd       DIR               0,19        40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12474     root    4u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12474     root    6u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12474     root    7u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12474     root    8u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12475     root  cwd       DIR               0,19        40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12475     root    4u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12475     root    6u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12475     root    7u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12475     root    8u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12476     root  cwd       DIR               0,19        40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12476     root    4u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12476     root    6u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12476     root    7u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12476     root    8u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12477     root  cwd       DIR               0,19        40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12477     root    4u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12477     root    6u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12477     root    7u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12477     root    8u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 19325     root  cwd       DIR               0,19        40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 19325     root    4u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 19325     root    6u     FIFO              259,2       0t0    2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 19325     root    7u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 19325     root    8u     FIFO              259,2       0t0    2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)

Restarting containerd.service also didn't free the space or get rid of these filehandles.
