When reporting a bug, please use this template and provide as much information as possible. Failing to do so may result in your bug not being addressed in a timely manner. Thanks!
What happened: Recently I've noticed a number of evictions that appear to be due to disk pressure:
$ kubectl get pod kumo-go-api-d46f56779-jl6s2 --namespace=kumo-main -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2018-12-06T10:05:25Z
  generateName: kumo-go-api-d46f56779-
  labels:
    io.kompose.service: kumo-go-api
    pod-template-hash: "802912335"
  name: kumo-go-api-d46f56779-jl6s2
  namespace: kumo-main
  ownerReferences:
  - apiVersion: extensions/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: kumo-go-api-d46f56779
    uid: c0a9355e-f780-11e8-b336-42010aa80057
  resourceVersion: "11617978"
  selfLink: /api/v1/namespaces/kumo-main/pods/kumo-go-api-d46f56779-jl6s2
  uid: 7337e854-f93e-11e8-b336-42010aa80057
spec:
  containers:
  - env:
    - redacted...
    image: gcr.io/<redacted>/kumo-go-api@sha256:c6a94fc1ffeb09ea6d967f9ab14b9a26304fa4d71c5798acbfba5e98125b81da
    imagePullPolicy: Always
    name: kumo-go-api
    ports:
    - containerPort: 5000
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-t6jkx
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-t6jkx
    secret:
      defaultMode: 420
      secretName: default-token-t6jkx
status:
  message: 'The node was low on resource: nodefs.'
  phase: Failed
  reason: Evicted
  startTime: 2018-12-06T10:05:25Z
Looking at kubectl get events, I see the following warnings:
$ kubectl get events
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
2m 13h 152 gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91 Node Warning ImageGCFailed kubelet, gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s (combined from similar events): failed to garbage collect required amount of images. Wanted to free 473948979 bytes, but freed 0 bytes
37m 37m 1 gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e3127ebc715c3 Node Warning ImageGCFailed kubelet, gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s failed to garbage collect required amount of images. Wanted to free 473674547 bytes, but freed 0 bytes
Digging deeper:
$ kubectl get event gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91 -o yaml
apiVersion: v1
count: 153
eventTime: null
firstTimestamp: 2018-12-07T11:01:06Z
involvedObject:
  kind: Node
  name: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
  uid: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
kind: Event
lastTimestamp: 2018-12-08T00:16:09Z
message: '(combined from similar events): failed to garbage collect required amount
  of images. Wanted to free 474006323 bytes, but freed 0 bytes'
metadata:
  creationTimestamp: 2018-12-07T11:01:07Z
  name: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91
  namespace: default
  resourceVersion: "381976"
  selfLink: /api/v1/namespaces/default/events/gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91
  uid: 65916e4b-fa0f-11e8-ae9a-42010aa80058
reason: ImageGCFailed
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
type: Warning
There's really almost nothing here. The message says nothing about why image GC kicked in or why it was unable to reclaim more space.
What you expected to happen: Image GC works properly, or at the very least pods fail to schedule onto nodes that don't have enough disk space.
How to reproduce it (as minimally and precisely as possible): Run and stop as many pods as possible on a node to build up disk pressure, then watch for these errors.
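As a hedged sketch of that repro (the pod name and write sizes below are made up for illustration), a pod that steadily fills the node's ephemeral storage could look like:

```yaml
# Hypothetical repro pod: keeps writing 1 GiB files into the container's
# writable layer until the node's nodefs fills and disk pressure is reported.
apiVersion: v1
kind: Pod
metadata:
  name: disk-pressure-repro   # made-up name
spec:
  restartPolicy: Never
  containers:
  - name: filler
    image: busybox
    command: ["sh", "-c", "i=0; while true; do dd if=/dev/zero of=/fill-$i bs=1M count=1024; i=$((i+1)); done"]
```

Running a few copies of this per node should trigger the same NodeHasDiskPressure / eviction path described above.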
Anything else we need to know?: N/A
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-gke.11", GitCommit:"fa90543563c9cfafca69128ce8cd9ecd5941940f", GitTreeState:"clean", BuildDate:"2018-11-08T20:22:21Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
- Kernel (e.g. uname -a): Darwin D-10-19-169-80.dhcp4.washington.edu 18.0.0 Darwin Kernel Version 18.0.0: Wed Aug 22 20:13:40 PDT 2018; root:xnu-4903.201.2~1/RELEASE_X86_64 x86_64
/kind bug
/sig gcp
I just upgraded my master and nodes to 1.11.3-gke.18 to see if it would help, but I'm still seeing exactly the same thing.
FWIW "Boot disk size in GB (per node)" is set to the minimum, 10 GB.
@samuela Any updates on this issue? I'm seeing the same problem.
@hgokavarapuz As far as I…
@samuela I saw this issue on AWS but was able to get around it by using a different AMI. It seems to come down to the AMI, though I'd have to check what exactly differs between them.
@hgokavarapuz Interesting... maybe it's something to do with the node OS/setup then.
Still, more debugging is needed to find what exactly causes this.
@hgokavarapuz Check the kubelet logs for clues.
I was able to resolve this; it turned out to be an issue with the AMI I was using, which mounted the /var folder onto an EBS volume of limited size, causing problems with Docker container creation. It wasn't directly obvious from the logs, but checking the disk space and a few other things made it clear.
@hgokavarapuz Are you sure that actually fixes the problem, and that it doesn't just take downloading more images for this error to show up again?
In my case this was happening within GKE's allowed disk sizes, so I'd say there's definitely still some kind of bug here, at least in GKE.
It would also be good to have some kind of official position on the minimum disk size needed to run kubernetes on a node without hitting this error. Otherwise it's unclear how large a volume has to be in order to run kubernetes to spec.
@samuela I haven't tried GKE, but that was the problem on AWS with certain AMIs. Maybe GKE has its own issue.
We're hitting a similar problem on GKE v1.11.5-gke.4. As the events below show, GC seems unable to keep up with something:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FreeDiskSpaceFailed 47m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 758374400 bytes, but freed 375372075 bytes
Warning FreeDiskSpaceFailed 42m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 898760704 bytes, but freed 0 bytes
Warning ImageGCFailed 42m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 898760704 bytes, but freed 0 bytes
Normal NodeHasDiskPressure 37m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 Node gke-v11-service-graph-pool-c6e93d11-k6h6 status is now: NodeHasDiskPressure
Warning FreeDiskSpaceFailed 37m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 1430749184 bytes, but freed 0 bytes
Warning ImageGCFailed 37m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 1430749184 bytes, but freed 0 bytes
Warning EvictionThresholdMet 36m (x21 over 37m) kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 Attempting to reclaim ephemeral-storage
Warning ImageGCFailed 32m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 1109360640 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 27m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 1367126016 bytes, but freed 0 bytes
Warning ImageGCFailed 22m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 1885589504 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 17m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 2438008832 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 12m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 2223022080 bytes, but freed 0 bytes
Warning ImageGCFailed 7m kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 failed to garbage collect required amount of images. Wanted to free 2358378496 bytes, but freed 0 bytes
Normal NodeHasNoDiskPressure 2m (x4 over 4h) kubelet, gke-v11-service-graph-pool-c6e93d11-k6h6 Node gke-v11-service-graph-pool-c6e93d11-k6h6 status is now: NodeHasNoDiskPressure
Scanning the kubelet logs, I see the following entries:
Feb 07 21:15:31 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:31.447179 1594 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 99% which is over the high threshold (85%). Trying to free 2358378496 byte
Feb 07 21:15:31 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: E0207 21:15:31.452366 1594 kubelet.go:1253] Image garbage collection failed multiple times in a row: failed to garbage collect required amount of images. Wanted to free 2358378496 b
Feb 07 21:15:31 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:31.711566 1594 kuberuntime_manager.go:513] Container {Name:metadata-agent Image:gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.21-1 Command:[] Args:[-o Kub
Feb 07 21:15:32 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:32.004882 1594 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "gke-v11-service-graph-pool-c6e93d11-k6h6"
Feb 07 21:15:32 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:32.008529 1594 cloud_request_manager.go:108] Node addresses from cloud provider for node "gke-v11-service-graph-pool-c6e93d11-k6h6" collected
Feb 07 21:15:34 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:34.817530 1594 kube_docker_client.go:348] Stop pulling image "gcr.io/stackdriver-agents/stackdriver-logging-agent:0.8-1.6.2-1": "e807eb07af89: Extracting [==============
Feb 07 21:15:34 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: E0207 21:15:34.817616 1594 remote_image.go:108] PullImage "gcr.io/stackdriver-agents/stackdriver-logging-agent:0.8-1.6.2-1" from image service failed: rpc error: code = Unknown desc
Feb 07 21:15:34 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: E0207 21:15:34.817823 1594 kuberuntime_manager.go:733] container start failed: ErrImagePull: rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exi
Feb 07 21:15:35 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: W0207 21:15:35.057924 1594 kubelet_getters.go:264] Path "/var/lib/kubelet/pods/652e958e-2b1d-11e9-827c-42010a800fdc/volumes" does not exist
Feb 07 21:15:35 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: I0207 21:15:35.058035 1594 eviction_manager.go:400] eviction manager: pods fluentd-gcp-v3.1.1-spdfd_kube-system(652e958e-2b1d-11e9-827c-42010a800fdc) successfully cleaned up
Feb 07 21:15:35 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: E0207 21:15:35.091740 1594 pod_workers.go:186] Error syncing pod 7e06145a-2b1d-11e9-827c-42010a800fdc ("fluentd-gcp-v3.1.1-bgdg6_kube-system(7e06145a-2b1d-11e9-827c-42010a800fdc)"),
Feb 07 21:15:35 gke-v11-service-graph-pool-c6e93d11-k6h6 kubelet[1594]: W0207 21:15:35.179545 1594 eviction_manager.go:329] eviction manager: attempting to reclaim ephemeral-storage
It looks like something is preventing GC from reclaiming storage fast enough. The node appears to recover eventually, but some pods get evicted along the way.
I'm running into the same problem. I deployed my stack with kops, and my k8s version is 1.11.6. The problem is that I get application downtime every week when disk pressure occurs.
Same problem here. I expanded the EBS volume thinking that would fix it.
Using AMI k8s-1.10-debian-jessie-amd64-hvm-ebs-2018-08-17 (ami-009b9699070ffc46f)
I've run into a similar problem on AKS. When we scale the cluster down with az cli and then scale it back up, I'd assume the new nodes are clean, i.e. free of any garbage, but:
$ kubectl get no
NAME STATUS ROLES AGE VERSION
aks-agentpool-11344223-0 Ready agent 77d v1.12.4
aks-agentpool-11344223-1 Ready agent 9h v1.12.4
aks-agentpool-11344223-2 Ready agent 9h v1.12.4
aks-agentpool-11344223-3 Ready agent 9h v1.12.4
aks-agentpool-11344223-4 Ready agent 9h v1.12.4
aks-agentpool-11344223-5 Ready agent 9h v1.12.4
When I SSH into one of the nodes, I see lots of images like:
$ docker images | grep addon-resizer
k8s.gcr.io/addon-resizer 1.8.4 5ec630648120 6 months ago 38.3MB
k8s.gcr.io/addon-resizer 1.8.1 6c0dbeaa8d20 17 months ago 33MB
k8s.gcr.io/addon-resizer 1.7 9b0815c87118 2 years ago 39MB
or
$ docker images | grep k8s.gcr.io/cluster-autoscaler
k8s.gcr.io/cluster-autoscaler v1.14.0 ef6c40006faf 7 weeks ago 142MB
k8s.gcr.io/cluster-autoscaler v1.13.2 0f47d27d8e0d 2 months ago 137MB
k8s.gcr.io/cluster-autoscaler v1.12.3 9119261ec106 2 months ago 232MB
k8s.gcr.io/cluster-autoscaler v1.3.7 c711df426ac6 2 months ago 217MB
k8s.gcr.io/cluster-autoscaler v1.12.2 d67faca6c0aa 3 months ago 232MB
k8s.gcr.io/cluster-autoscaler v1.13.1 39c073d73c1e 5 months ago 137MB
k8s.gcr.io/cluster-autoscaler v1.3.4 6168be341178 6 months ago 217MB
k8s.gcr.io/cluster-autoscaler v1.3.3 bd9362bb17a5 7 months ago 217MB
k8s.gcr.io/cluster-autoscaler v1.2.2 2378f4474aa3 11 months ago 209MB
k8s.gcr.io/cluster-autoscaler v1.1.2 e137f4b4d451 14 months ago 198MB
This is crazy, because at the same time I'm seeing plenty of the errors below:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FreeDiskSpaceFailed 15m kubelet, aks-agentpool-11344223-5 failed to garbage collect required amount of images. Wanted to free 1297139302 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 10m kubelet, aks-agentpool-11344223-5 failed to garbage collect required amount of images. Wanted to free 1447237222 bytes, but freed 0 bytes
Warning ImageGCFailed 10m kubelet, aks-agentpool-11344223-5 failed to garbage collect required amount of images. Wanted to free 1447237222 bytes, but freed 0 bytes
@samuela: There are no sig labels on this issue. Please add a sig label by either:
1. mentioning a sig: @kubernetes/sig-<group-name>-<group-suffix>
   e.g., @kubernetes/sig-contributor-experience-<group-suffix> to notify the contributor experience sig, OR
2. specifying the label manually: /sig <group-name>
   e.g., /sig scalability to apply the sig/scalability label
Note: Method 1 will trigger an email to the group. See the group list.
The <group-suffix> in method 1 has to be replaced with one of: _bug, feature-request, pr-review, test-failure, proposal_.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions about my behavior, please file an issue against the kubernetes/test-infra repository.
I'm seeing this on OpenStack with v1.11.10.
The node is completely out of disk space, and the kubelet log is now looping on:
E1029 06:41:37.397348 8907 remote_runtime.go:278] ContainerStatus "redacted" from runtime service failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:redacted" while inspecting docker container "redacted": no such image: "sha256:redacted"
Oct 29 06:41:37 node-name bash[8907]: E1029 06:41:37.397378 8907 kuberuntime_container.go:391] ContainerStatus for redacted error: rpc error: code = Unknown desc = unable to inspect docker image "sha256:redacted" while inspecting docker container "redacted": no such image: "sha256:redacted"
Oct 29 06:41:37 node-name bash[8907]: E1029 06:41:37.397388 8907 kuberuntime_manager.go:873] getPodContainerStatuses for pod "coredns-49t6c_kube-system(redacted)" failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:redacted" while inspecting docker container "redacted": no such image: "sha256:redacted"
Oct 29 06:41:37 node-name bash[8907]: E1029 06:41:37.397404 8907 generic.go:241] PLEG: Ignoring events for pod coredns-49t6c/kube-system: rpc error: code = Unknown desc = unable to inspect docker image "sha256:redacted" while inspecting docker container "redacted": no such image: "sha256:redacted"
For me this problem was caused by containers using a lot of disk space in a short period of time. It happened on multiple nodes. The pods were evicted (every pod on the node was), but the kubelet did not reclaim the disk.
I had to run du /var/lib/docker/overlay -h | sort -h to find the containers responsible and delete them manually. That got the nodes out of Disk Pressure and recovered them (one of them needed a reboot -f).
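The du | sort triage above can be sketched as a self-contained demo (on a real node the target would be /var/lib/docker/overlay; a throwaway directory stands in for it here so this runs anywhere):

```shell
#!/bin/sh
# Same pattern as on a node:
#   du -h --max-depth=1 /var/lib/docker/overlay | sort -h
# Demonstrated on a temp directory so the snippet is runnable anywhere.
demo=$(mktemp -d)
mkdir -p "$demo/small" "$demo/big"
head -c 10240   /dev/zero > "$demo/small/f"   # ~10 KiB
head -c 5242880 /dev/zero > "$demo/big/f"     # ~5 MiB
# Largest entries sort to the bottom; those are the deletion candidates.
du -k -d 1 "$demo" | sort -n
rm -rf "$demo"
```

Sorting numerically puts the worst offenders last, so the directories to inspect (or delete) are the ones printed at the bottom.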
This is happening to me too. I have 8 nodes in an EKS cluster, and for some reason only one of them has this GC problem. It has happened twice now, and below are the steps I took to work around it. Does anyone know a better/supported way of doing this? https://kubernetes.io/docs/tasks/administer-cluster/cluster-management/#maintenance-on-a-node
Facing the same issue.
kubectl drain --delete-local-data --ignore-daemonsets $NODE_IP && kubectl uncordon $NODE_IP
was enough to clean up the disk storage.
FWIW "Boot disk size in GB (per node)" is set to the minimum, 10 GB.
Thank you so much. That worked for me.
/sig node
@HayTran94 @samuela @KIVagant @dattim
realImageGCManager#freeSpace logs at level 5 when an image is not eligible for GC.
e.g.
if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
klog.V(5).Infof("Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection", image.id, image.lastUsed, freeTime)
continue
Could you set the log level to 5 and see whether realImageGCManager#freeSpace provides some clue?
Thanks
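For anyone who wants to try this, a sketch of one way to raise kubelet verbosity on a systemd-managed node. The drop-in path, unit name, and the assumption that the unit's ExecStart already expands $KUBELET_EXTRA_ARGS (as kubeadm-style installs do) all vary by distro and installer:

```ini
# /etc/systemd/system/kubelet.service.d/10-verbose.conf  (hypothetical path)
[Service]
Environment="KUBELET_EXTRA_ARGS=--v=5"
```

After systemctl daemon-reload && systemctl restart kubelet, the extra detail should be visible with journalctl -u kubelet | grep freeSpace.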
@rubencabrera
In the log you posted:
no such image: "sha256:redacted"
Did you get a chance to verify whether the underlying image exists?
Thanks
Please take me off this loop. Not sure why I was copied on this email.
Thanks and regards,
Ashutosh Singh
@rubencabrera
In the log you posted: no such image: "sha256:redacted"
Did you get a chance to verify whether the underlying image exists?
Thanks
Hi @tedyu
Yes, I checked. We use some private registries, and unavailable images are a recurring problem, so that was my first thought when I saw the error. The image is available and is running on other nodes of the same cluster.
Has anyone figured out a way to convince k8s garbage collection to trigger on a disk that isn't the root filesystem? We had to use a secondary (SSD) disk for /var/lib/docker to work around an EKS performance issue (see https://github.com/awslabs/amazon-eks-ami/issues/454). But garbage collection never triggers, and sometimes we overflow that secondary disk.
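One knob worth checking here (a sketch, not a confirmed fix for the secondary-disk case): the kubelet's image GC thresholds and eviction signals are configurable, and the imagefs signals apply to whatever filesystem the container runtime reports as holding images:

```ini
# Hypothetical kubelet flags; whether the kubelet actually treats the
# secondary disk as imagefs depends on what the runtime/cadvisor report.
--image-gc-high-threshold=80   ; start image GC when imagefs usage exceeds 80%
--image-gc-low-threshold=60    ; collect until usage drops back under 60%
--eviction-hard=nodefs.available<10%,imagefs.available<10%
```

If the kubelet is watching the root filesystem while images live on the secondary disk, these thresholds will never fire there, which would match the behavior described above.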
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
We started suffering from this problem last week: Kubernetes 1.17.9 built with kops 1.17.1, using the k8s-1.17-debian-stretch-amd64-hvm-ebs-2020-01-17 AMI, docker 19.03.11, self-hosted in AWS.
Within the past week this has happened on two separate nodes, both showing the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FreeDiskSpaceFailed 10m (x204 over 17h) kubelet, ip-10-224-54-0.us-west-2.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 5877565849 bytes, but freed 101485977 bytes
Warning ImageGCFailed 18s (x205 over 17h) kubelet, ip-10-224-54-0.us-west-2.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 5886654873 bytes, but freed 0 bytes
du and df on the node disagree about how much space is in use:
admin@ip-10-224-54-0:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 57G 48G 5.8G 90% /
admin@ip-10-224-54-0:~$ sudo du -sh /
du: cannot access '/proc/9856/task/9856/fd/3': No such file or directory
du: cannot access '/proc/9856/task/9856/fdinfo/3': No such file or directory
du: cannot access '/proc/9856/fd/4': No such file or directory
du: cannot access '/proc/9856/fdinfo/4': No such file or directory
11G /
admin@ip-10-224-54-0:~$ sudo du -sh --one-file-system /
6.6G /
Mounting the root device at another mount point, to get rid of the other filesystems mounted on top of it, gets du to agree with itself on the space used, but df still disagrees:
admin@ip-10-224-54-0:~$ mkdir tmproot
admin@ip-10-224-54-0:~$ sudo mount /dev/nvme0n1p2 /home/admin/tmproot
admin@ip-10-224-54-0:~$ df -h tmproot/
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 57G 48G 5.8G 90% /home/admin/tmproot
admin@ip-10-224-54-0:~$ sudo du -sh tmproot/
6.6G tmproot/
admin@ip-10-224-54-0:~$ sudo du -sh --one-file-system tmproot/
6.6G tmproot/
I thought this might be because a process was holding deleted files open. But restarting the kubelet, which was my main suspect, did not free the space, and neither did restarting docker.
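That deleted-but-open mechanism is the usual reason df counts space that du can't see, and it can be demonstrated in isolation (Linux-only, since it inspects /proc):

```shell
#!/bin/sh
# A file that is unlinked while a descriptor stays open keeps its blocks
# allocated: du no longer counts it, but df does, until the fd is closed.
f=$(mktemp)
exec 3>"$f"
head -c 1048576 /dev/zero >&3   # 1 MiB written through fd 3
rm "$f"                         # gone from the directory tree (and from du)
readlink "/proc/$$/fd/3"        # shows the old path marked "(deleted)"
exec 3>&-                       # only now is the space actually released
```

Which is why lsof | grep deleted (as below) is the right tool here, even though in this particular case restarting the suspect services did not release the space.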
The first time this happened, after hours of fruitless investigation I ended up terminating the node, but now it has happened again and I don't have a permanent fix.
Interesting data point: containerd has deleted files open:
admin@ip-10-224-54-0:~$ sudo lsof 2>&1| grep -v "no pwd entry" | grep deleted
container 12469 root cwd DIR 0,19 40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 root 4u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 root 6u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 root 7u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 root 8u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12470 root cwd DIR 0,19 40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12470 root 4u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12470 root 6u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12470 root 7u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12470 root 8u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12471 root cwd DIR 0,19 40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12471 root 4u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12471 root 6u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12471 root 7u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12471 root 8u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12472 root cwd DIR 0,19 40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12472 root 4u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12472 root 6u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12472 root 7u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12472 root 8u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12473 root cwd DIR 0,19 40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12473 root 4u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12473 root 6u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12473 root 7u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12473 root 8u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12474 root cwd DIR 0,19 40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12474 root 4u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12474 root 6u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12474 root 7u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12474 root 8u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12475 root cwd DIR 0,19 40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12475 root 4u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12475 root 6u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12475 root 7u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12475 root 8u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12476 root cwd DIR 0,19 40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12476 root 4u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12476 root 6u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12476 root 7u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12476 root 8u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12477 root cwd DIR 0,19 40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 12477 root 4u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12477 root 6u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 12477 root 7u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 12477 root 8u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 19325 root cwd DIR 0,19 40 1180407868 /run/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12 (deleted)
container 12469 19325 root 4u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 19325 root 6u FIFO 259,2 0t0 2097336 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stdout.log (deleted)
container 12469 19325 root 7u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
container 12469 19325 root 8u FIFO 259,2 0t0 2097337 /var/lib/containerd/io.containerd.runtime.v1.linux/moby/34089ad41629df20f181ed191acec724c79fc879dc49287d29184f2fedfaba12/shim.stderr.log (deleted)
Restarting containerd.service did not free the space or get rid of these file handles either.