Trident: Random PVC mount failures due to blkid timeouts

Created on 2020-10-14 · 4 comments · Source: NetApp/trident

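Describe the bug
We observe many pods randomly stuck in the ContainerCreating state because they fail to mount NetApp PVCs. Describing an affected pod points to timeouts while getting iSCSI device information, and the Trident daemonset logs show that the cause is a timeout while executing the blkid command on the host.
The failures are random, and so far we cannot associate them with particular PVCs or hosts.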

Environment

Trident version: 20.07.1
Trident installation flags used: tridentctl install -n sys-trident --generate-custom-yaml (and then kubectl apply -f - for the generated manifests)
Container runtime: docker://19.3.12
Kubernetes version: v1.19.2
Kubernetes orchestrator: none
Kubernetes enabled feature gates: Kubernetes defaults only
OS: Flatcar Container Linux by Kinvolk 2605.6.0 (Oklo), kernel 5.4.67-flatcar
NetApp backend types: ontap-san, ONTAP 9.7.0

To reproduce
We observe it randomly across our cluster nodes (it does not seem to be tied to a specific set of hosts).

Expected behavior
Consistent mount behaviour within a reasonable time frame.

Additional context
Describing a "stuck" pod shows:

Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  51m (x5 over 72m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data thanos-storage vault-tls thanos-store-token-lj9j5]: timed out waiting for the condition
  Warning  FailedMount  42m (x3 over 56m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[thanos-storage vault-tls thanos-store-token-lj9j5 data]: timed out waiting for the condition
  Warning  FailedMount  17m (x15 over 74m)   kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[vault-tls thanos-store-token-lj9j5 data thanos-storage]: timed out waiting for the condition
  Warning  FailedMount  14m (x22 over 74m)   kubelet  MountVolume.MountDevice failed for volume "pvc-e22cdf07-acfc-42af-a46a-bffd5ac32514" : rpc error: code = Internal desc = error getting iSCSI device information: process killed after timeout
  Warning  FailedMount  4m11s (x4 over 69m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[thanos-store-token-lj9j5 data thanos-storage vault-tls]: timed out waiting for the condition

On the same node, the logs of the corresponding Trident daemonset pod show:

time="2020-10-14T14:32:41Z" level=debug msg=">>>> osutils.execCommandWithTimeout." args="[if=/dev/sdc bs=4096 count=1 status=none]" command=dd timeoutSeconds=5s
time="2020-10-14T14:32:41Z" level=debug msg="<<<< osutils.execCommandWithTimeout." command=dd error="<nil>"
time="2020-10-14T14:32:41Z" level=debug msg="<<<< osutils.ensureDeviceReadable"
time="2020-10-14T14:32:41Z" level=debug msg=">>>> osutils.getFSType" device=/dev/sdc
time="2020-10-14T14:32:41Z" level=debug msg=">>>> osutils.waitForDevice" device=/dev/sdc
time="2020-10-14T14:32:41Z" level=debug msg="Device found." device=/dev/sdc
time="2020-10-14T14:32:41Z" level=debug msg="<<<< osutils.waitForDevice" device=/dev/sdc
time="2020-10-14T14:32:41Z" level=debug msg=">>>> osutils.execCommandWithTimeout." args="[/dev/sdc]" command=blkid timeoutSeconds=5s
time="2020-10-14T14:32:46Z" level=error msg="process killed after timeout" process=blkid
time="2020-10-14T14:32:46Z" level=debug msg="<<<< osutils.execCommandWithTimeout." command=blkid error="process killed after timeout"
time="2020-10-14T14:32:46Z" level=debug msg="<<<< osutils.getFSType"
time="2020-10-14T14:32:46Z" level=debug msg="<<<< osutils.getDeviceInfoForLUN" iSCSINodeName="iqn.1992-08.com.netapp:sn.0205ffce026911ebb4d9d039ea1a7953:vs.9" lunID=1 needFSType=true
time="2020-10-14T14:32:46Z" level=debug msg="<<<< osutils.AttachISCSIVolume"
time="2020-10-14T14:32:46Z" level=debug msg="<<<< NodeStageVolume" Method=NodeStageVolume Type=CSI_Node
time="2020-10-14T14:32:46Z" level=debug msg="Released shared lock (NodeStageVolume-pvc-e22cdf07-acfc-42af-a46a-bffd5ac32514)." lock=csi_node_server
time="2020-10-14T14:32:46Z" level=error msg="GRPC error: rpc error: code = Internal desc = error getting iSCSI device information: process killed after timeout"
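To find every occurrence of this pattern in a longer log, the lines can be scanned programmatically. The following is a minimal sketch, assuming the key=value format shown above; the helper names `parse_logfmt` and `timed_out_commands` are hypothetical and not part of Trident, and messages containing unbalanced quotes would need more careful parsing.

```python
import shlex
from datetime import datetime


def parse_logfmt(line):
    """Split one key=value log line into a dict; quoted values are unquoted."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def timed_out_commands(lines):
    """Return (command, seconds_elapsed) for each command killed after its timeout."""
    started = {}  # command name -> time its '>>>>' entry was logged
    results = []
    for line in lines:
        fields = parse_logfmt(line)
        if "time" not in fields:
            continue
        ts = datetime.strptime(fields["time"], "%Y-%m-%dT%H:%M:%SZ")
        msg = fields.get("msg", "")
        if msg.startswith(">>>> osutils.execCommandWithTimeout") and "command" in fields:
            started[fields["command"]] = ts
        elif msg == "process killed after timeout" and fields.get("process") in started:
            begin = started.pop(fields["process"])
            results.append((fields["process"], (ts - begin).total_seconds()))
    return results
```

Feeding it the lines of a saved daemonset log (for example `timed_out_commands(open("trident.log"))`, where the file name is only an illustration) lists which commands hit the limit and how long each ran before being killed.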

It looks like blkid cannot return within the allowed time window (?).
If we ssh into the host and run the same command:

$ time sudo blkid /dev/sdc
/dev/sdc: UUID="f593b708-ed88-47b7-88ce-f9b8c85ab96b" TYPE="ext4"

real    0m36.393s
user    0m0.016s
sys     0m0.021s
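To compare such manual runs against the 5-second limit visible in the logs, the command can be wrapped in a timeout and timed. This is a minimal sketch of the general technique, not Trident's own implementation; `run_with_timeout` is a hypothetical helper name.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, elapsed_seconds, stdout).

    subprocess.run kills the child process if the timeout expires.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return True, time.monotonic() - start, proc.stdout
    except subprocess.TimeoutExpired:
        return False, time.monotonic() - start, ""
```

Run as root on an affected node, `run_with_timeout(["blkid", "/dev/sdc"], 5)` would return `finished=False` whenever blkid takes longer than 5 seconds, which is the situation the daemonset logs report.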

Our backend JSON configuration:
```
{
  "version": 1,
  "storageDriverName": "ontap-san",
  "managementLIF": "10.20.50.6",
  "dataLIF": "10.20.50.4",
  "svm": "dev_kube",
  "igroupName": "dev_kube_trident",
  "username": "xxxxxxxxx",
  "password": "xxxxxxxxxxx",
  "defaults": {
    "encryption": "true"
  }
}
```

We are really stuck here, so any help on this would be much appreciated!

bug

Source

ffilippopoulos

👍4

All 4 comments

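Hi @ffilippopoulos,

As you pointed out, blkid is a host-level command. Whether it returns before the timeout is not something Trident can control. When you ssh into the host and run blkid from the shell, you are doing essentially the same thing Trident does. Have you checked the load on the host?

Also, if you need immediate help with this issue, please contact NetApp Support.

To open a case with NetApp, go to https://mysupport.netapp.com/site/.
At the bottom left, click "Contact Support".
Find the appropriate number for your region to call in, or log in.
Note: Trident is not listed on the page, but it is a product supported by NetApp on the basis of a supported NetApp storage serial number (SN).
Open the case against the NetApp storage SN and provide a description of the problem.
Be sure to mention that the product is Trident on Kubernetes and provide the details. Mention this GitHub issue.
The case will be routed directly to Trident support engineers for a response.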

gnarl on 2020-10-14

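Hey @gnarl, thanks for the quick reply. As far as I understand, this command has a hard timeout of 5 seconds, which is set on Trident's side: https://github.com/NetApp/trident/blob/0a245d3895af31f910a58c2f26e5a0f8b25f34f8/utils/osutils.go#L2306

As far as I can see, our nodes are not loaded at all (for example, on a node where we see the issue right now: load average: 0.54, 0.62, 0.61), and I don't think load can explain the behaviour we observe.
Is there a reason for the hardcoded timeout? Does it guard against some condition that we are not aware of?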

ffilippopoulos on 2020-10-14

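@ffilippopoulos, blkid should not take anywhere near 5 seconds to run. If the load on your host looks fine, can you check the network latency between the host and the NetApp dataLIF?

We do have a hard timeout on blkid, because if blkid does not work then Trident cannot safely attach the volume.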
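A rough first check of that latency can be made by timing TCP connections from the node to the dataLIF. The sketch below assumes the dataLIF accepts connections on the standard iSCSI port 3260; `tcp_connect_latency` is a hypothetical helper name. It measures connection setup time only, so it would not by itself reveal a throughput or link-speed problem.

```python
import socket
import time


def tcp_connect_latency(host, port=3260, samples=5, timeout_s=2.0):
    """Return a list of TCP connect times, in seconds, to host:port.

    Raises OSError if a connection attempt fails or times out.
    """
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        # The connection is closed again as soon as it has been established.
        with socket.create_connection((host, port), timeout=timeout_s):
            timings.append(time.monotonic() - start)
    return timings
```

With the backend configuration from this issue, `max(tcp_connect_latency("10.20.50.4"))` would give the slowest of five connection attempts to the dataLIF.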

gnarl on 2020-10-14

👍1

Closing this, since we have not seen the problem again after debugging our network link speeds. Thanks very much for your help! :))

ffilippopoulos on 2020-10-26

👍2
