Trident: Random PVC mount failures due to blkid timeouts

Created on 14 Oct 2020  ·  4 Comments  ·  Source: NetApp/trident

Describe the bug
We observe a lot of pods randomly stuck in the ContainerCreating state because they cannot mount NetApp PVCs. Describing the pods points to timeouts while getting iSCSI device information, and the Trident daemonset logs suggest that this is due to timeouts while executing the blkid command on the host.
This is a random occurrence and we cannot yet correlate it with particular PVCs or hosts.

Environment

  • Trident version: 20.07.1
  • Trident installation flags used: tridentctl install -n sys-trident --generate-custom-yaml (then kubectl apply -f - for the generated manifests)
  • Container runtime: docker://19.3.12
  • Kubernetes version: v1.19.2
  • Kubernetes orchestrator: none
  • Kubernetes enabled feature gates: Kubernetes defaults only
  • OS: Flatcar Container Linux by Kinvolk 2605.6.0 (Oklo) 5.4.67-flatcar
  • NetApp backend types: ontap-san, ONTAP 9.7.0

To Reproduce
We observe it randomly across our cluster nodes (it does not seem to be related to a particular group of hosts).

Expected behavior
A consistent mount behaviour within a reasonable time window.

Additional context
By describing a "stuck" pod we can see:

```
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  51m (x5 over 72m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data thanos-storage vault-tls thanos-store-token-lj9j5]: timed out waiting for the condition
  Warning  FailedMount  42m (x3 over 56m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[thanos-storage vault-tls thanos-store-token-lj9j5 data]: timed out waiting for the condition
  Warning  FailedMount  17m (x15 over 74m)   kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[vault-tls thanos-store-token-lj9j5 data thanos-storage]: timed out waiting for the condition
  Warning  FailedMount  14m (x22 over 74m)   kubelet  MountVolume.MountDevice failed for volume "pvc-e22cdf07-acfc-42af-a46a-bffd5ac32514" : rpc error: code = Internal desc = error getting iSCSI device information: process killed after timeout
  Warning  FailedMount  4m11s (x4 over 69m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[thanos-store-token-lj9j5 data thanos-storage vault-tls]: timed out waiting for the condition
```

On the same node, the corresponding Trident daemonset pod logs show:

time="2020-10-14T14:32:41Z" level=debug msg=">>>> osutils.execCommandWithTimeout." args="[if=/dev/sdc bs=4096 count=1 status=none]" command=dd timeoutSeconds=5s
time="2020-10-14T14:32:41Z" level=debug msg="<<<< osutils.execCommandWithTimeout." command=dd error="<nil>"
time="2020-10-14T14:32:41Z" level=debug msg="<<<< osutils.ensureDeviceReadable"
time="2020-10-14T14:32:41Z" level=debug msg=">>>> osutils.getFSType" device=/dev/sdc
time="2020-10-14T14:32:41Z" level=debug msg=">>>> osutils.waitForDevice" device=/dev/sdc
time="2020-10-14T14:32:41Z" level=debug msg="Device found." device=/dev/sdc
time="2020-10-14T14:32:41Z" level=debug msg="<<<< osutils.waitForDevice" device=/dev/sdc
time="2020-10-14T14:32:41Z" level=debug msg=">>>> osutils.execCommandWithTimeout." args="[/dev/sdc]" command=blkid timeoutSeconds=5s
time="2020-10-14T14:32:46Z" level=error msg="process killed after timeout" process=blkid
time="2020-10-14T14:32:46Z" level=debug msg="<<<< osutils.execCommandWithTimeout." command=blkid error="process killed after timeout"
time="2020-10-14T14:32:46Z" level=debug msg="<<<< osutils.getFSType"
time="2020-10-14T14:32:46Z" level=debug msg="<<<< osutils.getDeviceInfoForLUN" iSCSINodeName="iqn.1992-08.com.netapp:sn.0205ffce026911ebb4d9d039ea1a7953:vs.9" lunID=1 needFSType=true
time="2020-10-14T14:32:46Z" level=debug msg="<<<< osutils.AttachISCSIVolume"
time="2020-10-14T14:32:46Z" level=debug msg="<<<< NodeStageVolume" Method=NodeStageVolume Type=CSI_Node
time="2020-10-14T14:32:46Z" level=debug msg="Released shared lock (NodeStageVolume-pvc-e22cdf07-acfc-42af-a46a-bffd5ac32514)." lock=csi_node_server
time="2020-10-14T14:32:46Z" level=error msg="GRPC error: rpc error: code = Internal desc = error getting iSCSI device information: process killed after timeout"

It looks like blkid cannot return within the allowed time window(?).
If we ssh into the host and run the same command:

```
$ time sudo blkid /dev/sdc
/dev/sdc: UUID="f593b708-ed88-47b7-88ce-f9b8c85ab96b" TYPE="ext4"

real    0m36.393s
user    0m0.016s
sys     0m0.021s
```
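To separate slow raw device I/O (the iSCSI/network path) from blkid-specific probing, a small timing helper like the following could be used — purely hypothetical, not a Trident or NetApp tool, and analogous in spirit to the dd read Trident performs before blkid; the device path is the one from our logs:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Device path taken from the Trident logs above; reading it requires root.
	dev := "/dev/sdc"
	f, err := os.Open(dev)
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer f.Close()

	// Time a few 4 KiB reads at different offsets; consistently slow reads
	// would point at the storage/network path rather than at blkid itself.
	buf := make([]byte, 4096)
	for _, offset := range []int64{0, 1 << 20, 64 << 20, 512 << 20} {
		start := time.Now()
		if _, err := f.ReadAt(buf, offset); err != nil {
			fmt.Printf("offset %d: read error: %v\n", offset, err)
			continue
		}
		fmt.Printf("offset %d: read 4KiB in %v\n", offset, time.Since(start))
	}
}
```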

Our backend JSON config:
```
{
  "version": 1,
  "storageDriverName": "ontap-san",
  "managementLIF": "10.20.50.6",
  "dataLIF": "10.20.50.4",
  "svm": "dev_kube",
  "igroupName": "dev_kube_trident",
  "username": "xxxxxxxxx",
  "password": "xxxxxxxxxxx",
  "defaults": {
    "encryption": "true"
  }
}
```

We are really stuck here, so any help on this one will be much appreciated!

bug

All 4 comments

Hi @ffilippopoulos,

As you pointed out, blkid is a host-level command. The ability of this command to return before it times out is not something that Trident can control; Trident is basically doing the same thing you are doing when you ssh into the host and run blkid from a shell. Have you examined the load on the host?

Also, if you need immediate assistance with this issue please contact NetApp Support.

To open a case with NetApp, please go to https://mysupport.netapp.com/site/:

  • At the bottom left, click 'Contact Support'.
  • Find the appropriate number for your region to call in, or log in.
  • Note: Trident is not listed on the page, but it is a product supported by NetApp based on a supported NetApp storage SN.
  • Open the case against the NetApp storage SN and provide a description of the problem.
  • Be sure to mention that the product is Trident on Kubernetes, provide the details, and mention this GitHub issue.
  • The case will be directed to Trident support engineers for a response.

Hey @gnarl, thank you for the quick response. As far as I can see there is a hard 5-second timeout on this command though, which is in Trident's court: https://github.com/NetApp/trident/blob/0a245d3895af31f910a58c2f26e5a0f8b25f34f8/utils/osutils.go#L2306

As far as I can see our nodes are not loaded at all (for example, on the node where we currently see the issue: load average: 0.54, 0.62, 0.61), so we do not think load explains the behaviour we observe.
Is there a reason for the hardcoded timeout? Is it guarding against some case we are not aware of?

@ffilippopoulos, blkid should not take anywhere near 5 seconds to run. If the load on your host looks good, can you examine the network latency between the host and the NetApp dataLIF?

We do have a hard timeout on blkid because, if blkid does not work, Trident cannot safely attach the volume.
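As a rough way to sample the suggested host-to-dataLIF latency, something like the following could be used — a hypothetical sketch, not a NetApp or Trident tool; the address is the dataLIF from the backend config above and 3260 is the standard iSCSI port, and TCP connect time is only a coarse proxy for the network latency in question:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// dataLIF address from the backend config above; 3260 is the standard
	// iSCSI port.
	addr := "10.20.50.4:3260"
	for i := 0; i < 5; i++ {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			fmt.Printf("dial %s failed: %v\n", addr, err)
		} else {
			fmt.Printf("connected to %s in %v\n", addr, time.Since(start))
			conn.Close()
		}
		time.Sleep(time.Second)
	}
}
```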

Closing this one as we haven't seen it since we debugged our network link speed. Thank you very much for the help! :))
