Trident: Cannot detach volumes attached to deleted nodes

Created on 14 Jan 2022  ·  4 comments  ·  Source: NetApp/trident

Describe the bug

We cannot detach volumes attached to deleted nodes in Trident 21.10.1. In Trident v21.07.2, these volumes were automatically detached after a certain period. If I understand correctly, this force detachment is performed by the AttachDetachController after ReconcilerMaxWaitForUnmountDuration (6 minutes by default).

It seems that this change was introduced in this commit, which makes Trident's ControllerUnpublishVolume check for the existence of the node. If the node does not exist, ControllerUnpublishVolume now returns a NotFound error, so volume detachment always fails when the node has already been deleted.

In the event of a server failure, volume detachment might fail and we have no choice but to delete the node, so it is desirable that volumes attached to deleted nodes be detached automatically.
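
For illustration, here is a minimal Go sketch of the problematic control flow. It is not Trident's actual code: the nodeStore interface and controllerServer type are hypothetical stand-ins, and only the csi, grpc codes, and status packages are real.

```go
// Hypothetical sketch of the v21.10.1 behavior, not Trident's actual code.
package sketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// nodeStore stands in for however Trident tracks registered nodes.
type nodeStore interface {
	NodeExists(ctx context.Context, name string) (bool, error)
}

type controllerServer struct {
	nodes nodeStore
}

// ControllerUnpublishVolume mirrors the behavior described above: once the
// node object has been deleted, the lookup fails and the RPC returns
// NotFound, so the external-attacher retries the detach forever.
func (s *controllerServer) ControllerUnpublishVolume(
	ctx context.Context, req *csi.ControllerUnpublishVolumeRequest,
) (*csi.ControllerUnpublishVolumeResponse, error) {
	exists, err := s.nodes.NodeExists(ctx, req.GetNodeId())
	if err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	if !exists {
		// A fix in the direction this issue asks for would instead treat a
		// missing node as "already unpublished" and return
		// &csi.ControllerUnpublishVolumeResponse{}, nil here, which CSI
		// idempotency semantics allow; Kubernetes could then complete the
		// detach and reattach the volume elsewhere.
		return nil, status.Errorf(codes.NotFound,
			"node %s was not found", req.GetNodeId())
	}
	// ... normal unpublish path against the storage backend ...
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}
```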

Environment

  • Trident version: 21.10.1
  • Trident installation flags used: silenceAutosupport: true (Trident Operator)
  • Container runtime: Docker 20.10.11
  • Kubernetes version: 1.22.5
  • Kubernetes orchestrator: Kubernetes
  • Kubernetes enabled feature gates:
  • OS: Ubuntu 20.04.3 LTS
  • NetApp backend types: ONTAP AFF 9.7P13
  • Other:

To Reproduce

  • Create a StatefulSet that has an ontap-san volume
  • Delete the node object that the Pod is scheduled on by kubectl delete node
  • After a short time, the StatefulSet controller creates a replacement Pod on another node
  • The recreated Pod cannot attach the volume even after 1 hour

    • With Trident v21.07.2, the Pod will become Running after 6 to 8 minutes

In the VolumeAttachment object, the following error can be found:

rpc error: code = NotFound desc = node <NODE_NAME> was not found
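
For reference, the failing operation can be inspected directly on the VolumeAttachment object with kubectl (the object name below is a placeholder):

```sh
# List VolumeAttachments and locate the one bound to the affected PV
kubectl get volumeattachments

# The failure is recorded in the object's status (attachError/detachError)
kubectl get volumeattachment <VOLUME_ATTACHMENT_NAME> -o yaml
```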

Expected behavior

Trident automatically detaches volumes attached to deleted nodes.

Labels: bug, tracked

All 4 comments

We run a 100+ node Kubernetes cluster on AWS that relies heavily on spot nodes. Spot nodes are terminated with just a few minutes' warning on AWS, which is expected to happen quite often. Even though we run the Node Termination Handler in SQS mode and react to spot termination notifications with automatic node draining, we usually end up in a situation where the detach process doesn't finish before a node is deleted.

In this scenario we often encounter exactly the same issue as described by @tksm. This is a severe problem, as workloads get stuck in a crash-looping state because the PVC fails to attach after the Pod is moved to a new node. I hope the problem can be hotfixed.

Any ETA on a fix?

@paalkr, the team is currently working on a fix. We will update this issue with a link to the commit once it merges.

Excellent, thank you very much.
