Kubernetes: PV is stuck at terminating after PVC is deleted

Created on 11 Oct 2018  ·  59 comments  ·  Source: kubernetes/kubernetes

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I was testing the EBS CSI driver. I created a PV via a PVC, then deleted the PVC. However, the PV is stuck in the Terminating state. Both the PVC and the underlying EBS volume were deleted without any issue, yet the CSI driver keeps being called with DeleteVolume, even though it returns success when the volume is not found (because it is already gone).

CSI Driver log:

I1011 20:37:29.778380       1 controller.go:175] ControllerGetCapabilities: called with args &csi.ControllerGetCapabilitiesRequest{XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}
I1011 20:37:29.780575       1 controller.go:91] DeleteVolume: called with args: &csi.DeleteVolumeRequest{VolumeId:"vol-0ea6117ddb69e78fb", ControllerDeleteSecrets:map[string]string(nil), XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}
I1011 20:37:29.930091       1 controller.go:99] DeleteVolume: volume not found, returning with success

external attacher log:

I1011 19:15:14.931769       1 controller.go:167] Started VA processing "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:14.931794       1 csi_handler.go:76] CSIHandler: processing VA "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:14.931808       1 csi_handler.go:103] Attaching "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:14.931823       1 csi_handler.go:208] Starting attach operation for "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:14.931905       1 csi_handler.go:179] PV finalizer is already set on "pvc-069128c6ccdc11e8"
I1011 19:15:14.931947       1 csi_handler.go:156] VA finalizer is already set on "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:14.931962       1 connection.go:235] GRPC call: /csi.v0.Controller/ControllerPublishVolume
I1011 19:15:14.931966       1 connection.go:236] GRPC request: volume_id:"vol-0ea6117ddb69e78fb" node_id:"i-06d0e08c9565c4db7" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_attributes:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1539123546345-8081-com.amazon.aws.csi.ebs" >
I1011 19:15:14.935053       1 controller.go:197] Started PV processing "pvc-069128c6ccdc11e8"
I1011 19:15:14.935072       1 csi_handler.go:350] CSIHandler: processing PV "pvc-069128c6ccdc11e8"
I1011 19:15:14.935106       1 csi_handler.go:386] CSIHandler: processing PV "pvc-069128c6ccdc11e8": VA "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53" found
I1011 19:15:14.952590       1 controller.go:197] Started PV processing "pvc-069128c6ccdc11e8"
I1011 19:15:14.952613       1 csi_handler.go:350] CSIHandler: processing PV "pvc-069128c6ccdc11e8"
I1011 19:15:14.952654       1 csi_handler.go:386] CSIHandler: processing PV "pvc-069128c6ccdc11e8": VA "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53" found
I1011 19:15:15.048026       1 controller.go:197] Started PV processing "pvc-069128c6ccdc11e8"
I1011 19:15:15.048048       1 csi_handler.go:350] CSIHandler: processing PV "pvc-069128c6ccdc11e8"
I1011 19:15:15.048167       1 csi_handler.go:386] CSIHandler: processing PV "pvc-069128c6ccdc11e8": VA "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53" found
I1011 19:15:15.269955       1 connection.go:238] GRPC response:
I1011 19:15:15.269986       1 connection.go:239] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0ea6117ddb69e78fb" to node "i-06d0e08c9565c4db7": could not attach volume "vol-0ea6117ddb69e78fb" to node "i-06d0e08c9565c4db7": InvalidVolume.NotFound: The volume 'vol-0ea6117ddb69e78fb' does not exist.
        status code: 400, request id: 634b33d1-71cb-4901-8ee0-98933d2a5b47
I1011 19:15:15.269998       1 csi_handler.go:320] Saving attach error to "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:15.274440       1 csi_handler.go:330] Saved attach error to "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:15.274464       1 csi_handler.go:86] Error processing "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53": failed to attach: rpc error: code = Internal desc = Could not attach volume "vol-0ea6117ddb69e78fb" to node "i-06d0e08c9565c4db7": could not attach volume "vol-0ea6117ddb69e78fb" to node "i-06d0e08c9565c4db7": InvalidVolume.NotFound: The volume 'vol-0ea6117ddb69e78fb' does not exist.
        status code: 400, request id: 634b33d1-71cb-4901-8ee0-98933d2a5b47
I1011 19:15:15.274505       1 controller.go:167] Started VA processing "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:15.274516       1 csi_handler.go:76] CSIHandler: processing VA "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:15.274522       1 csi_handler.go:103] Attaching "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:15.274528       1 csi_handler.go:208] Starting attach operation for "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:15.274536       1 csi_handler.go:320] Saving attach error to "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:15.278318       1 csi_handler.go:330] Saved attach error to "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 19:15:15.278339       1 csi_handler.go:86] Error processing "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53": failed to attach: PersistentVolume "pvc-069128c6ccdc11e8" is marked for deletion
I1011 20:37:23.328696       1 controller.go:167] Started VA processing "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.328709       1 csi_handler.go:76] CSIHandler: processing VA "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.328715       1 csi_handler.go:103] Attaching "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.328721       1 csi_handler.go:208] Starting attach operation for "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.328730       1 csi_handler.go:320] Saving attach error to "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.330919       1 reflector.go:286] github.com/kubernetes-csi/external-attacher/vendor/k8s.io/client-go/informers/factory.go:87: forcing resync
I1011 20:37:23.330975       1 controller.go:197] Started PV processing "pvc-069128c6ccdc11e8"
I1011 20:37:23.330990       1 csi_handler.go:350] CSIHandler: processing PV "pvc-069128c6ccdc11e8"
I1011 20:37:23.331030       1 csi_handler.go:386] CSIHandler: processing PV "pvc-069128c6ccdc11e8": VA "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53" found
I1011 20:37:23.346007       1 csi_handler.go:330] Saved attach error to "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.346033       1 csi_handler.go:86] Error processing "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53": failed to attach: PersistentVolume "pvc-069128c6ccdc11e8" is marked for deletion
I1011 20:37:23.346069       1 controller.go:167] Started VA processing "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.346077       1 csi_handler.go:76] CSIHandler: processing VA "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.346082       1 csi_handler.go:103] Attaching "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.346088       1 csi_handler.go:208] Starting attach operation for "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.346096       1 csi_handler.go:320] Saving attach error to "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.351068       1 csi_handler.go:330] Saved attach error to "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53"
I1011 20:37:23.351090       1 csi_handler.go:86] Error processing "csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53": failed to attach: PersistentVolume "pvc-069128c6ccdc11e8" is marked for deletion






>> kk get pv
NAME                   CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM            STORAGECLASS   REASON   AGE
pvc-069128c6ccdc11e8   4Gi        RWO            Delete           Terminating   default/claim1   late-sc                 22h

>> kk describe pv
Name:            pvc-069128c6ccdc11e8
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: com.amazon.aws.csi.ebs
Finalizers:      [external-attacher/com-amazon-aws-csi-ebs]
StorageClass:    late-sc
Status:          Terminating (lasts <invalid>)
Claim:           default/claim1
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        4Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            com.amazon.aws.csi.ebs
    VolumeHandle:      vol-0ea6117ddb69e78fb
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1539123546345-8081-com.amazon.aws.csi.ebs
Events:                <none>



StorageClass:



kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: late-sc
provisioner: com.amazon.aws.csi.ebs
volumeBindingMode: WaitForFirstConsumer



PVC:



apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim1
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: late-sc
  resources:
    requests:
      storage: 4Gi

What you expected to happen:
After PVC is deleted, PV should be deleted along with the EBS volume (since my Reclaim Policy is delete)

How to reproduce it (as minimally and precisely as possible):
Non-deterministic so far

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): client: v1.12.0 server: v1.12.1
  • Cloud provider or hardware configuration: aws
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools: cluster is set up using kops
  • Others:
Labels: kind/bug sig/storage

Most helpful comment

I got rid of this issue by performing the following actions:
Chandras-MacBook-Pro:kubernetes-kafka chandra$ kubectl get pv | grep "kafka/"
pvc-5124cf7a-e9dc-11e8-93a1-02551748eea0   1Gi        RWO            Retain           Bound         kafka/data-pzoo-0    kafka-zookeeper   21h
pvc-639023b2-e9dc-11e8-93a1-02551748eea0   1Gi        RWO            Retain           Bound         kafka/data-pzoo-1    kafka-zookeeper   21h
pvc-7d88b184-e9dc-11e8-93a1-02551748eea0   1Gi        RWO            Retain           Bound         kafka/data-pzoo-2    kafka-zookeeper   21h
pvc-9ea68541-e9dc-11e8-93a1-02551748eea0   100Gi      RWO            Delete           Terminating   kafka/data-kafka-0   kafka-broker      21h
pvc-ae795177-e9dc-11e8-93a1-02551748eea0   100Gi      RWO            Delete           Terminating   kafka/data-kafka-1   kafka-broker      21h
Then I manually edited each PV and removed the finalizers, which looked something like this:

finalizers:
  - kubernetes.io/pv-protection

Once done, the PVs that were stuck in Terminating were all gone!

All 59 comments

Several questions:
1) How can I get out of this situation?
2) Should the PV be deleted successfully once the driver returns success, even when the volume is already gone?

/sig storage

It looks like your PV still has a finalizer from the attacher. Can you verify that the volume got successfully detached from the node?

It may be good to get logs from the external-attacher, and also the AD controller

cc @jsafrane

What version of the external-attacher are you using?

It's v0.3.0, and all the other sidecars are v0.3.0 as well. I was using v0.4.0 earlier, and this issue happened after I recreated the sidecars with v0.3.0.

Updated the description with attacher log

It looks like your PV still has a finalizer from the attacher. Can you verify that the volume got successfully detached from the node?

The volume should have been detached successfully, since it was deleted from AWS without error (I don't think it could be deleted without being detached first). I also verified on the node, using lsblk, that the device is gone.

It looks like the volume was marked for deletion before an attach ever succeeded. Maybe there is some bug with handling that scenario.

Do you still see a VolumeAttachment object?

Do you still see a VolumeAttachment object?

How can I check this?

kubectl get volumeattachment

Yep, it's still there:

>> kubectl get volumeattachment
NAME                                                                   CREATED AT
csi-3b15269e725f727786c5aec3b4da3f2eebc2477dec53d3480a3fe1dd01adea53   2018-10-10T22:30:09Z

Reading the logs, it seems like the A/D controller tried to attach the volume and got an error from the external attacher. Why did it not delete the VolumeAttachment afterwards? Do you still have a pod that uses the volume? If so, it blocks PV deletion.

There is no pod using the volume, and the PVC is gone as well. How can I find the A/D controller log?

It's on the master node, controller-manager.log. You can try to filter by searching for the volume name.
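For example, on a kops-provisioned master something like the following pulls the relevant lines; the log path and pod name vary by installation, so treat these as illustrative:

grep "pvc-069128c6ccdc11e8" /var/log/kube-controller-manager.log
kubectl -n kube-system logs kube-controller-manager-<master-node-name> | grep "pvc-069128c6ccdc11e8"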

Here is the controller log:

E1011 19:14:10.336074       1 daemon_controller.go:304] default/csi-node failed with : error storing status for daemon set &v1.DaemonSet{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, O
bjectMeta:v1.ObjectMeta{Name:"csi-node", GenerateName:"", Namespace:"default", SelfLink:"/apis/apps/v1/namespaces/default/daemonsets/csi-node", UID:"d4e56145-cd89-11e8-9e90-0abab70c948
0", ResourceVersion:"467814", Generation:1, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63674882050, loc:(*time.Location)(0x5b9b560)}}, DeletionTimestamp:(*v1.Time)(nil), De
letionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"deprecated.daemonset.template.generation":"1"}, OwnerReferences:[]v1.OwnerReferenc
e(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.DaemonSetSpec{Selector:(*v1.LabelSelector)(0xc4233ac360), Template:v1.PodTemplateSpec{O
bjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*t
ime.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app":"csi-node"}, Annotations:map[string]string(nil), Owner
References:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.PodSpec{Volumes:[]v1.Volume{v1.Volume{Name:"kubelet-dir",
VolumeSource:v1.VolumeSource{HostPath:(*v1.HostPathVolumeSource)(0xc4233ac380), EmptyDir:(*v1.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*v1.GCEPersistentDiskVolumeSource)(nil), AW
SElasticBlockStore:(*v1.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*v1.GitRepoVolumeSource)(nil), Secret:(*v1.SecretVolumeSource)(nil), NFS:(*v1.NFSVolumeSource)(nil), ISCSI:(*v1
.ISCSIVolumeSource)(nil), Glusterfs:(*v1.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*v1.PersistentVolumeClaimVolumeSource)(nil), RBD:(*v1.RBDVolumeSource)(nil), FlexVolume:(*v
1.FlexVolumeSource)(nil), Cinder:(*v1.CinderVolumeSource)(nil), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(
nil), FC:(*v1.FCVolumeSource)(nil), AzureFile:(*v1.AzureFileVolumeSource)(nil), ConfigMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil), StorageOS:(*v1.StorageOSVolumeSource)(nil)}}, v1.Volume{Name:"plugin-dir", VolumeSource:v1.VolumeSource{HostPath:(*v1.HostPathVolumeSource)(0xc4233ac3a0), EmptyDir:(*v1.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*v1.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*v1.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*v1.GitRepoVolumeSource)(nil), Secret:(*v1.SecretVolumeSource)(nil), NFS:(*v1.NFSVolumeSource)(nil), ISCSI:(*v1.ISCSIVolumeSource)(nil), Glusterfs:(*v1.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*v1.PersistentVolumeClaimVolumeSource)(nil), RBD:(*v1.RBDVolumeSource)(nil), FlexVolume:(*v1.FlexVolumeSource)(nil), Cinder:(*v1.CinderVolumeSource)(nil), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(nil), FC:(*v1.FCVolumeSource)(nil), AzureFile:(*v1.AzureFileVolumeSource)(nil), ConfigMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil), StorageOS:(*v1.StorageOSVolumeSource)(nil)}}, v1.Volume{Name:"device-dir", VolumeSource:v1.VolumeSource{HostPath:(*v1.HostPathVolumeSource)(0xc4233ac3c0), EmptyDir:(*v1.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*v1.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*v1.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*v1.GitRepoVolumeSource)(nil), Secret:(*v1.SecretVolumeSource)(nil), NFS:(*v1.NFSVolumeSource)(nil), ISCSI:(*v1.ISCSIVolumeSource)(nil), Glusterfs:(*v1.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*v1.PersistentVolumeClaimVolumeSource)(nil), RBD:(*v1.RBDVolumeSource)(nil), FlexVolume:(*v1.FlexVolumeSource)(nil), Cinder:(*v1.CinderVolumeSource)(nil), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(nil), FC:(*v1.FCVolumeSource)(nil), AzureFile:(*v1.AzureFileVolumeSource)(nil), ConfigMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil), StorageOS:(*v1.StorageOSVolumeSource)(nil)}}}, InitContainers:[]v1.Container(nil), Containers:[]v1.Container{v1.Container{Name:"csi-driver-registrar", Image:"quay.io/k8scsi/driver-registrar:v0.3.0", Command:[]string(nil), Args:[]string{"--v=5", "--csi-address=$(ADDRESS)"}, WorkingDir:"", Ports:[]v1.ContainerPort(nil), EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar{v1.EnvVar{Name:"ADDRESS", Value:"/csi/csi.sock", ValueFrom:(*v1.EnvVarSource)(nil)}, v1.EnvVar{Name:"KUBE_NODE_NAME", Value:"", ValueFrom:(*v1.EnvVarSource)(0xc4233ac400)}}, 
Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil), Requests:v1.ResourceList(nil)}, VolumeMounts:[]v1.VolumeMount{v1.VolumeMount{Name:"plugin-dir", ReadOnly:false, MountPath:"/csi", SubPath:"", MountPropagation:(*v1.MountPropagationMode)(nil)}}, VolumeDevices:[]v1.VolumeDevice(nil), LivenessProbe:(*v1.Probe)(nil), ReadinessProbe:(*v1.Probe)(nil), Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"Always", SecurityContext:(*v1.SecurityContext)(0xc422ccc050), Stdin:false, StdinOnce:false, TTY:false}, v1.Container{Name:"ebs-plugin", Image:"quay.io/bertinatto/ebs-csi-driver:testing", Command:[]string(nil), Args:[]string{"--endpoint=$(CSI_ENDPOINT)", "--logtostderr", "--v=5"}, WorkingDir:"", Ports:[]v1.ContainerPort(nil), EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar{v1.EnvVar{Name:"CSI_ENDPOINT", Value:"unix:/csi/csi.sock", ValueFrom:(*v1.EnvVarSource)(nil)}, v1.EnvVar{Name:"AWS_ACCESS_KEY_ID", Value:"", ValueFrom:(*v1.EnvVarSource)(0xc4233ac460)}, v1.EnvVar{Name:"AWS_SECRET_ACCESS_KEY", Value:"", ValueFrom:(*v1.EnvVarSource)(0xc4233ac480)}}, Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil), Requests:v1.ResourceList(nil)}, VolumeMounts:[]v1.VolumeMount{v1.VolumeMount{Name:"kubelet-dir", ReadOnly:false, MountPath:"/var/lib/kubelet", SubPath:"", MountPropagation:(*v1.MountPropagationMode)(0xc422c717e0)}, v1.VolumeMount{Name:"plugin-dir", ReadOnly:false, MountPath:"/csi", SubPath:"", MountPropagation:(*v1.MountPropagationMode)(nil)}, v1.VolumeMount{Name:"device-dir", ReadOnly:false, MountPath:"/dev", SubPath:"", MountPropagation:(*v1.MountPropagationMode)(nil)}}, VolumeDevices:[]v1.VolumeDevice(nil), LivenessProbe:(*v1.Probe)(nil), ReadinessProbe:(*v1.Probe)(nil), Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"Always", SecurityContext:(*v1.SecurityContext)(0xc422ccc0f0), Stdin:false, StdinOnce:false, TTY:false}}, RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc422d68b30), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"csi-node-sa", DeprecatedServiceAccount:"csi-node-sa", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", HostNetwork:true, HostPID:false, HostIPC:false, ShareProcessNamespace:(*bool)(nil), SecurityContext:(*v1.PodSecurityContext)(0xc42325ec60), ImagePullSecrets:[]v1.LocalObjectReference(nil), Hostname:"", Subdomain:"", Affinity:(*v1.Affinity)(nil), SchedulerName:"default-scheduler", Tolerations:[]v1.Toleration(nil), HostAliases:[]v1.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), DNSConfig:(*v1.PodDNSConfig)(nil), ReadinessGates:[]v1.PodReadinessGate(nil), RuntimeClassName:(*string)(nil)}}, UpdateStrategy:v1.DaemonSetUpdateStrategy{Type:"RollingUpdate", RollingUpdate:(*v1.RollingUpdateDaemonSet)(0xc424139a40)}, MinReadySeconds:0, RevisionHistoryLimit:(*int32)(0xc422d68b38)}, Status:v1.DaemonSetStatus{CurrentNumberScheduled:2, NumberMisscheduled:0, DesiredNumberScheduled:3, NumberReady:0, ObservedGeneration:1, UpdatedNumberScheduled:2, NumberAvailable:0, NumberUnavailable:3, CollisionCount:(*int32)(nil), Conditions:[]v1.DaemonSetCondition(nil)}}: Operation cannot be fulfilled on daemonsets.apps "csi-node": the object has been modified; please apply your changes to the latest version and try again
I1011 19:15:14.740106       1 pv_controller.go:601] volume "pvc-069128c6ccdc11e8" is released and reclaim policy "Delete" will be executed
I1011 19:15:14.756316       1 pv_controller.go:824] volume "pvc-069128c6ccdc11e8" entered phase "Released"
I1011 19:15:14.759557       1 pv_controller.go:1294] isVolumeReleased[pvc-069128c6ccdc11e8]: volume is released
I1011 19:15:14.939461       1 pv_controller.go:1294] isVolumeReleased[pvc-069128c6ccdc11e8]: volume is released
I1011 19:15:14.954828       1 pv_controller.go:1294] isVolumeReleased[pvc-069128c6ccdc11e8]: volume is released

The last line is repeated indefinitely.

I have encountered this issue two more times now, all on v1.12.

I got rid of this issue by performing the following actions:
Chandras-MacBook-Pro:kubernetes-kafka chandra$ kubectl get pv | grep "kafka/"
pvc-5124cf7a-e9dc-11e8-93a1-02551748eea0   1Gi        RWO            Retain           Bound         kafka/data-pzoo-0    kafka-zookeeper   21h
pvc-639023b2-e9dc-11e8-93a1-02551748eea0   1Gi        RWO            Retain           Bound         kafka/data-pzoo-1    kafka-zookeeper   21h
pvc-7d88b184-e9dc-11e8-93a1-02551748eea0   1Gi        RWO            Retain           Bound         kafka/data-pzoo-2    kafka-zookeeper   21h
pvc-9ea68541-e9dc-11e8-93a1-02551748eea0   100Gi      RWO            Delete           Terminating   kafka/data-kafka-0   kafka-broker      21h
pvc-ae795177-e9dc-11e8-93a1-02551748eea0   100Gi      RWO            Delete           Terminating   kafka/data-kafka-1   kafka-broker      21h
Then I manually edited each PV and removed the finalizers, which looked something like this:

finalizers:
  - kubernetes.io/pv-protection

Once done, the PVs that were stuck in Terminating were all gone!

@chandraprakash1392's answer is also valid when the PVC is stuck in Terminating status.
You just need to edit the PVC object and remove the finalizers field.

Removing the finalizers is just a workaround. @bertinatto @leakingtapan could you help repro this issue and save detailed CSI driver and controller-manager logs?

Examples of removing the finalizers:

kubectl patch pvc db-pv-claim -p '{"metadata":{"finalizers":null}}'
kubectl patch pod db-74755f6698-8td72 -p '{"metadata":{"finalizers":null}}'

then you can delete them

!!! IMPORTANT !!!:
Read also https://github.com/kubernetes/kubernetes/issues/78106
The patch commands are a workaround and something is not working properly.
The volumes are still attached: check kubectl get volumeattachments!
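For the PV from this issue, the full workaround would look roughly like this (a sketch only; the VolumeAttachment name is a placeholder, and the finalizers should only be cleared after confirming on the cloud side that the volume is really detached or deleted):

kubectl get volumeattachments
kubectl get pv pvc-069128c6ccdc11e8 -o jsonpath='{.metadata.finalizers}'
kubectl patch pv pvc-069128c6ccdc11e8 -p '{"metadata":{"finalizers":null}}'
kubectl patch volumeattachment <va-name> -p '{"metadata":{"finalizers":null}}'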

Removing the finalizers is just a workaround. @bertinatto @leakingtapan could you help repro this issue and save detailed CSI driver and controller-manager logs?

I managed to reproduce it after a few tries, although the log messages seem a bit different from the ones reported by @leakingtapan:

Plugin (provisioner): https://gist.github.com/bertinatto/16f5c1f76b1c2577cd66dbedfa4e0c7c
Plugin (attacher): https://gist.github.com/bertinatto/25ebd591ffc88d034f5b4419c0bfa040
Controller manager: https://gist.github.com/bertinatto/a2d82fdbccbf7ec0bb5e8ab65d47dcf3

Same here, had to delete the finalizer, here's a describe for the pv:

[root@ip-172-31-44-98 stateful]# k describe pv pvc-1c6625e2-1157-11e9-a8fc-0275b365cbce
Name:            pvc-1c6625e2-1157-11e9-a8fc-0275b365cbce
Labels:          failure-domain.beta.kubernetes.io/region=us-east-1
                 failure-domain.beta.kubernetes.io/zone=us-east-1a
Annotations:     kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                 pv.kubernetes.io/bound-by-controller: yes
                 pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    default
Status:          Terminating (lasts <invalid>)
Claim:           monitoring/storage-es-data-0
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        12Gi
Node Affinity:   <none>
Message:
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-east-1a/vol-0a20e4f50b60df855
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

Reading the logs, it seems like the A/D controller tried to attach the volume and got an error from the external attacher. Why did it not delete the VolumeAttachment afterwards? Do you still have a pod that uses the volume? If so, it blocks PV deletion.

@jsafrane I only have one pod, and I deleted the PVC after the Pod was deleted.

I am able to reproduce the issue consistently. This happens when AttachVolume fails.

To reproduce, I created a three-node k8s cluster, manually created an EBS volume, and then created a statically provisioned PV for it. Deploy these specs:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-test2
  annotations:
    pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Gi
  csi:
    driver: ebs.csi.aws.com
    fsType: ext3
    volumeHandle: vol-0e850f49c7f6aeff0
  persistentVolumeReclaimPolicy: Delete
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim1
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ""
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: centos:7
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: claim1

When Pod creation fails due to a zone mismatch, delete the Pod and then delete the PVC. The PV will be stuck in the Terminating state. I reproduced this on v1.13.
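For reference, a rough command-level sketch of that repro (the zone, volume size, and file name are illustrative, not an official test case):

aws ec2 create-volume --availability-zone us-east-1a --size 1    # a zone the pod will not be scheduled into
kubectl apply -f static-pv-claim-pod.yaml                        # the PV/PVC/Pod specs above
kubectl get pod app                                              # attach fails with a zone mismatch
kubectl delete pod app
kubectl delete pvc claim1
kubectl get pv pv-test2                                          # stuck in Terminating
kubectl get volumeattachment                                     # the VolumeAttachment lingers with an attachError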

VolumeAttachment Object:

apiVersion: v1
items:
- apiVersion: storage.k8s.io/v1
  kind: VolumeAttachment
  metadata:
    annotations:
      csi.alpha.kubernetes.io/node-id: i-056c2441224e9988f
    creationTimestamp: 2019-01-07T20:41:36Z
    finalizers:
    - external-attacher/ebs-csi-aws-com
    name: csi-92e3ad4dcfc45346bd1efae253530bb83f34c7cf0ecb3e58da0cf97645de2e54
    resourceVersion: "390447"
    selfLink: /apis/storage.k8s.io/v1/volumeattachments/csi-92e3ad4dcfc45346bd1efae253530bb83f34c7cf0ecb3e58da0cf97645de2e54
    uid: a0632702-12bc-11e9-844f-0a6817dc5d60
  spec:
    attacher: ebs.csi.aws.com
    nodeName: ip-172-20-106-186.ec2.internal
    source:
      persistentVolumeName: pv-test2
  status:
    attachError:
      message: PersistentVolume "pv-test2" is marked for deletion
      time: 2019-01-07T21:17:27Z
    attached: false
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

I think the lingering VolumeAttachment object is left behind because, when csi_attacher.go fails to attach the volume, the VolumeAttachment object is not deleted.

We could delete the object when waitForVolumeAttachment fails. What do you think? I would be happy to send out the fix.

/cc @jsafrane @msau42

I reproduced the issue using a non-existent volume ID (attach fails with "The volume 'vol-01213456789' does not exist").

We could delete the object when waitForVolumeAttachment fails.

This is very tricky. There can be two kinds of attach failures:

  1. "permanent", the volume cannot be ever attached: EBS volume does not exists, it has wrong zone, IAM does not have enough permissions, ...
  2. "temporary", the volume may be attached in (near) future: timeout, the volume is being detached from other node

When waitForVolumeAttachment fails, the A/D controller does not know what kind of failure it was. The external attacher still tries to attach the volume in parallel, hoping that it is going to succeed eventually.

This is especially tricky when the attacher timed out (the volume was "Attaching" for a long time); you can see that the AWS CSI driver does not detach a volume when waitForAttachmentState fails:

https://sourcegraph.com/github.com/kubernetes-sigs/aws-ebs-csi-driver@ff1fe8e1399784657c10d67649146429dcb93515/-/blob/pkg/cloud/cloud.go#L300

That means that AWS still tries to attach the volume even though the CSI driver returned a timeout. The next ControllerPublish should resume waitForAttachmentState where it left off. At this point the A/D controller should not delete the VolumeAttachment, because that would detach the volume. The A/D controller would try to re-attach the volume shortly afterwards (i.e. create a new VolumeAttachment), the volume would start "Attaching" from scratch and would likely time out again, creating a loop of VolumeAttachment creation/deletion that never waits long enough for the attachment to succeed.

We have several options:

  1. The A/D controller remembers that the volume is being attached - it is neither attached (the pod cannot be started) nor detached (Detach() must be called when the volume disappears from DSW). This could be quite complicated, especially when reconstructing this state after a controller restart.

  2. The CSI volume plugin can delete the VolumeAttachment when waitForVolumeAttachment fails, but then we expect that ControllerPublish/Unpublish is quite fast (15s by default) and does not need to be "resumed" as implemented now and described above. Increasing the 15s timeout might help in this case (but it is hardcoded in the CSI volume plugin and not configurable).

cc @gnufied @bertinatto

@jingxu97, do I remember correctly that there was something in volume reconstruction that made volumes neither mounted nor unmounted in VolumeManager? Maybe we need something similar here.

@jsafrane seems like it is related to this: https://github.com/kubernetes/kubernetes/pull/71276

@jingxu97 Indeed, #71276 helped, thanks a lot.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Just ran into this issue and had to use kubectl edit pv|pvc and remove the finalizers in order to get rid of the objects.

@pniederlag is this reproducible? If so can you provide PVC details? Were all pods using that PVC deleted as well?

@pnegahdar just want to make sure which version of kubernetes you are using

@pniederlag is this reproducible? If so can you provide PVC details? Were all pods using that PVC deleted as well?

@msau42 unfortunately I don't have a scenario that easily reproduces this. I am still new to Kubernetes and finding my way around handling volumes with pods, and I have run into numerous problems where the pod gets stuck attaching the volume (backed by an Azure disk). So my use case probably was not a sane one (using Terraform and the Kubernetes dashboard in parallel).

Examples of removing the finalizers:

kubectl patch pvc db-pv-claim -p '{"metadata":{"finalizers":null}}'
kubectl patch pod db-74755f6698-8td72 -p '{"metadata":{"finalizers":null}}'

then you can delete them

Thank you, it worked for me.

Looking at the error, I suspect this is related to #73098. During namespace deletion, it appears the deletionTimestamp on leftover resources is continually updated, causing conflict errors when the controller tries to remove its finalizer. Looks like the fix for this problem is in 1.14+

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Experiencing this on v1.14.6. I initially reported my issue here: https://github.com/longhorn/longhorn/issues/722, as I am also using Longhorn. Longhorn appears to do what it's supposed to by deleting the volume. Attempting to delete the PVC via the API results in the object stuck in removing status, waiting on kubernetes.io/pv-protection:

"name": "pvc-dadb69dd-04a2-11ea-9e5f-005056a36714",
"persistentVolumeReclaimPolicy": "Delete",
"removed": "2019-11-12T16:46:04Z",
"removedTS": 1573577164000,
"state": "removing",
"status": {
"phase": "Bound",
"type": "/v3/cluster/schemas/persistentVolumeStatus"
},
"storageClassId": "longhorn-new",
"transitioning": "error",
"transitioningMessage": "waiting on kubernetes.io/pv-protection",
"type": "persistentVolume",
"uuid": "dc89634d-04a2-11ea-98c4-005056a34210",
"volumeMode": "Filesystem"
}

I got rid of this issue by performing the following actions:

pvc-5124cf7a-e9dc-11e8-93a1-02551748eea0   1Gi        RWO            Retain           Bound         kafka/data-pzoo-0                                         kafka-zookeeper             21h
pvc-639023b2-e9dc-11e8-93a1-02551748eea0   1Gi        RWO            Retain           Bound         kafka/data-pzoo-1                                         kafka-zookeeper             21h
pvc-7d88b184-e9dc-11e8-93a1-02551748eea0   1Gi        RWO            Retain           Bound         kafka/data-pzoo-2                                         kafka-zookeeper             21h
pvc-9ea68541-e9dc-11e8-93a1-02551748eea0   100Gi      RWO            Delete           Terminating   kafka/data-kafka-0                                        kafka-broker                21h
pvc-ae795177-e9dc-11e8-93a1-02551748eea0   100Gi      RWO            Delete           Terminating   kafka/data-kafka-1                                        kafka-broker                21h

Then I manually edited each PV and removed the finalizers, which looked something like this:

finalizers:
  - kubernetes.io/pv-protection

Once done, the PVs that were stuck in Terminating were all gone!

Use the command below to edit the PV and then delete the finalizers entry in the definition:
kubectl edit pv pv-name-id

Any real solution so far? I remember seeing this issue in the first months of 2019, and still see the patching of finalizers as the only workaround. Unfortunately, this "bug" prohibits me from automating certain stuff.

Does someone have a diagnosis for this? Any insights? After 7 months, this thing is still in the wild and I have no more information than I did back then.

There have been various different root causes that have been fixed.

To debug any additional issues, we need repro steps, more detailed logs from kube-controller-manager to see what pv protection controller is doing, and potentially more logs from other controllers/drivers to debug further.
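For anyone who hits this again, a rough sketch of what to collect (resource, pod, and container names are placeholders that depend on your deployment):

kubectl get pv <pv-name> -o yaml
kubectl get volumeattachment -o yaml
kubectl -n kube-system logs <kube-controller-manager-pod> | grep <pv-name>
kubectl logs <csi-controller-pod> -c csi-attacher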

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Still happening in 1.17.5

kube-controller-manager-ip-172-16-113-117.us-west-2.compute.internal kube-controller-manager E0421 23:53:59.806754       1 aws.go:2567] Error describing volume "vol-0a2f0e84304490c43": "InvalidVolume.NotFound: The volume 'vol-0a2f0e84304490c43' does not exist.\n\tstatus code: 400, request id: 58776735-a463-40f3-ae7b-95d602e2a466"
kube-controller-manager-ip-172-16-113-117.us-west-2.compute.internal kube-controller-manager E0421 23:53:59.806791       1 aws.go:2299] InvalidVolume.NotFound: The volume 'vol-0a2f0e84304490c43' does not exist.
kube-controller-manager-ip-172-16-113-117.us-west-2.compute.internal kube-controller-manager    status code: 400, request id: 58776735-a463-40f3-ae7b-95d602e2a466
kube-controller-manager-ip-172-16-113-117.us-west-2.compute.internal kube-controller-manager I0421 23:53:59.806802       1 aws.go:1965] Releasing in-process attachment entry: bd -> volume vol-0a2f0e84304490c43
kube-controller-manager-ip-172-16-113-117.us-west-2.compute.internal kube-controller-manager E0421 23:53:59.806809       1 attacher.go:86] Error attaching volume "aws://us-west-2a/vol-0a2f0e84304490c43" to node "ip-172-16-112-89.us-west-2.compute.internal": InvalidVolume.NotFound: The volume 'vol-0a2f0e84304490c43' does not exist.
kube-controller-manager-ip-172-16-113-117.us-west-2.compute.internal kube-controller-manager    status code: 400, request id: 58776735-a463-40f3-ae7b-95d602e2a466

I also got hit by this. The PV had the finalizer "foregroundDeletion" in place. Editing the PV and removing the finalizer allowed the PV to terminate.

Will we eventually stop having to play around with these finalizers via manual edits?

If there are pods or jobs currently using the PV, I found that deleting them solves the issue for me.

Same issue; had to manually remove kubernetes.io/pv-protection to have the PVs deleted. This happened with AKS on k8s 1.17.9.

Same issue happened here. I deleted the EBS volume manually from AWS and then tried to delete the PV, but it got stuck in Terminating status.

pvc-60fbc6ab-8732-4d1e-ae32-b42295553fa1 95Gi RWO Delete Terminating ray-prod/data-newcrate-1 gp2 5d4h

I just had a nearly identical issue happen to me. I'm on Kubernetes 1.19.0 on CentOS 7, mounting a TerraMaster NAS share via NFS. I had just removed all PVs and PVCs as I was testing the creation of those two types of objects in preparation for a Helm install. When I tried to remove them, they hung. I also had to manually edit the PV to remove the finalizer (also kubernetes.io/pv-protection, same as andrei-dascalu), and then it finally deleted.

Same issue here:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

storageClass: longhorn

Solved the issue of the PVC and PV stuck in "Terminating" state by deleting the pod using it.

No, when this happens to my PV, the pod no longer exists.


I had multiple PVCs stuck in Terminating status.
kubectl describe pvc <pvc-name>   (to get the pod it is attached to)
kubectl patch pvc <pvc-name> -p '{"metadata":{"finalizers":null}}'
kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":null}}'
This worked in my K8s cluster.

Thanks for posting these commands to get rid of the pvc

@wolfewicz @DMXGuru if the pods are deleted, the PVC should not get stuck in the Terminating state. Users should not need to remove the finalizer manually.
Could you reproduce your case and give some details here, so that we can help triage?

How and what details would you like? The kubectl commands and output showing this behavior and then a kubectl describe and kubectl get -o yaml for the resultant PV?


@DMXGuru the first thing I want to verify is that there are no pods running and no VolumeSnapshots being taken while the PVC/PV is terminating.

kubectl describe pod | grep ClaimName
kubectl describe volumesnapshot | grep persistentVolumeClaimName

Second, could you describe the sequence in which you performed the pod and PVC deletion? Thanks!
