Trident: Does not return after logging "Error identifying update scenario."

Created on 2 Jul 2021  ·  5 Comments  ·  Source: NetApp/trident

Describe the bug
We found code that does not return after receiving an error:
https://github.com/NetApp/trident/blob/stable/v21.04/operator/controllers/orchestrator/controller.go#L1177-L1184
In that case currentInstalledTridentVersion and tridentK8sConfigVersion are empty, but both variables are still used in the code that follows.
I'm not sure this is a bug, but because the function does not return after the error, the empty variables might be used in unintended ways.
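
The concern can be illustrated with a minimal Go sketch. This is not the operator's code: `getVersions`, `reconcileLogOnly`, and `reconcileReturn` are hypothetical names, and the real logic lives in the controller.go lines linked above. The sketch only shows that when a lookup fails and the caller logs without returning, the zero-valued strings flow into later comparisons.

```go
package main

import (
	"errors"
	"fmt"
)

// getVersions is a hypothetical stand-in for the version lookup. On failure
// it returns empty strings together with the error.
func getVersions(fail bool) (installed, k8sConfig string, err error) {
	if fail {
		return "", "", errors.New("unable to get list of deployments")
	}
	return "21.04.0", "1.20", nil
}

// reconcileLogOnly logs the error and keeps going, so the empty version
// string is still compared against the image version.
func reconcileLogOnly(fail bool, imageVersion string) (bool, error) {
	installed, _, err := getVersions(fail)
	if err != nil {
		fmt.Println("Error identifying update scenario:", err)
		// No return here: execution continues with installed == "".
	}
	return installed != imageVersion, nil
}

// reconcileReturn stops at the error, so the caller can retry later
// instead of acting on empty values.
func reconcileReturn(fail bool, imageVersion string) (bool, error) {
	installed, _, err := getVersions(fail)
	if err != nil {
		return false, fmt.Errorf("identifying update scenario: %w", err)
	}
	return installed != imageVersion, nil
}

func main() {
	reinstall, _ := reconcileLogOnly(true, "21.04.0")
	fmt.Println("log only, reinstall requested:", reinstall)

	reinstall, err := reconcileReturn(true, "21.04.0")
	fmt.Println("return on error, reinstall requested:", reinstall, "err:", err)
}
```

In the log-only variant a failed lookup looks identical to a version mismatch, which is the unexpected use described above.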

Environment

  • Trident version: v21.04
bug tracked


All 5 comments

Hello @takuhiro,

Thank you for this query.
Failure to get currentInstalledTridentVersion and tridentK8sConfigVersion from the existing deployment or daemonset means something is wrong: for example, this information is not set properly, is missing, or has been tampered with, or possibly both the deployment and the daemonset are absent.

An ideal thing to do would be to re-create these objects. Now there are two empty variables here:

  1. tridentK8sConfigVersion: when this is returned empty, Trident cannot tell whether this is an upgrade scenario. So, in this scenario, your Trident installation is not marked for upgrade (here).
  2. currentInstalledTridentVersion: this variable is used to set the version in the CR status, and is used exactly once in installer.go to compare the existing Trident version with the version of Trident offered by the image (here). In this scenario the comparison should fail and mark Trident for re-installation.

The expectation here is that the re-installation would self-heal the deployment and the daemonset such that in the next iteration version information can be properly retrieved.

@ntap-arorar Thank you for the detailed answer. I understand the logic for the case where tridentK8sConfigVersion and currentInstalledTridentVersion are empty, and that it would be resolved in the next iteration.

In our case, the complete error message was the following. (I'm sorry, I should have included the context first.)

time="2021-06-26T14:56:52Z" level=error msg="Error identifying update scenario." controllingCR=trident err="unable to get list of deployments"

The error "unable to get list of deployments" was produced here. The root cause was probably a temporary failure of a GET request to the api-server here, which could lead to an unnecessary re-installation. In this case, I think it would be better to return and retry in the next iteration.

I believe there is scope for making this logic more resilient to unintended re-installs caused by temporary issues (false positives) in the Kubernetes environment. This does depend on the error type being returned; otherwise the Operator may fail to heal real Trident installations that are in bad shape.

I agree that the error should be handled according to its type. If the error type represents a temporary issue, the controller can skip re-installation because the error would likely be resolved in the next iteration.
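
One way to picture handling by error type is the sketch below. It uses only the Go standard library, and `errTransient`, `listDeployments`, and `nextStep` are hypothetical names, not Trident or client-go identifiers. How a real operator would recognize a temporary api-server failure is left open here; the sentinel error is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient is a hypothetical sentinel marking errors that are expected
// to clear up on their own (for example a failed GET to the api-server).
var errTransient = errors.New("transient API error")

type step int

const (
	stepProceed   step = iota // no error: continue with the normal flow
	stepRetry                 // temporary problem: return and retry next iteration
	stepReinstall             // persistent problem: let re-installation heal it
)

// listDeployments is a hypothetical lookup that fails either transiently
// or persistently, wrapping the sentinel in the transient case.
func listDeployments(transient bool) error {
	if transient {
		return fmt.Errorf("unable to get list of deployments: %w", errTransient)
	}
	return errors.New("unable to get list of deployments: objects missing")
}

// nextStep chooses what the controller does based on the error type.
func nextStep(err error) step {
	switch {
	case err == nil:
		return stepProceed
	case errors.Is(err, errTransient):
		return stepRetry
	default:
		return stepReinstall
	}
}

func main() {
	fmt.Println("transient error retries:", nextStep(listDeployments(true)) == stepRetry)
	fmt.Println("persistent error reinstalls:", nextStep(listDeployments(false)) == stepReinstall)
}
```

Keeping the default branch as re-installation preserves the self-healing behavior for installations that are genuinely in bad shape, while only errors explicitly marked as temporary are deferred.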

This issue is fixed with commit bf43f8 and is included in the Trident 21.10.0 release.
