Endless loop on terraform destroy with nodepool in state "FAILED_DESTROYING" #582

Closed
salyh opened this issue Jun 19, 2024 · 6 comments
Labels
bug Something isn't working


salyh commented Jun 19, 2024

Description

I provisioned nodepools via Terraform. Once a nodepool is active, I want to destroy it via terraform destroy.
Terraform then runs for more than 30 minutes trying to destroy the nodepool. Looking into the DCD UI, I see that the status is "FAILED_DESTROYING" (see screenshot).

When I then cancel the Terraform run with Ctrl-C and rerun it, the nodepool is actually destroyed within a few seconds.

Expected behavior

Properly destroy nodepools in state "FAILED_DESTROYING" without the need of a terraform destroy re-run

Environment

Terraform version:

OpenTofu v1.7.2

Provider version:

v6.4.17

OS:

References

#579

[Screenshot: Monosnap 2024-06-19 11-49-48]
salyh added the bug label Jun 19, 2024
Contributor

adeatcu-ionos commented Jun 27, 2024

Hello! Can you provide the Terraform plan that led to this situation? I want to reproduce this scenario.

Our Terraform provider only sends the DELETE request and then waits for the resource to be deleted. In the case you described, the resource reached FAILED_DESTROYING state for some reason (API-related) so the resource wasn't deleted and the loop kept going. It's not an endless loop, it's a loop that has a specific timeout.

What I think happened in this scenario (I still need to reproduce it to be sure):

  • terraform destroy sends the first DELETE request;
  • the DELETE request fails, and the API sets the nodepool to the FAILED_DESTROYING state;
  • Terraform periodically checks whether the resource has been deleted (the loop), but since the resource is still there, in the FAILED_DESTROYING state, Terraform keeps checking;
  • you cancel the previous command and run terraform destroy again, which sends another DELETE request; this final DELETE request successfully deletes the nodepool.
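The sequence above can be sketched as a small simulation. This is only an illustration, not the provider's code: `FakeNodePoolAPI`, `FakeClock`, and `destroy` are hypothetical names, the 10-second poll interval is an assumption, and the 3-hour timeout is the value given later in this thread. The fake API fails the first DELETE and accepts the second, which matches the behavior reported in the issue.

```python
class FakeNodePoolAPI:
    """Hypothetical in-memory stand-in for the cloud API (not the real client)."""

    def __init__(self, failing_deletes=1):
        self.state = "ACTIVE"
        self.delete_calls = 0
        self._failing_deletes = failing_deletes  # number of DELETEs that fail first

    def delete(self):
        self.delete_calls += 1
        if self.delete_calls <= self._failing_deletes:
            self.state = "FAILED_DESTROYING"  # deletion failed on the API side
        else:
            self.state = None  # resource is gone

    def get_state(self):
        return self.state  # None means "not found"


class FakeClock:
    """Simulated clock so the example finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def destroy(api, clock, timeout_s=3 * 60 * 60, poll_interval_s=10):
    """Send one DELETE, then poll until the resource is gone or the timeout hits."""
    api.delete()  # sent exactly once per destroy run
    deadline = clock.time() + timeout_s
    while clock.time() < deadline:
        if api.get_state() is None:
            return "deleted"
        clock.sleep(poll_interval_s)  # assumed poll interval
    return "timed out"


api = FakeNodePoolAPI(failing_deletes=1)
clock = FakeClock()
first_run = destroy(api, clock)   # resource stays in FAILED_DESTROYING
second_run = destroy(api, clock)  # a fresh DELETE succeeds
```

In this model the first run never sees the resource disappear and only ends at the timeout, while the second run returns on its first poll.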

Regarding the description under Expected behavior:

> Properly destroy nodepools in state "FAILED_DESTROYING" without the need of a terraform destroy re-run

As I said above, the provider only sends DELETE requests to the API and waits for the resource to be deleted; the deletion process itself is handled by the API. If the API puts the resource in the FAILED_DESTROYING state, something went wrong during the deletion. That has nothing to do with the TF provider, which only sends the requests.

The provider did its job: the DELETE request was sent. The fact that the resource was in FAILED_DESTROYING has nothing to do with the provider. You chose to run terraform destroy again, so you sent another DELETE request for a resource that was in FAILED_DESTROYING, and somehow it worked, but this is solely related to the API.

Author

salyh commented Jun 27, 2024

> It's not an endless loop, it's a loop that has a specific timeout.

What is the timeout?

> The provider did the job, the DELETE request was sent, the fact that the resource was in FAILED_DESTROYING has nothing to do with the provider. You chose to run terraform destroy again, so basically you sent another DELETE request on a resource that was in FAILED_DESTROYING and somehow it worked, but this is solely related to the API.

Mhh, that is passing the buck back and forth. I think the provider needs to handle API failures appropriately.

For the plan and logs, please refer to internal support ticket 207171709.

Contributor

adeatcu-ionos commented Jun 27, 2024

@salyh

> What is the timeout?

3 hours. I will also leave a reference to that.

> Mhh, that is passing the buck back and forth. I think the provider needs to handle API failures appropriately.

It really isn't. Besides interrupting the loop when the resource reaches the FAILED_DESTROYING state (I will analyze the implications of this), I don't think anything useful can be implemented. The provider is responsible for taking the data from the tf file and sending it to the API for a create/update command, and for sending a DELETE request for a resource deletion. Suppose that the provider sends another DELETE request whenever it receives FAILED_DESTROYING. There is no guarantee that the result will be different. Also, how many requests should we send before concluding that something is really not working inside the API?

The DELETE request should be done once and the API should take care of it. The provider only needs to check that the resource is properly deleted before informing the user and reflecting that change in the tf state.
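A minimal sketch of the "interrupt the loop" option mentioned above, assuming the provider can read the resource state on every poll. All names here (`wait_for_deletion`, `DeletionFailedError`, `TERMINAL_FAILURE_STATES`) are hypothetical and not taken from the provider. Only FAILED_DESTROYING comes from this thread; the "DESTROYING" state used in the demo and the poll settings are assumptions.

```python
class DeletionFailedError(Exception):
    """Raised when the API reports that the deletion has failed."""


# Only FAILED_DESTROYING is named in this thread; other states would be added here.
TERMINAL_FAILURE_STATES = {"FAILED_DESTROYING"}


def wait_for_deletion(get_state, sleep, max_polls=1080, poll_interval_s=10):
    """Poll until the resource is gone; stop early on a terminal failure state.

    get_state returns the current state string, or None once the resource
    no longer exists. sleep is injected so the loop can be tested quickly.
    """
    for _ in range(max_polls):
        state = get_state()
        if state is None:
            return  # deleted
        if state in TERMINAL_FAILURE_STATES:
            # Fail fast instead of polling until the timeout.
            raise DeletionFailedError(f"deletion failed: resource is in state {state}")
        sleep(poll_interval_s)
    raise TimeoutError("resource still present after the maximum number of polls")


def demo():
    """Simulate two in-progress polls followed by a failed deletion."""
    states = iter(["DESTROYING", "DESTROYING", "FAILED_DESTROYING"])
    sleeps = []
    try:
        wait_for_deletion(lambda: next(states), sleeps.append)
    except DeletionFailedError as err:
        return len(sleeps), str(err)
    return len(sleeps), None
```

With this shape the user gets an error after the poll that observes the failure, rather than waiting for the full timeout, and can decide whether to rerun the destroy.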

Author

salyh commented Jun 27, 2024

I disagree - the provider can (and should) indeed send multiple DELETE requests because:

  • It wouldn't do any harm
  • DELETE is idempotent
  • It would deal better with API failures
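As a rough sketch of what this proposal could look like, the loop below re-sends DELETE a bounded number of times whenever the resource lands in FAILED_DESTROYING. Everything here is hypothetical (`FlakyNodePoolAPI`, `destroy_with_retries`, `max_attempts`, `polls_per_attempt`), and whether a repeated DELETE actually succeeds depends on the API, which this example only simulates.

```python
class FlakyNodePoolAPI:
    """Hypothetical fake API whose first few DELETE calls fail."""

    def __init__(self, failing_deletes):
        self.state = "ACTIVE"
        self.delete_calls = 0
        self._failing_deletes = failing_deletes

    def delete(self):
        self.delete_calls += 1
        if self.delete_calls <= self._failing_deletes:
            self.state = "FAILED_DESTROYING"
        else:
            self.state = None  # resource is gone

    def get_state(self):
        return self.state  # None means "not found"


def destroy_with_retries(api, max_attempts=3, polls_per_attempt=5):
    """Send DELETE up to max_attempts times; return the attempt that succeeded.

    The cap answers the "how many requests" question with an explicit,
    configurable limit instead of retrying forever.
    """
    for attempt in range(1, max_attempts + 1):
        api.delete()  # safe to repeat because DELETE is idempotent
        for _ in range(polls_per_attempt):
            state = api.get_state()
            if state is None:
                return attempt
            if state == "FAILED_DESTROYING":
                break  # this attempt failed; send DELETE again
            # a real implementation would sleep between polls here
    raise RuntimeError(f"resource not deleted after {max_attempts} DELETE attempts")


attempts_needed = destroy_with_retries(FlakyNodePoolAPI(failing_deletes=1))
```

If every attempt ends in FAILED_DESTROYING, the function gives up with an error after `max_attempts` requests, so a persistent API-side failure is still surfaced to the user.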

@adeatcu-ionos
Contributor

@salyh I agree with the points above, but sending multiple DELETE requests still doesn't guarantee that your issue will be solved, since we are talking about: "Properly destroy nodepools in state 'FAILED_DESTROYING' without the need of a terraform destroy re-run". The resource was in FAILED_DESTROYING. I haven't tested this, but from my experience with other resources, if the deletion failed the first time, it will fail again. I'm not sure how this worked for you; I haven't worked much with this API.

I will discuss this with my colleagues who work on the API. If multiple DELETE requests can indeed lead to the deletion of the resource (even when the resource is in the FAILED_DESTROYING state), we will implement this mechanism. However, I'd rather see it as a feature (the ability to define a number of retries for a specific request) than as a bug fix.

@cristiGuranIonos
Collaborator

Terraform will throw an error on reaching any of these states, as they are not recoverable.
