Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCS backend: Detect and force-unlock stale locks. #17470

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

ubschmidt2
Copy link
Contributor

This change fixes the issue of orphaned lock files, which Terraform leaves
behind in some scenarios. Such lock files had to be removed manually, either
by means of a force-unlock or by deleting the file in GCS.

The "updated" timestamp of the lock file is used
as an indicator of staleness. This timestamp is updated once per minute as long
as the lock is held. A lock file is considered stale and is force-unlocked if
its timestamp hasn't been updated for several minutes.

@ubschmidt2
Copy link
Contributor Author

@octo @danawillow

@jbardin
Copy link
Member

jbardin commented Mar 6, 2018

Thanks for the PR @ubschmidt2!

This seems like a reasonable feature to me, but it does require a little more care around the lock management than in the "fail locked" case that we have now, and for some cases the architecture of terraform itself prevents us from handling it very well at all.

The main situation to consider, is that now that we have locks available, users are putting this into automation and blocking sometimes concurrent operations using the lock. If the client holding the lock loses connectivity temporarily, this can cause a waiting instance to possible incorrectly obtain the lock (granted, losing connectivity for over 4min, then regaining it and completing the run would be a very rare case).

Optimally terraform core could detect this and abort from the unsafe situation, but that isn't really possible with the current architecture. The consul backend (in which the lock always requires active connectivity) does 2 things to try and protect the user however -- it knows when the lock was lost and can report an error to the user, and it uses a CAS operation to ensure that the final state being overwritten is the the correct version.

These things are tradeoff's of course, as if there are two instances applying at the same time, there is technically no "correct" state to write at the end, but I think at least reporting the situation to the user will alert them that there could be inconsistencies.

So, after that background from my adventures in state locking, I'll add the rest of the comments inline in the review.

Copy link
Member

@jbardin jbardin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other possible request -- would it be possible to store the initial Generation of the state file in the remoteClient? Then the client could compare and set the final state against the initial Generation, to add another layer of protection against overwriting the state incorrectly.

Thanks!

@ubschmidt2
Copy link
Contributor Author

Thanks, James, your review comments are very helpful. I'm working on the suggested improvements.

@ubschmidt2
Copy link
Contributor Author

I've updated the pull request.

@ubschmidt2
Copy link
Contributor Author

FWIW, the current Travis failure is unrelated to this PR.

# github.com/hashicorp/terraform/command
command/output_test.go:253:15: unknown field 'ContextOpts' in struct literal of type Meta
command/output_test.go:281:15: unknown field 'ContextOpts' in struct literal of type Meta

@ubschmidt2
Copy link
Contributor Author

PTAL, if you have a chance.

1 similar comment
@ubschmidt2
Copy link
Contributor Author

PTAL, if you have a chance.

This change fixes the issue of orphaned lock files, which Terraform leaves
behind in some scenarios. Such lock files had to be removed manually, either
by means of a force-unlock or by deleting the file in GCS.

The "updated" timestamp of the lock file is used
as an indicator of staleness. This timestamp is updated once per minute as long
as the lock is held. A lock file is considered stale and is force-unlocked if
its timestamp hasn't been updated for several minutes.
In particular:
- added a mutex to remoteClient to prevent concurrent modifications
- refactored the background heartbeating
- added/improved log messages
Do not force-unlock locks created by clients that don't perform heartbeating
on the lock file.
@jbardin
Copy link
Member

jbardin commented Feb 21, 2019

Hi @ubschmidt2,

Sorry about the long delay here, but we're nearing completion on the long 0.12 road and I'm going over PRs that were blocked by that work.

Can you rebase on master and re-run the tests?

@ubschmidt2
Copy link
Contributor Author

Ah, sorry, James, missed your comment. Yes, I'll do that. Also, still busy here with adoption of 0.12. ;-)

Base automatically changed from master to main February 24, 2021 18:01
@hashicorp-cla
Copy link

hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

@crw
Copy link
Contributor

crw commented Aug 28, 2024

Do we know if this is still an issue with the current backend? I have added it to the list of PRs for the GCS team to triage. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants