-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GCS backend: Detect and force-unlock stale locks. #17470
base: main
Are you sure you want to change the base?
Conversation
Thanks for the PR @ubschmidt2! This seems like a reasonable feature to me, but it does require a little more care around the lock management than in the "fail locked" case that we have now, and for some cases the architecture of terraform itself prevents us from handling it very well at all. The main situation to consider, is that now that we have locks available, users are putting this into automation and blocking sometimes concurrent operations using the lock. If the client holding the lock loses connectivity temporarily, this can cause a waiting instance to possible incorrectly obtain the lock (granted, losing connectivity for over 4min, then regaining it and completing the run would be a very rare case). Optimally terraform core could detect this and abort from the unsafe situation, but that isn't really possible with the current architecture. The consul backend (in which the lock always requires active connectivity) does 2 things to try and protect the user however -- it knows when the lock was lost and can report an error to the user, and it uses a CAS operation to ensure that the final state being overwritten is the the correct version. These things are tradeoff's of course, as if there are two instances applying at the same time, there is technically no "correct" state to write at the end, but I think at least reporting the situation to the user will alert them that there could be inconsistencies. So, after that background from my adventures in state locking, I'll add the rest of the comments inline in the review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other possible request -- would it be possible to store the initial Generation of the state file in the remoteClient
? Then the client could compare and set the final state against the initial Generation, to add another layer of protection against overwriting the state incorrectly.
Thanks!
Thanks, James, your review comments are very helpful. I'm working on the suggested improvements. |
I've updated the pull request. |
FWIW, the current Travis failure is unrelated to this PR.
|
PTAL, if you have a chance. |
1 similar comment
PTAL, if you have a chance. |
This change fixes the issue of orphaned lock files, which Terraform leaves behind in some scenarios. Such lock files had to be removed manually, either by means of a force-unlock or by deleting the file in GCS. The "updated" timestamp of the lock file is used as an indicator of staleness. This timestamp is updated once per minute as long as the lock is held. A lock file is considered stale and is force-unlocked if its timestamp hasn't been updated for several minutes.
In particular: - added a mutex to remoteClient to prevent concurrent modifications - refactored the background heartbeating - added/improved log messages
Do not force-unlock locks created by clients that don't perform heartbeating on the lock file.
…n a separate call.
5f8aa3d
to
6c8b42a
Compare
50649a2
to
ab8d83b
Compare
Hi @ubschmidt2, Sorry about the long delay here, but we're nearing completion on the long 0.12 road and I'm going over PRs that were blocked by that work. Can you rebase on master and re-run the tests? |
Ah, sorry, James, missed your comment. Yes, I'll do that. Also, still busy here with adoption of 0.12. ;-) |
Do we know if this is still an issue with the current backend? I have added it to the list of PRs for the GCS team to triage. Thanks! |
This change fixes the issue of orphaned lock files, which Terraform leaves
behind in some scenarios. Such lock files had to be removed manually, either
by means of a force-unlock or by deleting the file in GCS.
The "updated" timestamp of the lock file is used
as an indicator of staleness. This timestamp is updated once per minute as long
as the lock is held. A lock file is considered stale and is force-unlocked if
its timestamp hasn't been updated for several minutes.