GCS backend: Detect and force-unlock stale locks. #17470

ubschmidt2 · 2018-03-01T00:10:52Z

This change fixes the issue of orphaned lock files, which Terraform leaves
behind in some scenarios. Such lock files had to be removed manually, either
by means of a force-unlock or by deleting the file in GCS.

The "updated" timestamp of the lock file is used
as an indicator of staleness. This timestamp is updated once per minute as long
as the lock is held. A lock file is considered stale and is force-unlocked if
its timestamp hasn't been updated for several minutes.

ubschmidt2 · 2018-03-01T00:20:37Z

@octo @danawillow

jbardin · 2018-03-06T00:14:20Z

Thanks for the PR @ubschmidt2!

This seems like a reasonable feature to me, but it does require a little more care around the lock management than in the "fail locked" case that we have now, and for some cases the architecture of terraform itself prevents us from handling it very well at all.

The main situation to consider is that, now that we have locks available, users are putting this into automation and sometimes blocking concurrent operations using the lock. If the client holding the lock loses connectivity temporarily, this can cause a waiting instance to obtain the lock incorrectly (granted, losing connectivity for over 4min, then regaining it and completing the run would be a very rare case).

Optimally, terraform core could detect this and abort from the unsafe situation, but that isn't really possible with the current architecture. The consul backend (in which the lock always requires active connectivity) does 2 things to try and protect the user, however -- it knows when the lock was lost and can report an error to the user, and it uses a CAS operation to ensure that the final state being overwritten is the correct version.

These things are tradeoffs of course: if there are two instances applying at the same time, there is technically no "correct" state to write at the end, but I think at least reporting the situation to the user will alert them that there could be inconsistencies.

So, after that background from my adventures in state locking, I'll add the rest of the comments inline in the review.

jbardin

One other possible request -- would it be possible to store the initial Generation of the state file in the remoteClient? Then the client could compare and set the final state against the initial Generation, to add another layer of protection against overwriting the state incorrectly.

Thanks!

backend/remote-state/gcs/client.go

ubschmidt2 · 2018-03-07T23:11:19Z

Thanks, James, your review comments are very helpful. I'm working on the suggested improvements.

ubschmidt2 · 2018-06-20T15:29:42Z

I've updated the pull request.

ubschmidt2 · 2018-06-21T09:38:31Z

FWIW, the current Travis failure is unrelated to this PR.

# github.com/hashicorp/terraform/command
command/output_test.go:253:15: unknown field 'ContextOpts' in struct literal of type Meta
command/output_test.go:281:15: unknown field 'ContextOpts' in struct literal of type Meta

ubschmidt2 · 2018-09-11T12:17:29Z

PTAL, if you have a chance.

ubschmidt2 · 2018-10-29T12:05:10Z

PTAL, if you have a chance.

jbardin · 2019-02-21T21:57:42Z

Hi @ubschmidt2,

Sorry about the long delay here, but we're nearing completion on the long 0.12 road and I'm going over PRs that were blocked by that work.

Can you rebase on master and re-run the tests?

ubschmidt2 · 2019-06-26T17:57:41Z

Ah, sorry, James, missed your comment. Yes, I'll do that. Also, still busy here with adoption of 0.12. ;-)

hashicorp-cla · 2022-03-12T17:39:23Z

All committers have signed the CLA.

crw · 2024-08-28T17:25:07Z

Do we know if this is still an issue with the current backend? I have added it to the list of PRs for the GCS team to triage. Thanks!

jbardin requested changes Mar 6, 2018

Four inline review comments on backend/remote-state/gcs/client.go (all outdated).

ubschmidt2 force-pushed the gcs_locking branch from 8797354 to df19af4 Compare June 21, 2018 09:27

ubschmidt2 force-pushed the gcs_locking branch from 2a7c09d to 76affbd Compare June 28, 2018 11:51

apparentlymart added bug backend/gcs labels Jul 10, 2018

ubschmidt2 added 10 commits November 5, 2018 12:01

Addressed review comments.

c4ec04b

In particular:
- added a mutex to remoteClient to prevent concurrent modifications
- refactored the background heartbeating
- added/improved log messages

Reduce diff by undoing a few lines of refactorings.

e9df8f3

Facilitate a safe migration path.

a9ae772

Do not force-unlock locks created by clients that don't perform heartbeating on the lock file.

Fix typo.

9f6dcf3

Use the well-known x-goog-meta prefix for metadata headers.

6e6040a

No need to pass around the lock file handle as a parameter.

111fe6d

Use the OAuth scope that is required for storage.objects.update.

e536d07

Set the metadata header when creating the lock file, not afterwards in a separate call.

eccf30c

Get backend tests building again after the state manager refactoring

6c8b42a

ubschmidt2 force-pushed the gcs_locking branch from 5f8aa3d to 6c8b42a Compare November 5, 2018 11:33

Introduce configuration knobs for determining the staleness of a lock file.

ab8d83b

ubschmidt2 force-pushed the gcs_locking branch from 50649a2 to ab8d83b Compare November 5, 2018 15:35

Base automatically changed from master to main February 24, 2021 18:01
