[Release-3.14.1][Bug] Fix cfn-hup endless loop after rollback to cluster state older than 24 hours #7146

hehe7318 · 2025-12-09T18:19:32Z

Description of changes

When a cluster update fails and triggers a rollback to a state older than 24 hours, cfn-hup enters an endless loop on the head node. This happens because:

1. The rollback restores the launch template metadata to reference an expired wait condition handle (wait conditions expire after 24h)
2. cfn-signal fails to signal the expired handle and returns non-zero
3. cfn-hup sees the non-zero exit code and does not update its local metadata cache (metadata_db.json)
4. On the next polling interval, cfn-hup detects the same "change" and re-triggers the update recipe, creating an infinite loop

This fix appends || exit 0 to the update command, ensuring cfn-hup always updates its metadata cache regardless of whether cfn-signal succeeds or fails. This prevents the endless loop while still allowing CloudFormation to handle timeouts appropriately.
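
The loop and the fix described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not cfn-hup's actual implementation: `ToyHup`, `signal_expired_handle`, `with_exit_zero_fallback`, `count_hook_runs` and the metadata strings are hypothetical names invented for illustration. The only behavior taken from the description is that the cache advances only when the update command exits with 0.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyHup:
    """Toy model of a polling daemon that caches the last metadata it acted on."""
    run_hook: Callable[[], int]        # returns the update command's exit code
    cached: str = "v2-failed-update"   # hypothetical stand-in for metadata_db.json
    hook_runs: int = 0

    def poll(self, remote: str) -> None:
        if remote == self.cached:
            return  # no change detected, nothing to run
        self.hook_runs += 1
        if self.run_hook() == 0:
            self.cached = remote  # the cache only advances on exit code 0


def signal_expired_handle() -> int:
    """Hypothetical stand-in for cfn-signal hitting an expired wait condition handle."""
    return 1


def with_exit_zero_fallback(cmd: Callable[[], int]) -> Callable[[], int]:
    """Models `cmd || exit 0`: the overall exit code is 0 whether or not cmd fails."""
    def wrapped() -> int:
        cmd()      # run the command; its failure is absorbed by the fallback
        return 0
    return wrapped


def count_hook_runs(run_hook: Callable[[], int], polls: int = 5) -> int:
    """Poll the same rolled-back metadata several times and count hook executions."""
    hup = ToyHup(run_hook=run_hook)
    for _ in range(polls):
        hup.poll(remote="v1-rolled-back")
    return hup.hook_runs


print(count_hook_runs(signal_expired_handle))                           # 5
print(count_hook_runs(with_exit_zero_fallback(signal_expired_handle)))  # 1
```

Without the fallback, the hook runs on every poll because the cache never catches up; with it, the hook runs once and later polls see no change.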

Tests

TODO: manually test.
TODO: integ test.

Checklist

Make sure you are pointing to the right branch.
If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Check all commits' messages are clear, describing what and why vs how.
Make sure to have added unit tests or integration tests to cover the new/modified code.
Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

When a cluster update fails and triggers a rollback to a state older than 24 hours, cfn-hup enters an endless loop on the head node. This happens because:

1. The rollback restores the launch template metadata to reference an expired wait condition handle (wait conditions expire after 24h)
2. cfn-signal fails to signal the expired handle and returns non-zero
3. cfn-hup sees the non-zero exit code and does not update its local metadata cache (metadata_db.json)
4. On the next polling interval, cfn-hup detects the same "change" and re-triggers the update recipe, creating an infinite loop

This fix appends "; exit 0" to the update command, ensuring cfn-hup always updates its metadata cache regardless of whether cfn-signal succeeds or fails. This prevents the endless loop while still allowing CloudFormation to handle timeouts appropriately.

hehe7318 requested a review from a team as a code owner December 9, 2025 18:19

hehe7318 added the 3.x label Dec 9, 2025


Use || exit rather than ; exit

80ece6e
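
The PR does not state why "|| exit 0" was preferred over "; exit 0". The sketch below only shows the observable difference between the two forms in a POSIX shell, using `false` and `true` as stand-ins for a failing and a succeeding command. The helper names `shell_exit_code` and `shell_stdout` are hypothetical, and the example assumes `sh` is available on the PATH.

```python
import subprocess


def shell_exit_code(script: str) -> int:
    """Run a script with the POSIX shell and return its exit code."""
    return subprocess.run(["sh", "-c", script]).returncode


def shell_stdout(script: str) -> str:
    """Run a script with the POSIX shell and return what it printed to stdout."""
    result = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
    return result.stdout


# A failing command on its own reports a non-zero exit code.
print(shell_exit_code("false"))              # 1

# Both forms turn that failure into exit code 0.
print(shell_exit_code("false || exit 0"))    # 0
print(shell_exit_code("false; exit 0"))      # 0

# The forms differ when the command succeeds and more commands follow:
# `|| exit 0` only exits on failure, while `; exit 0` always exits.
print(repr(shell_stdout("true || exit 0; echo after")))  # 'after\n'
print(repr(shell_stdout("true; exit 0; echo after")))    # ''
```

In both forms the failure of the command is masked from the caller; the difference is that "; exit 0" terminates the shell unconditionally, whereas "|| exit 0" terminates it only when the preceding command fails.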
