@hehe7318 hehe7318 commented Dec 9, 2025

Description of changes

When a cluster update fails and triggers a rollback to a state older than 24 hours, cfn-hup enters an endless loop on the head node. This happens because:

  1. The rollback restores the launch template metadata to reference an expired wait condition handle (wait conditions expire after 24h)
  2. cfn-signal fails to signal the expired handle and returns non-zero
  3. cfn-hup sees the non-zero exit code and does not update its local metadata cache (metadata_db.json)
  4. On the next polling interval, cfn-hup detects the same "change" and re-triggers the update recipe, creating an infinite loop
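The four steps above can be sketched as a toy model in shell. This is only an illustration of the described behaviour, not cfn-hup's actual code: the names run_hook, poll_once, remote and cached are hypothetical, and cached merely stands in for metadata_db.json.

```shell
#!/bin/sh
# Toy model of the polling behaviour described above (NOT cfn-hup's real code).
# All names here are hypothetical and exist only for illustration.

remote="rollback-metadata"   # metadata restored by the rollback
cached="previous-metadata"   # stand-in for metadata_db.json
runs=0

# Stand-in for the update command whose signalling step fails
# because the wait condition handle has expired.
run_hook() {
    runs=$((runs + 1))
    return 1
}

# One polling interval: re-run the hook whenever remote and cache differ,
# and refresh the cache only when the hook exits with status zero.
poll_once() {
    if [ "$remote" != "$cached" ]; then
        if run_hook; then
            cached="$remote"
        fi
    fi
}

# Three polling intervals: the cache never catches up, so the hook runs each time.
for i in 1 2 3; do
    poll_once
done

echo "hook ran $runs times; cache is still '$cached'"
```

Because the hook never returns zero, the cache never matches the restored metadata, so every interval looks like a fresh change.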

This fix appends || exit 0 to the update command so that the command always exits with status zero. cfn-hup therefore updates its metadata cache whether cfn-signal succeeds or fails, which breaks the endless loop. A signal that never arrives is still left to CloudFormation, which handles the timeout as usual.
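A minimal sketch of the effect on the exit status, assuming a hypothetical failing_signal function as a stand-in for the failing cfn-signal call (the real update command is not reproduced here). The commit message in this PR writes the guard as "; exit 0" rather than "|| exit 0"; the sketch runs both forms to show that either one leaves the overall status at zero.

```shell
#!/bin/sh
# Hypothetical stand-in for a cfn-signal call that fails on an expired handle.
failing_signal() {
    return 1
}

# Without a guard, the failure becomes the exit status of the whole command.
unguarded=0
( failing_signal ) || unguarded=$?

# Guard as written in the PR description: exit 0 only if the command failed.
( failing_signal || exit 0 )
guarded_or=$?

# Guard as written in the commit message: always exit 0 afterwards.
( failing_signal; exit 0 )
guarded_seq=$?

echo "unguarded=$unguarded guarded_or=$guarded_or guarded_seq=$guarded_seq"
```

Each variant runs in a subshell so that exit ends only that subshell, which mimics a hook command launched as its own process.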

Tests

  • TODO: manually test.
  • TODO: integ test.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

When a cluster update fails and triggers a rollback to a state older than
24 hours, cfn-hup enters an endless loop on the head node. This happens
because:

1. The rollback restores the launch template metadata to reference an
   expired wait condition handle (wait conditions expire after 24h)
2. cfn-signal fails to signal the expired handle and returns non-zero
3. cfn-hup sees the non-zero exit code and does not update its local
   metadata cache (metadata_db.json)
4. On the next polling interval, cfn-hup detects the same "change" and
   re-triggers the update recipe, creating an infinite loop

This fix appends "; exit 0" to the update command, ensuring cfn-hup
always updates its metadata cache regardless of whether cfn-signal
succeeds or fails. This prevents the endless loop while still allowing
CloudFormation to handle timeouts appropriately.
@hehe7318 hehe7318 requested a review from a team as a code owner December 9, 2025 18:19
@hehe7318 hehe7318 added the 3.x label Dec 9, 2025