Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ CHANGELOG
- Reduce EFA installation time for Ubuntu by ~20 minutes by only holding kernel packages for the installed kernel.
- Add GetFunction and GetPolicy permissions to PClusterBuildImageCleanupRole to prevent AccessDenied errors during build image stack deletion.
- Fix validation error messages when `DevSettings` is null or `DevSettings/InstanceTypesData` is missing required fields.
- Fix an issue where cfn-hup enters an endless loop on the head node after a rollback to a cluster state older than 24 hours, caused by cfn-signal failing to signal an expired wait condition handle.

3.14.0
------
Expand Down
7 changes: 7 additions & 0 deletions cli/src/pcluster/templates/cluster_stack.py
Original file line number Diff line number Diff line change
Expand Up @@ -1496,6 +1496,12 @@ def _add_head_node(self):
"chefUpdate": {
"commands": {
"chef": {
# This command runs the update recipe and signals CloudFormation with the result.
# The trailing "|| exit 0" ensures cfn-hup always updates its local metadata cache
# (metadata_db.json) regardless of whether cfn-signal succeeds or fails.
# Without this, if cfn-signal fails (e.g., due to an expired wait condition handle
# after a rollback to a state older than 24h), cfn-hup would not update its cache
# and would enter an endless loop, re-triggering the update recipe every minute.
"command": (
". /etc/parallelcluster/pcluster_cookbook_environment.sh; "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[FUTURE IMPROVEMENTS] In future we should consider to move this block of code to a script in the head node and simply use this metadata command to pass the wait condition handle as an argument. This would not only simplify the testing but also the mitigation on a running environment.

This is out of scope for the current PR and we do not want to change it in 3.14.1.
However, what do you think about it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes a lot of sense. Better readable and maintenance. Also easier to test. Should I add a TODO in the PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to track it into our backlog of internal features

"cinc-client --local-mode --config /etc/chef/client.rb --log_level info"
Expand All @@ -1509,6 +1515,7 @@ def _add_head_node(self):
f" $CFN_BOOTSTRAP_VIRTUALENV_PATH/cfn-signal --exit-code=1 --reason='Update failed'"
f" --region {self.stack.region} --url {cloudformation_url}"
f" '{self.wait_condition_handle.ref}'"
" || exit 0"
),
"cwd": "/etc/chef",
}
Expand Down
Loading