-
Notifications
You must be signed in to change notification settings - Fork 110
feature(aws): log fallback to ssm based access #12298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds SSM (AWS Systems Manager) fallback support for log collection when SSH access fails on AWS instances. The implementation enables SSM agent on instances during provisioning and provides a boto3-based SSM command runner that can be used as an alternative to SSH-based remote execution.
Key changes:
- Added SSM agent enablement in cloud-init configuration scripts
- Implemented
SSMCommandRunnerclass with boto3 for executing commands via SSM - Enhanced log collection to automatically fall back to SSM when SSH fails on AWS instances
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| sdcm/utils/aws_ssm_runner.py | New module implementing SSM command execution via boto3 |
| unit_tests/test_aws_ssm_runner.py | Comprehensive unit tests for the SSM runner |
| sdcm/utils/aws_region.py | Added SSM configuration method to region setup |
| sdcm/provision/aws/utils.py | Added script to enable SSM agent on instances |
| sdcm/provision/aws/configuration_script.py | Integrated SSM agent enablement into cloud-init |
| sdcm/logcollector.py | Enhanced log collection with SSM fallback for AWS nodes |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
@scylladb/qa-maintainers please give a high level look, |
|
We could wait for sct-agent and install it as the first thing. |
We don't know that, until we see logs. One agent or other, this is an agent we have in place, and we need to figure out this issue, it popping up several times a week now |
3f1265f to
391f845
Compare
since we are running into some issue we are failing to have ssh access but zero logs cause of it (we have multiple time during the years, credentails is or other cloud-init/boot issues) in this change we are gonna make sure ssm-agents are working on our instances, and fallback to log during log collection if we can't have ssh access * added it to the regio configuration to enable it * added it the top of the cloud-init to unmask the agent see: scylladb/scylla-machine-image@b8e494d * `SSMCommandRunner` which have `run()` api as with our ssh based remoters * `CommandLog` collection is falling back to use `SSMCommandRunner` Ref: scylladb#11581 Update sdcm/utils/aws_region.py Co-authored-by: Copilot <[email protected]> Update sdcm/provision/aws/utils.py Co-authored-by: Copilot <[email protected]> Update sdcm/utils/aws_ssm_runner.py Co-authored-by: Copilot <[email protected]>
391f845 to
2f34375
Compare
|
@scylladb/qa-maintainers it would be nice if we can get this review and push, it would help for investigation of ssh connectivity issue we are face across the board |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| except (ImportError, AttributeError, TypeError, ValueError, KeyError, IndexError) as e: | ||
| LOGGER.error("Failed to run SSM command: %s", e) | ||
| return None, is_file_remote | ||
| return log_filename if ok else None, is_file_remote |
Copilot
AI
Oct 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable ok is only defined within the try-except blocks (lines 681 and 701), but it's referenced outside those blocks on line 709. If an exception occurs in line 682-683 before ok is assigned, or if neither SSH nor SSM paths execute, this will raise an UnboundLocalError. Initialize ok = False at the start of the method to ensure it always has a value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asssuming it works, LGTM
As I said, this is enabled with cloud-init anyway, so we could think of using sct-agent for that in future (this would work for all cloud backends then).
I guess for all the backend we'll support our own agent, we'll use that first, and it won't be a fallback from ssh. |
since we are running into some issue we are failing to have ssh access but zero logs cause of it (we have multiple time during the years, credentails is or other cloud-init/boot issues)
in this change we are gonna make sure ssm-agents are working on our instances, and fallback to log during log collection if we can't have ssh access
SSMCommandRunnerwhich haverun()api as with our ssh based remotersCommandLogcollection is falling back to useSSMCommandRunnerRef: #11581
TODO
Testing
PR pre-checks (self review)
backportlabelsReminders
sdcm/sct_config.py)unit-test/folder)