feature(aws): log fallback to ssm based access #12298

fruch · 2025-10-23T10:28:28Z

We keep running into cases where SSH access to an instance fails and we are left with zero logs explaining why (this has happened multiple times over the years, caused by credentials issues or other cloud-init/boot issues).

In this change we make sure the SSM agents are working on our instances, and fall back to SSM during log collection if we can't get SSH access.

added it to the region configuration to enable it
added it to the top of the cloud-init to unmask the agent, see: scylladb/scylla-machine-image@b8e494d
added SSMCommandRunner, which has a run() API like our SSH-based remoters
CommandLog collection falls back to SSMCommandRunner
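
To make the `run()` idea above concrete, here is a minimal sketch of what an SSM-backed runner could look like. It is not the PR's `SSMCommandRunner`: the class name `SketchSSMRunner`, the `SSMResult` type, and the polling parameters are hypothetical. The only real AWS calls assumed are the boto3 SSM client's `send_command` and `get_command_invocation`, and the client is injected so the sketch can be exercised without AWS access.

```python
import time
from dataclasses import dataclass

# Invocation states after which SSM will not change the result any more.
TERMINAL_STATUSES = {"Success", "Failed", "Cancelled", "TimedOut"}


@dataclass
class SSMResult:
    """Hypothetical result type, loosely mirroring what an SSH remoter returns."""
    stdout: str
    stderr: str
    exit_status: int

    @property
    def ok(self) -> bool:
        return self.exit_status == 0


class SketchSSMRunner:
    """Illustrative stand-in, NOT the SSMCommandRunner added by this PR."""

    def __init__(self, ssm_client, instance_id, poll_interval=2.0):
        # ssm_client is assumed to behave like boto3.client("ssm", region_name=...)
        self.ssm = ssm_client
        self.instance_id = instance_id
        self.poll_interval = poll_interval

    def run(self, cmd, timeout=300):
        # AWS-RunShellScript is the stock SSM document for shell commands.
        response = self.ssm.send_command(
            InstanceIds=[self.instance_id],
            DocumentName="AWS-RunShellScript",
            Parameters={"commands": [cmd]},
        )
        command_id = response["Command"]["CommandId"]
        deadline = time.monotonic() + timeout
        while True:
            invocation = self.ssm.get_command_invocation(
                CommandId=command_id, InstanceId=self.instance_id
            )
            if invocation["Status"] in TERMINAL_STATUSES:
                return SSMResult(
                    stdout=invocation.get("StandardOutputContent", ""),
                    stderr=invocation.get("StandardErrorContent", ""),
                    exit_status=invocation.get("ResponseCode", -1),
                )
            if time.monotonic() >= deadline:
                raise TimeoutError(f"SSM command {command_id} did not finish in {timeout}s")
            time.sleep(self.poll_interval)
```

A production version would also need to tolerate the short window right after `send_command` in which the invocation is not yet visible to `get_command_invocation`; that handling is left out here to keep the sketch short.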

Ref: #11581

TODO

configure all regions

Testing

locally - tested SSM implementation via actual machines, and region configuration code
aws provision
locally hardcode the fallback - to validate it's working

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add New configuration option and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)

Copilot

Pull Request Overview

This PR adds SSM (AWS Systems Manager) fallback support for log collection when SSH access fails on AWS instances. The implementation enables SSM agent on instances during provisioning and provides a boto3-based SSM command runner that can be used as an alternative to SSH-based remote execution.

Key changes:

Added SSM agent enablement in cloud-init configuration scripts
Implemented SSMCommandRunner class with boto3 for executing commands via SSM
Enhanced log collection to automatically fall back to SSM when SSH fails on AWS instances
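
The fallback described in the list above can be sketched roughly as follows. This is an illustration only, not the code in `sdcm/logcollector.py`: the function name `collect_with_fallback` and the two callables are hypothetical, and the broad `except Exception` is a simplification that real code would narrow.

```python
import logging
from typing import Callable, Optional

LOGGER = logging.getLogger(__name__)


def collect_with_fallback(
    cmd: str,
    run_over_ssh: Callable[[str], str],
    run_over_ssm: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return the command output, preferring SSH and using SSM only if SSH fails."""
    try:
        return run_over_ssh(cmd)
    except Exception as exc:  # simplification: real code would catch narrower errors
        LOGGER.warning("SSH failed for %r: %s", cmd, exc)
    if run_over_ssm is None:  # e.g. a node on a backend without an SSM agent
        return None
    try:
        return run_over_ssm(cmd)
    except Exception as exc:
        LOGGER.error("SSM fallback failed for %r: %s", cmd, exc)
        return None
```

Passing `run_over_ssm=None` models a node where no SSM path exists, so the caller gets `None` instead of a second failure.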

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file

| File | Description |
| --- | --- |
| sdcm/utils/aws_ssm_runner.py | New module implementing SSM command execution via boto3 |
| unit_tests/test_aws_ssm_runner.py | Comprehensive unit tests for the SSM runner |
| sdcm/utils/aws_region.py | Added SSM configuration method to region setup |
| sdcm/provision/aws/utils.py | Added script to enable SSM agent on instances |
| sdcm/provision/aws/configuration_script.py | Integrated SSM agent enablement into cloud-init |
| sdcm/logcollector.py | Enhanced log collection with SSM fallback for AWS nodes |
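
As a rough illustration of the two provisioning entries above, the sketch below builds a shell fragment that unmasks and starts the agent and places it first in the user data. The function names and the default unit name `amazon-ssm-agent` are assumptions (the unit is named differently on some images, for example snap-based installs), and this is not the script added in `sdcm/provision/aws/utils.py`.

```python
def enable_ssm_agent_script(unit: str = "amazon-ssm-agent") -> str:
    """Shell fragment that unmasks and starts the SSM agent (unit name is an assumption)."""
    lines = [
        "# Best effort: never fail the rest of cloud-init because of the agent.",
        f"systemctl unmask {unit} || true",
        f"systemctl enable --now {unit} || true",
    ]
    return "\n".join(lines) + "\n"


def build_user_data(*steps: str) -> str:
    """Concatenate shell fragments into one user-data script, in the given order."""
    return "#!/bin/bash\n" + "".join(steps)


# The agent step goes first so that it still runs if a later step breaks the boot.
user_data = build_user_data(enable_ssm_agent_script(), "echo rest-of-setup\n")
```

Putting the agent step at the top matches the intent stated in the description: if a later cloud-init step fails, the instance should still be reachable through SSM.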

fruch · 2025-10-23T10:31:50Z

@scylladb/qa-maintainers please take a high-level look,
I'm still thinking about how to test the fallback logic well enough

soyacz · 2025-10-23T11:58:37Z

We could wait for sct-agent and install it as the first thing.
Generally, the SSH issue is more of an AWS fault where the machine is not responding at all, so we won't have logs anyway.

fruch · 2025-10-23T12:14:12Z

> We could wait for sct-agent and install it as the first thing.
> Generally, the SSH issue is more of an AWS fault where the machine is not responding at all, so we won't have logs anyway.

We don't know that until we see logs.

One agent or another, this is an agent we already have in place, and we need to figure out this issue; it is popping up several times a week now.

We keep running into cases where SSH access to an instance fails and we are left with zero logs explaining why (this has happened multiple times over the years, caused by credentials issues or other cloud-init/boot issues).

In this change we make sure the SSM agents are working on our instances, and fall back to SSM during log collection if we can't get SSH access.

* added it to the region configuration to enable it
* added it to the top of the cloud-init to unmask the agent, see: scylladb/scylla-machine-image@b8e494d
* added `SSMCommandRunner`, which has a `run()` API like our SSH-based remoters
* `CommandLog` collection falls back to `SSMCommandRunner`

Ref: scylladb#11581

Update sdcm/utils/aws_region.py
Co-authored-by: Copilot

Update sdcm/provision/aws/utils.py
Co-authored-by: Copilot

Update sdcm/utils/aws_ssm_runner.py
Co-authored-by: Copilot

fruch · 2025-10-28T14:08:51Z

@scylladb/qa-maintainers it would be nice if we could get this reviewed and pushed; it would help with investigating the SSH connectivity issue we are facing across the board

Copilot

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Copilot · 2025-10-28T20:52:38Z

sdcm/logcollector.py

+            except (ImportError, AttributeError, TypeError, ValueError, KeyError, IndexError) as e:
+                LOGGER.error("Failed to run SSM command: %s", e)
+                return None, is_file_remote
+        return log_filename if ok else None, is_file_remote


The variable `ok` is only defined within the try-except blocks (lines 681 and 701), but it's referenced outside those blocks on line 709. If an exception occurs on lines 682-683 before `ok` is assigned, or if neither the SSH nor the SSM path executes, this will raise an `UnboundLocalError`. Initialize `ok = False` at the start of the method to ensure it always has a value.
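
A stripped-down sketch of the pattern this comment asks for, with hypothetical names (`collect`, `ssh_run`, `ssm_run`) and simplified exception types rather than the real `logcollector.py` method:

```python
def collect(ssh_run, ssm_run=None, log_filename="log.txt"):
    """Return log_filename on success and None otherwise."""
    ok = False  # defined up front so every path below can safely read it
    try:
        ok = ssh_run()
    except ConnectionError:
        # SSH failed before `ok` was reassigned; try SSM if it is available.
        if ssm_run is not None:
            try:
                ok = ssm_run()
            except ValueError:
                return None
    return log_filename if ok else None
```

Without the initial assignment, the case where SSH raises and no SSM runner is available would reach the final `return` with `ok` unbound and raise `UnboundLocalError`.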

soyacz

Assuming it works, LGTM.
As I said, this is enabled with cloud-init anyway, so we could think of using sct-agent for that in the future (this would then work for all cloud backends).

fruch · 2025-10-29T09:17:58Z

> Assuming it works, LGTM. As I said, this is enabled with cloud-init anyway, so we could think of using sct-agent for that in the future (this would then work for all cloud backends).

I guess that for every backend where we support our own agent, we'll use that first, and it won't be a fallback from SSH.

fruch requested review from a team and Copilot October 23, 2025 10:28

github-actions bot assigned fruch Oct 23, 2025

Copilot AI reviewed Oct 23, 2025

fruch added backport/perf-v17 backport/2025.4 labels Oct 23, 2025

fruch added the test-provision-aws Run provision test on AWS label Oct 23, 2025

fruch force-pushed the aws_logs_fallback_to_ssm branch 3 times, most recently from 3f1265f to 391f845 Compare October 26, 2025 20:57

fruch marked this pull request as ready for review October 27, 2025 06:08

fruch force-pushed the aws_logs_fallback_to_ssm branch from 391f845 to 2f34375 Compare October 27, 2025 21:38

fruch requested a review from Copilot October 28, 2025 20:48

Copilot AI reviewed Oct 28, 2025

soyacz reviewed Oct 29, 2025

soyacz approved these changes Oct 29, 2025

fruch merged commit 5aad56a into scylladb:master Oct 30, 2025
15 checks passed

scylladbbot added the promoted-to-master label Oct 30, 2025

This was referenced Oct 30, 2025

[Backport perf-v17] feature(aws): log fallback to ssm based access #12398

Merged

[Backport 2025.4] feature(aws): log fallback to ssm based access #12399

Merged

scylladbbot added backport/2025.4-done backport/perf-v17-done and removed backport/2025.4 backport/perf-v17 labels Oct 30, 2025
