Skip to content

Conversation

@fruch
Copy link
Contributor

@fruch fruch commented Oct 23, 2025

since we are running into some issue we are failing to have ssh access but zero logs cause of it (we have multiple time during the years, credentails is or other cloud-init/boot issues)

in this change we are gonna make sure ssm-agents are working on our instances, and fallback to log during log collection if we can't have ssh access

  • added it to the regio configuration to enable it
  • added it the top of the cloud-init to unmask the agent see: scylladb/scylla-machine-image@b8e494d
  • SSMCommandRunner which have run() api as with our ssh based remoters
  • CommandLog collection is falling back to use SSMCommandRunner

Ref: #11581

TODO

  • configure all regions

Testing

  • locally - tested SSM implementation via actual machines, and region configuration code
  • aws provision
  • locally hardcode the fallback - to validate it's working

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@fruch fruch requested review from a team and Copilot October 23, 2025 10:28
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds SSM (AWS Systems Manager) fallback support for log collection when SSH access fails on AWS instances. The implementation enables SSM agent on instances during provisioning and provides a boto3-based SSM command runner that can be used as an alternative to SSH-based remote execution.

Key changes:

  • Added SSM agent enablement in cloud-init configuration scripts
  • Implemented SSMCommandRunner class with boto3 for executing commands via SSM
  • Enhanced log collection to automatically fall back to SSM when SSH fails on AWS instances

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
sdcm/utils/aws_ssm_runner.py New module implementing SSM command execution via boto3
unit_tests/test_aws_ssm_runner.py Comprehensive unit tests for the SSM runner
sdcm/utils/aws_region.py Added SSM configuration method to region setup
sdcm/provision/aws/utils.py Added script to enable SSM agent on instances
sdcm/provision/aws/configuration_script.py Integrated SSM agent enablement into cloud-init
sdcm/logcollector.py Enhanced log collection with SSM fallback for AWS nodes

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@fruch
Copy link
Contributor Author

fruch commented Oct 23, 2025

@scylladb/qa-maintainers please give a high level look,
I'm still thinking on how to well enough test the fallback logic

@fruch fruch added the test-provision-aws Run provision test on AWS label Oct 23, 2025
@soyacz
Copy link
Contributor

soyacz commented Oct 23, 2025

We could wait for sct-agent and install it as the first thing.
Genreally, ssh issuue is more AWS fault and machine is not responding at all - so we wont have logs anyway.

@fruch
Copy link
Contributor Author

fruch commented Oct 23, 2025

We could wait for sct-agent and install it as the first thing.
Genreally, ssh issuue is more AWS fault and machine is not responding at all - so we wont have logs anyway.

We don't know that, until we see logs.

One agent or other, this is an agent we have in place, and we need to figure out this issue, it popping up several times a week now

@fruch fruch force-pushed the aws_logs_fallback_to_ssm branch 3 times, most recently from 3f1265f to 391f845 Compare October 26, 2025 20:57
@fruch fruch marked this pull request as ready for review October 27, 2025 06:08
since we are running into some issue we are failing to have ssh
access but zero logs cause of it (we have multiple time during the years,
credentails is or other cloud-init/boot issues)

in this change we are gonna make sure ssm-agents are working on our instances,
and fallback to log during log collection if we can't have ssh access

* added it to the regio configuration to enable it
* added it the top of the cloud-init to unmask the agent
  see: scylladb/scylla-machine-image@b8e494d
* `SSMCommandRunner` which have `run()` api as with our ssh based remoters
* `CommandLog` collection is falling back to use `SSMCommandRunner`

Ref: scylladb#11581

Update sdcm/utils/aws_region.py

Co-authored-by: Copilot <[email protected]>

Update sdcm/provision/aws/utils.py

Co-authored-by: Copilot <[email protected]>

Update sdcm/utils/aws_ssm_runner.py

Co-authored-by: Copilot <[email protected]>
@fruch fruch force-pushed the aws_logs_fallback_to_ssm branch from 391f845 to 2f34375 Compare October 27, 2025 21:38
@fruch
Copy link
Contributor Author

fruch commented Oct 28, 2025

@scylladb/qa-maintainers it would be nice if we can get this review and push, it would help for investigation of ssh connectivity issue we are face across the board

@fruch fruch requested a review from Copilot October 28, 2025 20:48
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

except (ImportError, AttributeError, TypeError, ValueError, KeyError, IndexError) as e:
LOGGER.error("Failed to run SSM command: %s", e)
return None, is_file_remote
return log_filename if ok else None, is_file_remote
Copy link

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable ok is only defined within the try-except blocks (lines 681 and 701), but it's referenced outside those blocks on line 709. If an exception occurs in line 682-683 before ok is assigned, or if neither SSH nor SSM paths execute, this will raise an UnboundLocalError. Initialize ok = False at the start of the method to ensure it always has a value.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@soyacz soyacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asssuming it works, LGTM
As I said, this is enabled with cloud-init anyway, so we could think of using sct-agent for that in future (this would work for all cloud backends then).

@fruch
Copy link
Contributor Author

fruch commented Oct 29, 2025

asssuming it works, LGTM As I said, this is enabled with cloud-init anyway, so we could think of using sct-agent for that in future (this would work for all cloud backends then).

I guess for all the backend we'll support our own agent, we'll use that first, and it won't be a fallback from ssh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants