Skip to content

Conversation

@dimakr
Copy link
Contributor

@dimakr dimakr commented Oct 20, 2025

Integrate sct-agent MVP into SCT, as an alternative to using SSH for command execution on DB nodes.

Refs: https://github.com/scylladb/qa-tasks/issues/1858

Testing

  • tested locally with a new test-cases/agent-test-aws.yaml test configuration, that was added to serve as a sanity test for sct-agent - it is essentially the pr-provision-test-like configuration, but with sct-agent enabled, hence the commands on DB nodes are executed via the agent, not SSH.

The status of the running agent on a DB node (the agent is registered as systemd service):

scyllaadm@agent-test-dmitriy-db-node-cfa6fbca-2:~$ sudo systemctl status sct-agent
● sct-agent.service - SCT Agent - Command execution agent for Scylla Cluster Tests
     Loaded: loaded (/etc/systemd/system/sct-agent.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-10-20 21:46:39 UTC; 16min ago
   Main PID: 4314 (sct-agent)
      Tasks: 10 (limit: 18771)
     Memory: 238.7M (peak: 506.2M)
        CPU: 12.609s
     CGroup: /system.slice/sct-agent.service
             ├─4314 /usr/local/bin/sct-agent --config /etc/sct-agent/config.yaml
             ├─6360 dirmngr --homedir /tmp --daemon
             └─6363 gpg-agent --homedir /tmp --use-standard-socket --daemon

Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[7197]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/grep ^SCYLLA_ARGS= /etc/default/scylla-server
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[7197]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[7197]: pam_unix(sudo:session): session closed for user root
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:58:31 sct-agent[6ba45b84]: Command completed successfully: exit_code=0 duration=8ms cmd=/bin/bash -c sudo grep "^SCYLLA_ARGS=" /etc/default/scylla-server
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:58:31] GET /api/v1/commands/6ba45b84-7c56-4f02-a30b-54a83322693f 200 41.095µs 46.243.158.145
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:58:31] POST /api/v1/commands 200 585.193µs 46.243.158.145
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:58:31 sct-agent[1d412446]: PWD=/ ; COMMAND=/bin/bash -c cat /etc/scylla.d/cpuset.conf
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:58:31 sct-agent[1d412446]: Command completed successfully: exit_code=0 duration=2ms cmd=/bin/bash -c cat /etc/scylla.d/cpuset.conf
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:58:31] GET /api/v1/commands/1d412446-cb56-4d01-90f1-b69e7d9034dc 200 35.189µs 46.243.158.145
Oct 20 21:58:32 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:58:32] GET /health 200 147.001µs 46.243.158.145

Some logs of the agent service:

scyllaadm@agent-test-dmitriy-db-node-cfa6fbca-2:~$ journalctl -u sct-agent | grep -i 'command completed' -A5 -B5
...
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[ac5e6753]: PWD=/ ; COMMAND=/bin/bash -c sudo systemctl stop scylla-server.service
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6333]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6333]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6333]: pam_unix(sudo:session): session closed for user root
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[ac5e6753]: Command completed successfully: exit_code=0 duration=16ms cmd=/bin/bash -c sudo systemctl stop scylla-server.service
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:52:23] GET /api/v1/commands/ac5e6753-1430-4890-90fb-f2a87404cb90 200 34.252µs 46.243.158.145
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:52:23] POST /api/v1/commands 200 557.768µs 46.243.158.145
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[56f7425a]: PWD=/ ; COMMAND=/bin/bash -c sudo rm -rf /var/lib/scylla/data/*
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6335]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/rm -rf /var/lib/scylla/data/*
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6335]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6335]: pam_unix(sudo:session): session closed for user root
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[56f7425a]: Command completed successfully: exit_code=0 duration=9ms cmd=/bin/bash -c sudo rm -rf /var/lib/scylla/data/*
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:52:23] GET /api/v1/commands/56f7425a-6169-47c0-83e4-3c19b6ddeb5f 200 33.039µs 46.243.158.145
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:52:23] POST /api/v1/commands 200 547.074µs 46.243.158.145
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[a20e695c]: PWD=/ ; COMMAND=/bin/bash -c sudo find /var/lib/scylla/commitlog -type f -delete
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6337]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/find /var/lib/scylla/commitlog -type f -delete
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6337]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6337]: pam_unix(sudo:session): session closed for user root
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[a20e695c]: Command completed successfully: exit_code=0 duration=10ms cmd=/bin/bash -c sudo find /var/lib/scylla/commitlog -type f -delete

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

Integrate sct-agent MVP into SCT, as an alternative to using SSH for command
execution on DB nodes.

Refs: scylladb/qa-tasks#1858
@dimakr dimakr self-assigned this Oct 20, 2025
@dimakr dimakr added the backport/none Backport is not required label Oct 20, 2025
@dimakr dimakr requested review from Copilot, fruch and soyacz October 20, 2025 22:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR integrates the sct-agent MVP as an alternative to SSH for command execution on DB nodes. The agent is a lightweight REST API service that runs on database nodes and handles command execution requests, providing better reliability and performance compared to SSH-based execution.

Key changes:

  • Added agent client and command runner implementation for REST API-based command execution
  • Implemented agent installation via cloud-init user data scripts with systemd service configuration
  • Added configuration options and security group rules to enable agent communication
  • Created test configuration demonstrating agent usage in AWS environment

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
test-cases/agent-test-aws.yaml New test configuration enabling sct-agent for sanity testing
sdcm/utils/sct_agent_installer.py Installer script generation utilities for deploying agent binary and configuration
sdcm/utils/aws_region.py Security group rule allowing agent REST API port (15000)
sdcm/utils/agent_client.py HTTP client for agent REST API with job execution and polling
sdcm/sct_provision/user_data_objects/sct_agent.py User data object for agent installation during instance provisioning
sdcm/sct_provision/region_definition_builder.py Registration of agent user data object in provisioning pipeline
sdcm/sct_provision/aws/user_data.py Added install_agent flag to AWS user data builder
sdcm/sct_config.py Configuration schema for agent settings
sdcm/remote/agent_cmd_runner.py CommandRunner implementation using agent API instead of SSH
sdcm/provision/common/configuration_script.py Integration of agent installation in post-boot configuration
sdcm/cluster.py Node initialization logic to use agent when enabled for DB nodes
docs/configuration_options.md Documentation of agent configuration options
defaults/test_default.yaml Default agent configuration values

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

DEFAULT_AGENT_SERVICE_PATH = "/etc/systemd/system/sct-agent.service"


def get_agent_config_yaml(api_keys: list[str], port: int = DEFAULT_AGENT_PORT, max_concurrent_jobs: int = 10) -> str:
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function docstring does not document the return format. Consider adding details about the YAML structure being returned, such as the top-level keys (server, security, executor) and their purpose.

Copilot uses AI. Check for mistakes.
Comment on lines +52 to +53
def get_agent_systemd_service(binary_path: str = DEFAULT_AGENT_BINARY_PATH,
config_path: str = DEFAULT_AGENT_CONFIG_PATH) -> str:
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function docstring does not document the return format. Consider adding details about the systemd unit file structure being returned.

Copilot uses AI. Check for mistakes.
config_yaml = get_agent_config_yaml(api_keys, port, max_concurrent_jobs).replace('$', '\\$')
service_content = get_agent_systemd_service(binary_path, config_path).replace('$', '\\$')

return f"""echo "Installing SCT Agent..."
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multi-line f-string for the bash script (lines 106-145) lacks proper indentation consistency and could be error-prone. Consider using dedent() like the other functions in this file for better maintainability.

Copilot uses AI. Check for mistakes.
class AgentClient:
"""HTTP client for SCT Agent API"""

def __init__(self, hostname: str, port: int = 15000, api_key: str = "default-api-key", timeout: int = 30):
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default api_key value 'default-api-key' is insecure and should not be used in production. Consider removing the default value to force explicit key specification, or document that this is only for testing.

Suggested change
def __init__(self, hostname: str, port: int = 15000, api_key: str = "default-api-key", timeout: int = 30):
def __init__(self, hostname: str, port: int = 15000, api_key: str, timeout: int = 30):

Copilot uses AI. Check for mistakes.
except requests.exceptions.Timeout as exc:
raise AgentTimeoutError(f"{operation_name.capitalize()} timed out after {self.timeout}s") from exc
except requests.exceptions.HTTPError as exc:
error_data = exc.response.json() if exc.response.content else {}
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling .json() on a response with content may raise JSONDecodeError if the content is not valid JSON. Wrap this in a try-except block to handle malformed responses gracefully.

Suggested change
error_data = exc.response.json() if exc.response.content else {}
if exc.response.content:
try:
error_data = exc.response.json()
except ValueError:
error_data = {}
else:
error_data = {}

Copilot uses AI. Check for mistakes.
encoded = base64.b64encode(tar_buffer.getvalue()).decode('ascii')

self.run(f"{sudo_prefix}mkdir -p {dst}", verbose=verbose, ignore_status=False)
self.run(f"echo '{encoded}' | (cd {dst} && {sudo_prefix}base64 -d | {sudo_prefix}tar xzf -)",
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Command injection vulnerability: the dst parameter is directly interpolated into a shell command without sanitization. If dst contains shell metacharacters, it could lead to arbitrary command execution. Consider using proper escaping or validation.

Copilot uses AI. Check for mistakes.

dst_path = os.path.join(dst, os.path.basename(src)) if dst.endswith('/') else dst
self.run(f"{sudo_prefix}mkdir -p $(dirname {dst_path})", verbose=verbose, ignore_status=False)
self.run(f"echo '{encoded}' | {sudo_prefix}base64 -d > {dst_path}", verbose=verbose, ignore_status=False)
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Command injection vulnerability: the dst_path parameter is directly interpolated into a shell command without sanitization. Consider using proper escaping or validation to prevent potential command injection.

Copilot uses AI. Check for mistakes.
:return: indication if operation was successful
"""
sudo_prefix = "sudo " if sudo else ""
result = self.run(f"{sudo_prefix}cat {src}", timeout=timeout, ignore_status=False, verbose=False)
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Command injection vulnerability: the src parameter is directly interpolated into a shell command without sanitization. Consider using proper escaping or validation.

Copilot uses AI. Check for mistakes.
def _init_remoter(self, ssh_login_info):
self.remoter = RemoteCmdRunnerBase.create_remoter(**ssh_login_info)
agent_config = self.parent_cluster.params.get('agent')
if agent_config['enabled'] and 'db' in self.parent_cluster.node_type:
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential KeyError if agent_config is None or doesn't contain 'enabled' key. While params.get() returns a default value, accessing agent_config['enabled'] assumes the key exists. Consider using agent_config.get('enabled', False) for safer access.

Suggested change
if agent_config['enabled'] and 'db' in self.parent_cluster.node_type:
if agent_config and agent_config.get('enabled', False) and 'db' in self.parent_cluster.node_type:

Copilot uses AI. Check for mistakes.

if self.install_agent and self.test_config:
agent_config = self.test_config.tester_obj().params.get('agent')
if agent_config.get('enabled'):
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential AttributeError or KeyError: if agent_config is None, calling .get('enabled') will fail. The code should handle the case where agent_config might be None before accessing its attributes.

Suggested change
if agent_config.get('enabled'):
if agent_config and agent_config.get('enabled'):

Copilot uses AI. Check for mistakes.
@dimakr dimakr marked this pull request as ready for review October 21, 2025 06:32
@dimakr dimakr changed the title feature(sct-agent): sc-agent MVP integration feature(sct-agent): sct-agent MVP integration Oct 21, 2025
agent:
enabled: false
port: 15000
api_key: "test-agent-secure-key-12345"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could avoid specifying api_key: sct should generate it in runtime, store it in file on remote host (locally too for reuse cluster feature).
Agent should get param where to look for api key and load it during startup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think generating credentials per test is good direction
  • maybe we can put it where the log are going - so it clear how we can retrieve it for debugging ?
  • we should consider if we gonna recommend this for manual usage - like ssh replacement, replacing hydra ssh / hydra cp ?

test_config=self.test_config,
install_docker=self.node_type == 'loader')
install_docker=self.node_type == 'loader',
install_agent=agent_config.get('enabled') and 'db' in self.node_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for azure we use different UserDataBuilder - need to add another class. See example: sdcm.sct_provision.user_data_objects.vector_dev.VectorDevUserDataObject and how it is used sdcm.sct_provision.region_definition_builder.DefinitionBuilder._get_user_data_objects

self.hostname = hostname
self.port = port
self.api_key = api_key
self.base_url = f"http://{hostname}:{port}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

due security reasons we must prevent from using agent with public IP's - until https is supported

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let consider using ssh tunnel for development until we have https

StandardError=journal
SyslogIdentifier=sct-agent
LimitNOFILE=65536
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice if we could pin sct-agent to specific cpu - not reserved for scylla process (with CPUAffinity). This info should be available in some scylla config file.
This way running commands would not affect performance (improtant for perf tests)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a bit tricky, you can figure it out upfront very easily.

I would suggest noting that we'll want ability to monitor usage of resources of the agent, and it we found it problematic, we can think of such solutions

also keep in mind the other non pinned by scylla cpus, are for handling networking IRQ, if we do too much on them, it a gonna cause other problem as well.

it's better that out cpu utilization would be minimal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we should keep cpu utilization minimal at first place.
Second place would be to pin it to specific cpu and make all commands executed also pinned to non-scylla cpu's.
This way impact on scylla would be minimal - dev's are trying hard to keep other services out of scylla's cpus as switching tasks degrades performance - that's why, with such agent I think we could try doing that.

@fruch
Copy link
Contributor

fruch commented Oct 21, 2025

lets give it a run on jenkins for couple of hours with nemesis, to flush out, that it's capable to handle wide set of the commands we are doing. (long running, retries, returning correctly)

@fruch
Copy link
Contributor

fruch commented Oct 21, 2025

@scylladbbot
Copy link

@dimakr new branch manager-3.7 was added, please add backport label if needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/none Backport is not required

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants