feature(sct-agent): sct-agent MVP integration #12254

dimakr · 2025-10-20T22:22:44Z

Integrate sct-agent MVP into SCT, as an alternative to using SSH for command execution on DB nodes.

Refs: https://github.com/scylladb/qa-tasks/issues/1858

Testing

Tested locally with a new test configuration, test-cases/agent-test-aws.yaml, which was added to serve as a sanity test for sct-agent. It is essentially a pr-provision-test-like configuration with sct-agent enabled, so commands on DB nodes are executed via the agent rather than over SSH.

The status of the running agent on a DB node (the agent is registered as a systemd service):

scyllaadm@agent-test-dmitriy-db-node-cfa6fbca-2:~$ sudo systemctl status sct-agent
● sct-agent.service - SCT Agent - Command execution agent for Scylla Cluster Tests
     Loaded: loaded (/etc/systemd/system/sct-agent.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-10-20 21:46:39 UTC; 16min ago
   Main PID: 4314 (sct-agent)
      Tasks: 10 (limit: 18771)
     Memory: 238.7M (peak: 506.2M)
        CPU: 12.609s
     CGroup: /system.slice/sct-agent.service
             ├─4314 /usr/local/bin/sct-agent --config /etc/sct-agent/config.yaml
             ├─6360 dirmngr --homedir /tmp --daemon
             └─6363 gpg-agent --homedir /tmp --use-standard-socket --daemon

Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[7197]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/grep ^SCYLLA_ARGS= /etc/default/scylla-server
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[7197]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[7197]: pam_unix(sudo:session): session closed for user root
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:58:31 sct-agent[6ba45b84]: Command completed successfully: exit_code=0 duration=8ms cmd=/bin/bash -c sudo grep "^SCYLLA_ARGS=" /etc/default/scylla-server
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:58:31] GET /api/v1/commands/6ba45b84-7c56-4f02-a30b-54a83322693f 200 41.095µs 46.243.158.145
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:58:31] POST /api/v1/commands 200 585.193µs 46.243.158.145
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:58:31 sct-agent[1d412446]: PWD=/ ; COMMAND=/bin/bash -c cat /etc/scylla.d/cpuset.conf
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:58:31 sct-agent[1d412446]: Command completed successfully: exit_code=0 duration=2ms cmd=/bin/bash -c cat /etc/scylla.d/cpuset.conf
Oct 20 21:58:31 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:58:31] GET /api/v1/commands/1d412446-cb56-4d01-90f1-b69e7d9034dc 200 35.189µs 46.243.158.145
Oct 20 21:58:32 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:58:32] GET /health 200 147.001µs 46.243.158.145

Some logs of the agent service:

scyllaadm@agent-test-dmitriy-db-node-cfa6fbca-2:~$ journalctl -u sct-agent | grep -i 'command completed' -A5 -B5
...
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[ac5e6753]: PWD=/ ; COMMAND=/bin/bash -c sudo systemctl stop scylla-server.service
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6333]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6333]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6333]: pam_unix(sudo:session): session closed for user root
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[ac5e6753]: Command completed successfully: exit_code=0 duration=16ms cmd=/bin/bash -c sudo systemctl stop scylla-server.service
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:52:23] GET /api/v1/commands/ac5e6753-1430-4890-90fb-f2a87404cb90 200 34.252µs 46.243.158.145
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:52:23] POST /api/v1/commands 200 557.768µs 46.243.158.145
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[56f7425a]: PWD=/ ; COMMAND=/bin/bash -c sudo rm -rf /var/lib/scylla/data/*
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6335]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/rm -rf /var/lib/scylla/data/*
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6335]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6335]: pam_unix(sudo:session): session closed for user root
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[56f7425a]: Command completed successfully: exit_code=0 duration=9ms cmd=/bin/bash -c sudo rm -rf /var/lib/scylla/data/*
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:52:23] GET /api/v1/commands/56f7425a-6169-47c0-83e4-3c19b6ddeb5f 200 33.039µs 46.243.158.145
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: [2025-10-20 21:52:23] POST /api/v1/commands 200 547.074µs 46.243.158.145
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[a20e695c]: PWD=/ ; COMMAND=/bin/bash -c sudo find /var/lib/scylla/commitlog -type f -delete
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6337]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/find /var/lib/scylla/commitlog -type f -delete
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6337]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sudo[6337]: pam_unix(sudo:session): session closed for user root
Oct 20 21:52:23 agent-test-dmitriy-db-node-cfa6fbca-2 sct-agent[4314]: 2025/10/20 21:52:23 sct-agent[a20e695c]: Command completed successfully: exit_code=0 duration=10ms cmd=/bin/bash -c sudo find /var/lib/scylla/commitlog -type f -delete
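The logs above show each command as a `POST /api/v1/commands` followed by a `GET /api/v1/commands/<id>`. A minimal sketch of that submit-and-poll flow follows. The payload and response field names (`command`, `id`, `status`) and the state names are assumptions, not taken from the agent's API, and the `transport` callable is a hypothetical stand-in for the real HTTP layer.

```python
import time
from typing import Callable, Optional

# Hypothetical transport: (method, path, json_body) -> decoded JSON dict.
Transport = Callable[[str, str, Optional[dict]], dict]


def run_via_agent(transport: Transport, command: str,
                  poll_interval: float = 0.5, timeout: float = 30.0) -> dict:
    """Submit a command, then poll its job until it reaches a final state."""
    job = transport("POST", "/api/v1/commands", {"command": command})
    job_id = job["id"]  # assumed response field
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = transport("GET", f"/api/v1/commands/{job_id}", None)
        # "completed" and "failed" are assumed names for the final states.
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

Passing the transport in keeps the polling logic testable without a running agent.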

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add new configuration options and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)

Copilot

Pull Request Overview

This PR integrates the sct-agent MVP as an alternative to SSH for command execution on DB nodes. The agent is a lightweight REST API service that runs on database nodes and handles command execution requests; it is intended to provide better reliability and performance than SSH-based execution.

Key changes:

Added agent client and command runner implementation for REST API-based command execution
Implemented agent installation via cloud-init user data scripts with systemd service configuration
Added configuration options and security group rules to enable agent communication
Created test configuration demonstrating agent usage in AWS environment

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 12 comments.

File	Description
test-cases/agent-test-aws.yaml	New test configuration enabling sct-agent for sanity testing
sdcm/utils/sct_agent_installer.py	Installer script generation utilities for deploying agent binary and configuration
sdcm/utils/aws_region.py	Security group rule allowing agent REST API port (15000)
sdcm/utils/agent_client.py	HTTP client for agent REST API with job execution and polling
sdcm/sct_provision/user_data_objects/sct_agent.py	User data object for agent installation during instance provisioning
sdcm/sct_provision/region_definition_builder.py	Registration of agent user data object in provisioning pipeline
sdcm/sct_provision/aws/user_data.py	Added install_agent flag to AWS user data builder
sdcm/sct_config.py	Configuration schema for agent settings
sdcm/remote/agent_cmd_runner.py	CommandRunner implementation using agent API instead of SSH
sdcm/provision/common/configuration_script.py	Integration of agent installation in post-boot configuration
sdcm/cluster.py	Node initialization logic to use agent when enabled for DB nodes
docs/configuration_options.md	Documentation of agent configuration options
defaults/test_default.yaml	Default agent configuration values

Copilot · 2025-10-20T22:25:25Z

sdcm/utils/sct_agent_installer.py

+DEFAULT_AGENT_SERVICE_PATH = "/etc/systemd/system/sct-agent.service"
+
+
+def get_agent_config_yaml(api_keys: list[str], port: int = DEFAULT_AGENT_PORT, max_concurrent_jobs: int = 10) -> str:


The function docstring does not document the return format. Consider adding details about the YAML structure being returned, such as the top-level keys (server, security, executor) and their purpose.
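One possible shape for such a docstring is sketched below. Only the three top-level keys (server, security, executor) come from the comment above; the nested field layout and the rendering code are assumptions made for illustration, not the PR's actual implementation.

```python
from textwrap import dedent

DEFAULT_AGENT_PORT = 15000  # matches the port opened by the PR's security group rule


def get_agent_config_yaml(api_keys: list[str], port: int = DEFAULT_AGENT_PORT,
                          max_concurrent_jobs: int = 10) -> str:
    """Render the agent configuration file content.

    :return: YAML text with three top-level sections:
        ``server``   - listening settings (port)
        ``security`` - API keys the agent accepts
        ``executor`` - job execution limits (max_concurrent_jobs)
    """
    # Assumed nesting; the real schema is defined by the agent itself.
    keys = "\n".join(f"    - {key}" for key in api_keys)
    return dedent("""\
        server:
          port: {port}
        security:
          api_keys:
        {keys}
        executor:
          max_concurrent_jobs: {max_jobs}
        """).format(port=port, keys=keys, max_jobs=max_concurrent_jobs)
```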

Copilot · 2025-10-20T22:25:25Z

sdcm/utils/sct_agent_installer.py

+def get_agent_systemd_service(binary_path: str = DEFAULT_AGENT_BINARY_PATH,
+                              config_path: str = DEFAULT_AGENT_CONFIG_PATH) -> str:


The function docstring does not document the return format. Consider adding details about the systemd unit file structure being returned.

Copilot · 2025-10-20T22:25:25Z

sdcm/utils/sct_agent_installer.py

+    config_yaml = get_agent_config_yaml(api_keys, port, max_concurrent_jobs).replace('$', '\\$')
+    service_content = get_agent_systemd_service(binary_path, config_path).replace('$', '\\$')
+
+    return f"""echo "Installing SCT Agent..."


The multi-line f-string for the bash script (lines 106-145) lacks proper indentation consistency and could be error-prone. Consider using dedent() like the other functions in this file for better maintainability.
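A minimal sketch of the dedent() approach, assuming a hypothetical helper name and a much shorter script than the one in the PR. It also uses a quoted heredoc (`<<'EOF'`), which stops bash from expanding `$` in the embedded content, as one alternative to escaping `$` by hand.

```python
from textwrap import dedent


def build_install_script(config_yaml: str, config_path: str) -> str:
    """Return a bash snippet that writes the agent config file (hypothetical helper)."""
    # dedent() runs on the template first, so multi-line values substituted
    # afterwards keep their own indentation untouched.
    script = dedent("""\
        echo "Installing SCT Agent..."
        mkdir -p "$(dirname {config_path})"
        cat > {config_path} <<'EOF'
        {config_yaml}
        EOF
        """)
    return script.format(config_path=config_path, config_yaml=config_yaml)
```

Formatting after dedenting is the key ordering: an f-string would interpolate the YAML before dedent() sees it, and the YAML's own indentation could then change what gets stripped.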

Copilot · 2025-10-20T22:25:26Z

sdcm/utils/agent_client.py

+class AgentClient:
+    """HTTP client for SCT Agent API"""
+
+    def __init__(self, hostname: str, port: int = 15000, api_key: str = "default-api-key", timeout: int = 30):


The default api_key value 'default-api-key' is insecure and should not be used in production. Consider removing the default value to force explicit key specification, or document that this is only for testing.

Suggested change

-    def __init__(self, hostname: str, port: int = 15000, api_key: str = "default-api-key", timeout: int = 30):
+    def __init__(self, hostname: str, port: int = 15000, api_key: str, timeout: int = 30):
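Note that Python rejects a parameter without a default placed after one with a default, so the suggested signature would not compile as written. One way to keep the argument order while still forcing an explicit key is a keyword-only parameter. This is a signature sketch only; the attribute assignments mirror the ones shown elsewhere in this PR.

```python
class AgentClient:
    """HTTP client for SCT Agent API (signature sketch only)."""

    def __init__(self, hostname: str, port: int = 15000, *, api_key: str, timeout: int = 30):
        # Everything after the bare `*` must be passed by keyword, so api_key
        # can be required even though it follows a defaulted parameter.
        self.hostname = hostname
        self.port = port
        self.api_key = api_key
        self.timeout = timeout
        self.base_url = f"http://{hostname}:{port}"
```

With this form, `AgentClient("10.0.0.5", api_key=key)` works, while omitting `api_key` raises `TypeError` at call time.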

Copilot · 2025-10-20T22:25:26Z

sdcm/utils/agent_client.py

+        except requests.exceptions.Timeout as exc:
+            raise AgentTimeoutError(f"{operation_name.capitalize()} timed out after {self.timeout}s") from exc
+        except requests.exceptions.HTTPError as exc:
+            error_data = exc.response.json() if exc.response.content else {}


Calling .json() on a response with content may raise JSONDecodeError if the content is not valid JSON. Wrap this in a try-except block to handle malformed responses gracefully.

Suggested change

-            error_data = exc.response.json() if exc.response.content else {}
+            if exc.response.content:
+                try:
+                    error_data = exc.response.json()
+                except ValueError:
+                    error_data = {}
+            else:
+                error_data = {}

Copilot · 2025-10-20T22:25:27Z

sdcm/remote/agent_cmd_runner.py

+            encoded = base64.b64encode(tar_buffer.getvalue()).decode('ascii')
+
+            self.run(f"{sudo_prefix}mkdir -p {dst}", verbose=verbose, ignore_status=False)
+            self.run(f"echo '{encoded}' | (cd {dst} && {sudo_prefix}base64 -d | {sudo_prefix}tar xzf -)",


Command injection vulnerability: the dst parameter is directly interpolated into a shell command without sanitization. If dst contains shell metacharacters, it could lead to arbitrary command execution. Consider using proper escaping or validation.
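A minimal sketch of the escaping option, using the standard library's `shlex.quote`. The helper name is hypothetical; it only builds the two command strings from the snippet above so the quoting can be inspected on its own.

```python
import shlex


def build_extract_commands(dst: str, encoded: str, sudo: bool = False) -> list[str]:
    """Build the mkdir and extract commands with every interpolated value quoted."""
    sudo_prefix = "sudo " if sudo else ""
    # shlex.quote wraps the value in single quotes when it contains characters
    # the shell would otherwise interpret (spaces, ';', '$', ...).
    safe_dst = shlex.quote(dst)
    return [
        f"{sudo_prefix}mkdir -p {safe_dst}",
        f"echo {shlex.quote(encoded)} | (cd {safe_dst} && {sudo_prefix}base64 -d | {sudo_prefix}tar xzf -)",
    ]
```

For example, a destination of `/tmp/x; rm -rf /` becomes the single argument `'/tmp/x; rm -rf /'` instead of two commands.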

Copilot · 2025-10-20T22:25:27Z

sdcm/remote/agent_cmd_runner.py

+
+            dst_path = os.path.join(dst, os.path.basename(src)) if dst.endswith('/') else dst
+            self.run(f"{sudo_prefix}mkdir -p $(dirname {dst_path})", verbose=verbose, ignore_status=False)
+            self.run(f"echo '{encoded}' | {sudo_prefix}base64 -d > {dst_path}", verbose=verbose, ignore_status=False)


Command injection vulnerability: the dst_path parameter is directly interpolated into a shell command without sanitization. Consider using proper escaping or validation to prevent potential command injection.
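For this case the parent directory can also be computed on the Python side rather than through `$(dirname ...)` in the remote shell, which removes one place where the path is re-parsed by bash. A sketch under that assumption, with a hypothetical helper name:

```python
import posixpath
import shlex


def build_upload_commands(src: str, dst: str, encoded: str, sudo: bool = False) -> list[str]:
    """Build the mkdir and write commands for a single-file upload."""
    sudo_prefix = "sudo " if sudo else ""
    # posixpath is used because the remote side is a Linux node,
    # regardless of the OS the test runner itself runs on.
    dst_path = posixpath.join(dst, posixpath.basename(src)) if dst.endswith("/") else dst
    parent = posixpath.dirname(dst_path) or "."
    return [
        f"{sudo_prefix}mkdir -p {shlex.quote(parent)}",
        f"echo {shlex.quote(encoded)} | {sudo_prefix}base64 -d > {shlex.quote(dst_path)}",
    ]
```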

Copilot · 2025-10-20T22:25:27Z

sdcm/remote/agent_cmd_runner.py

+        :return: indication if operation was successful
+        """
+        sudo_prefix = "sudo " if sudo else ""
+        result = self.run(f"{sudo_prefix}cat {src}", timeout=timeout, ignore_status=False, verbose=False)


Command injection vulnerability: the src parameter is directly interpolated into a shell command without sanitization. Consider using proper escaping or validation.

Copilot · 2025-10-20T22:25:28Z

sdcm/cluster.py

    def _init_remoter(self, ssh_login_info):
-        self.remoter = RemoteCmdRunnerBase.create_remoter(**ssh_login_info)
+        agent_config = self.parent_cluster.params.get('agent')
+        if agent_config['enabled'] and 'db' in self.parent_cluster.node_type:


Potential TypeError if agent_config is None, or KeyError if it doesn't contain the 'enabled' key. Even though params.get() returns a value, indexing agent_config['enabled'] assumes the key exists. Consider using agent_config.get('enabled', False) for safer access.

Suggested change

-        if agent_config['enabled'] and 'db' in self.parent_cluster.node_type:
+        if agent_config and agent_config.get('enabled', False) and 'db' in self.parent_cluster.node_type:

Copilot · 2025-10-20T22:25:28Z

sdcm/provision/common/configuration_script.py


+        if self.install_agent and self.test_config:
+            agent_config = self.test_config.tester_obj().params.get('agent')
+            if agent_config.get('enabled'):


Potential AttributeError: if agent_config is None, calling .get('enabled') will fail. The code should handle the case where agent_config might be None before calling methods on it.

Suggested change

-            if agent_config.get('enabled'):
+            if agent_config and agent_config.get('enabled'):
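Since the same guard appears in both sdcm/cluster.py and the configuration script, it could live in one small helper. The function below is a hypothetical sketch, not part of the PR:

```python
from typing import Any, Mapping, Optional


def is_agent_enabled(agent_config: Optional[Mapping[str, Any]], node_type: str) -> bool:
    """Return True only when the agent is enabled and the node is a DB node."""
    # A missing or empty config, or a missing 'enabled' key, all mean "disabled".
    return bool(agent_config) and bool(agent_config.get("enabled", False)) and "db" in node_type
```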

soyacz · 2025-10-21T07:34:13Z

defaults/test_default.yaml

+agent:
+  enabled: false
+  port: 15000
+  api_key: "test-agent-secure-key-12345"


I think we could avoid specifying api_key: SCT should generate it at runtime and store it in a file on the remote host (and locally too, for the reuse cluster feature).
The agent should get a parameter telling it where to look for the API key, and load it during startup.

I think generating credentials per test is a good direction.

Maybe we can put it where the logs are going, so it's clear how we can retrieve it for debugging?

We should also consider whether we are going to recommend this for manual usage, e.g. as an SSH replacement for hydra ssh / hydra cp.
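A minimal sketch of per-test key generation along these lines, using the standard `secrets` module. The function name, the file location, and the reuse behaviour are assumptions for illustration; where the key is actually stored is still an open question in this thread.

```python
import os
import secrets
from pathlib import Path


def generate_api_key(key_file: Path, reuse: bool = True) -> str:
    """Return the API key stored in key_file, generating a new one if needed."""
    if reuse and key_file.exists():
        # Reuse-cluster case: keep talking to agents provisioned earlier.
        return key_file.read_text().strip()
    key = secrets.token_urlsafe(32)
    key_file.parent.mkdir(parents=True, exist_ok=True)
    # A newly created file gets owner-only permissions (0o600, subject to umask).
    fd = os.open(key_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(key + "\n")
    return key
```

The same key would then need to reach the DB node, for example through the generated agent config file.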

soyacz · 2025-10-21T07:37:59Z

sdcm/cluster.py

                                                  test_config=self.test_config,
-                                                  install_docker=self.node_type == 'loader')
+                                                  install_docker=self.node_type == 'loader',
+                                                  install_agent=agent_config.get('enabled') and 'db' in self.node_type)


For Azure we use a different UserDataBuilder, so another class needs to be added. See this example: sdcm.sct_provision.user_data_objects.vector_dev.VectorDevUserDataObject, and how it is used in sdcm.sct_provision.region_definition_builder.DefinitionBuilder._get_user_data_objects

soyacz · 2025-10-21T07:42:08Z

sdcm/utils/agent_client.py

+        self.hostname = hostname
+        self.port = port
+        self.api_key = api_key
+        self.base_url = f"http://{hostname}:{port}"


For security reasons we must prevent the agent from being used over public IPs until HTTPS is supported.

Let's consider using an SSH tunnel for development until we have HTTPS.
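One way such a guard could look, using the standard `ipaddress` module. The helper is hypothetical; it accepts only literal IP addresses, so a hostname is rejected rather than resolved.

```python
import ipaddress


def ensure_private_address(address: str) -> str:
    """Return address unchanged if it is a private IP, otherwise raise ValueError."""
    # ip_address() itself raises ValueError for anything that is not an IP literal.
    ip = ipaddress.ip_address(address)
    # is_private also covers loopback and link-local ranges, not only RFC 1918.
    if not ip.is_private:
        raise ValueError(f"refusing to use the agent over plain HTTP on public address {address}")
    return address
```

The client constructor could call this on `hostname` before building `base_url`.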

soyacz · 2025-10-21T07:52:36Z

sdcm/utils/sct_agent_installer.py

+        StandardError=journal
+        SyslogIdentifier=sct-agent
+        LimitNOFILE=65536
+


It would be nice if we could pin sct-agent to specific CPUs that are not reserved for the scylla process (with CPUAffinity). This info should be available in some scylla config file.
This way running commands would not affect performance (important for perf tests).

that's a bit tricky, you can figure it out upfront very easily.

I would suggest noting that we'll want the ability to monitor the agent's resource usage, and if we find it problematic, we can think of such solutions.

Also keep in mind that the other CPUs, the ones not pinned by scylla, are there for handling networking IRQs; if we do too much on them, it's going to cause other problems as well.

It's better that our CPU utilization be minimal.

Yes, we should keep CPU utilization minimal in the first place.
The second step would be to pin it to specific CPUs and make all executed commands also pinned to non-scylla CPUs.
This way the impact on scylla would be minimal. Devs are trying hard to keep other services off scylla's CPUs, as task switching degrades performance; that's why, with such an agent, I think we could try doing that.
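A sketch of how the pinning could be derived, assuming /etc/scylla.d/cpuset.conf contains a line of the form `CPUSET="--cpuset 1-7 "` (an assumption about the file format, to be verified on a real node). The helper names are hypothetical; `CPUAffinity=` is the systemd unit directive mentioned above.

```python
import re


def parse_cpu_list(spec: str) -> set[int]:
    """Expand a CPU list such as '1-3,5' into {1, 2, 3, 5}."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        cpus.update(range(int(start), int(end or start) + 1))
    return cpus


def agent_cpu_affinity(cpuset_conf: str, total_cpus: int) -> str:
    """Return a CPUAffinity= line covering the CPUs not listed in cpuset.conf."""
    match = re.search(r"--cpuset\s+([\d,\-]+)", cpuset_conf)
    scylla_cpus = parse_cpu_list(match.group(1)) if match else set()
    free = sorted(set(range(total_cpus)) - scylla_cpus)
    # An empty string means there is nothing to pin to, so no directive is emitted.
    return f"CPUAffinity={' '.join(map(str, free))}" if free else ""
```

As noted in the thread, the remaining CPUs also handle networking IRQs, so pinning complements keeping the agent's CPU usage low rather than replacing it.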

fruch · 2025-10-21T08:22:38Z

Let's give it a run on Jenkins for a couple of hours with nemesis, to flush out whether it's capable of handling the wide set of commands we run (long-running, retries, returning correctly).

fruch · 2025-10-21T14:47:36Z

@dimakr

Shouldn't that default come from the configuration file?
https://github.com/dimakr/sct-agent/blob/7096ef7827ae069863b133c93a515cb4b95260e5/internal/executor/executor.go#L38

scylladbbot · 2025-10-23T13:35:01Z

@dimakr new branch manager-3.7 was added, please add backport label if needed

feature(sct-agent): sc-agent MVP integration

c5e4616

Integrate sct-agent MVP into SCT, as an alternative to using SSH for command execution on DB nodes. Refs: scylladb/qa-tasks#1858

dimakr self-assigned this Oct 20, 2025

dimakr added the backport/none Backport is not required label Oct 20, 2025

dimakr requested review from Copilot, fruch and soyacz October 20, 2025 22:23

Copilot AI reviewed Oct 20, 2025

dimakr marked this pull request as ready for review October 21, 2025 06:32

dimakr changed the title ~~feature(sct-agent): sc-agent MVP integration~~ feature(sct-agent): sct-agent MVP integration Oct 21, 2025

soyacz reviewed Oct 21, 2025
