-
Notifications
You must be signed in to change notification settings - Fork 110
feature(sct-agent): sct-agent MVP integration #12254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Integrate sct-agent MVP into SCT, as an alternative to using SSH for command execution on DB nodes. Refs: scylladb/qa-tasks#1858
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR integrates the sct-agent MVP as an alternative to SSH for command execution on DB nodes. The agent is a lightweight REST API service that runs on database nodes and handles command execution requests, providing better reliability and performance compared to SSH-based execution.
Key changes:
- Added agent client and command runner implementation for REST API-based command execution
- Implemented agent installation via cloud-init user data scripts with systemd service configuration
- Added configuration options and security group rules to enable agent communication
- Created test configuration demonstrating agent usage in AWS environment
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| test-cases/agent-test-aws.yaml | New test configuration enabling sct-agent for sanity testing |
| sdcm/utils/sct_agent_installer.py | Installer script generation utilities for deploying agent binary and configuration |
| sdcm/utils/aws_region.py | Security group rule allowing agent REST API port (15000) |
| sdcm/utils/agent_client.py | HTTP client for agent REST API with job execution and polling |
| sdcm/sct_provision/user_data_objects/sct_agent.py | User data object for agent installation during instance provisioning |
| sdcm/sct_provision/region_definition_builder.py | Registration of agent user data object in provisioning pipeline |
| sdcm/sct_provision/aws/user_data.py | Added install_agent flag to AWS user data builder |
| sdcm/sct_config.py | Configuration schema for agent settings |
| sdcm/remote/agent_cmd_runner.py | CommandRunner implementation using agent API instead of SSH |
| sdcm/provision/common/configuration_script.py | Integration of agent installation in post-boot configuration |
| sdcm/cluster.py | Node initialization logic to use agent when enabled for DB nodes |
| docs/configuration_options.md | Documentation of agent configuration options |
| defaults/test_default.yaml | Default agent configuration values |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| DEFAULT_AGENT_SERVICE_PATH = "/etc/systemd/system/sct-agent.service" | ||
|
|
||
|
|
||
| def get_agent_config_yaml(api_keys: list[str], port: int = DEFAULT_AGENT_PORT, max_concurrent_jobs: int = 10) -> str: |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function docstring does not document the return format. Consider adding details about the YAML structure being returned, such as the top-level keys (server, security, executor) and their purpose.
| def get_agent_systemd_service(binary_path: str = DEFAULT_AGENT_BINARY_PATH, | ||
| config_path: str = DEFAULT_AGENT_CONFIG_PATH) -> str: |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function docstring does not document the return format. Consider adding details about the systemd unit file structure being returned.
| config_yaml = get_agent_config_yaml(api_keys, port, max_concurrent_jobs).replace('$', '\\$') | ||
| service_content = get_agent_systemd_service(binary_path, config_path).replace('$', '\\$') | ||
|
|
||
| return f"""echo "Installing SCT Agent..." |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The multi-line f-string for the bash script (lines 106-145) lacks proper indentation consistency and could be error-prone. Consider using dedent() like the other functions in this file for better maintainability.
| class AgentClient: | ||
| """HTTP client for SCT Agent API""" | ||
|
|
||
| def __init__(self, hostname: str, port: int = 15000, api_key: str = "default-api-key", timeout: int = 30): |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default api_key value 'default-api-key' is insecure and should not be used in production. Consider removing the default value to force explicit key specification, or document that this is only for testing.
| def __init__(self, hostname: str, port: int = 15000, api_key: str = "default-api-key", timeout: int = 30): | |
| def __init__(self, hostname: str, port: int = 15000, api_key: str, timeout: int = 30): |
| except requests.exceptions.Timeout as exc: | ||
| raise AgentTimeoutError(f"{operation_name.capitalize()} timed out after {self.timeout}s") from exc | ||
| except requests.exceptions.HTTPError as exc: | ||
| error_data = exc.response.json() if exc.response.content else {} |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling .json() on a response with content may raise JSONDecodeError if the content is not valid JSON. Wrap this in a try-except block to handle malformed responses gracefully.
| error_data = exc.response.json() if exc.response.content else {} | |
| if exc.response.content: | |
| try: | |
| error_data = exc.response.json() | |
| except ValueError: | |
| error_data = {} | |
| else: | |
| error_data = {} |
| encoded = base64.b64encode(tar_buffer.getvalue()).decode('ascii') | ||
|
|
||
| self.run(f"{sudo_prefix}mkdir -p {dst}", verbose=verbose, ignore_status=False) | ||
| self.run(f"echo '{encoded}' | (cd {dst} && {sudo_prefix}base64 -d | {sudo_prefix}tar xzf -)", |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Command injection vulnerability: the dst parameter is directly interpolated into a shell command without sanitization. If dst contains shell metacharacters, it could lead to arbitrary command execution. Consider using proper escaping or validation.
|
|
||
| dst_path = os.path.join(dst, os.path.basename(src)) if dst.endswith('/') else dst | ||
| self.run(f"{sudo_prefix}mkdir -p $(dirname {dst_path})", verbose=verbose, ignore_status=False) | ||
| self.run(f"echo '{encoded}' | {sudo_prefix}base64 -d > {dst_path}", verbose=verbose, ignore_status=False) |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Command injection vulnerability: the dst_path parameter is directly interpolated into a shell command without sanitization. Consider using proper escaping or validation to prevent potential command injection.
| :return: indication if operation was successful | ||
| """ | ||
| sudo_prefix = "sudo " if sudo else "" | ||
| result = self.run(f"{sudo_prefix}cat {src}", timeout=timeout, ignore_status=False, verbose=False) |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Command injection vulnerability: the src parameter is directly interpolated into a shell command without sanitization. Consider using proper escaping or validation.
| def _init_remoter(self, ssh_login_info): | ||
| self.remoter = RemoteCmdRunnerBase.create_remoter(**ssh_login_info) | ||
| agent_config = self.parent_cluster.params.get('agent') | ||
| if agent_config['enabled'] and 'db' in self.parent_cluster.node_type: |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential KeyError if agent_config is None or doesn't contain 'enabled' key. While params.get() returns a default value, accessing agent_config['enabled'] assumes the key exists. Consider using agent_config.get('enabled', False) for safer access.
| if agent_config['enabled'] and 'db' in self.parent_cluster.node_type: | |
| if agent_config and agent_config.get('enabled', False) and 'db' in self.parent_cluster.node_type: |
|
|
||
| if self.install_agent and self.test_config: | ||
| agent_config = self.test_config.tester_obj().params.get('agent') | ||
| if agent_config.get('enabled'): |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential AttributeError or KeyError: if agent_config is None, calling .get('enabled') will fail. The code should handle the case where agent_config might be None before accessing its attributes.
| if agent_config.get('enabled'): | |
| if agent_config and agent_config.get('enabled'): |
| agent: | ||
| enabled: false | ||
| port: 15000 | ||
| api_key: "test-agent-secure-key-12345" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could avoid specifying api_key: sct should generate it in runtime, store it in file on remote host (locally too for reuse cluster feature).
Agent should get param where to look for api key and load it during startup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think generating credentials per test is good direction
- maybe we can put it where the log are going - so it clear how we can retrieve it for debugging ?
- we should consider if we gonna recommend this for manual usage - like ssh replacement, replacing hydra ssh / hydra cp ?
| test_config=self.test_config, | ||
| install_docker=self.node_type == 'loader') | ||
| install_docker=self.node_type == 'loader', | ||
| install_agent=agent_config.get('enabled') and 'db' in self.node_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for azure we use different UserDataBuilder - need to add another class. See example: sdcm.sct_provision.user_data_objects.vector_dev.VectorDevUserDataObject and how it is used sdcm.sct_provision.region_definition_builder.DefinitionBuilder._get_user_data_objects
| self.hostname = hostname | ||
| self.port = port | ||
| self.api_key = api_key | ||
| self.base_url = f"http://{hostname}:{port}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
due security reasons we must prevent from using agent with public IP's - until https is supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let consider using ssh tunnel for development until we have https
| StandardError=journal | ||
| SyslogIdentifier=sct-agent | ||
| LimitNOFILE=65536 | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice if we could pin sct-agent to specific cpu - not reserved for scylla process (with CPUAffinity). This info should be available in some scylla config file.
This way running commands would not affect performance (improtant for perf tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a bit tricky, you can figure it out upfront very easily.
I would suggest noting that we'll want ability to monitor usage of resources of the agent, and it we found it problematic, we can think of such solutions
also keep in mind the other non pinned by scylla cpus, are for handling networking IRQ, if we do too much on them, it a gonna cause other problem as well.
it's better that out cpu utilization would be minimal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we should keep cpu utilization minimal at first place.
Second place would be to pin it to specific cpu and make all commands executed also pinned to non-scylla cpu's.
This way impact on scylla would be minimal - dev's are trying hard to keep other services out of scylla's cpus as switching tasks degrades performance - that's why, with such agent I think we could try doing that.
|
lets give it a run on jenkins for couple of hours with nemesis, to flush out, that it's capable to handle wide set of the commands we are doing. (long running, retries, returning correctly) |
|
isn't that default should come from the configuration file ? |
|
@dimakr new branch |
Integrate sct-agent MVP into SCT, as an alternative to using SSH for command execution on DB nodes.
Refs: https://github.com/scylladb/qa-tasks/issues/1858
Testing
test-cases/agent-test-aws.yamltest configuration, that was added to serve as a sanity test for sct-agent - it is essentially thepr-provision-test-like configuration, but with sct-agent enabled, hence the commands on DB nodes are executed via the agent, not SSH.The status of the running agent on a DB node (the agent is registered as systemd service):
Some logs of the agent service:
PR pre-checks (self review)
backportlabelsReminders
sdcm/sct_config.py)unit-test/folder)