Conversation

@hhoikoo
Member

@hhoikoo hhoikoo commented Oct 22, 2025

resolves #6314 (BA-2753)

This PR adds support for spawning multiple agents within a single agent server. It adds an agent_id field to every relevant agent RPC call and makes the manager send that field, so the agent server can route each call to the correct agent.
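
A minimal sketch of the routing idea described above, assuming the server keeps its agents in a mapping keyed by agent id. `AgentRouter` and `FakeAgent` are hypothetical names for illustration, and the exception here is a stand-in (the one in this PR extends BackendAIError); none of this is the actual server code.

```python
from dataclasses import dataclass, field


class AgentIdNotFoundError(LookupError):
    """Stand-in for the PR's exception of the same name."""


@dataclass
class FakeAgent:
    """Hypothetical stand-in for one agent living inside the agent server."""

    agent_id: str
    destroyed: list[str] = field(default_factory=list)

    def destroy_kernel(self, kernel_id: str) -> str:
        self.destroyed.append(kernel_id)
        return f"{self.agent_id} destroyed {kernel_id}"


class AgentRouter:
    """Hypothetical router: one server process, several agents keyed by id."""

    def __init__(self, agents: list[FakeAgent]) -> None:
        self._agents = {agent.agent_id: agent for agent in agents}

    def destroy_kernel(self, agent_id: str, kernel_id: str) -> str:
        # The caller (the manager) sends agent_id; the server uses it to pick the target.
        try:
            agent = self._agents[agent_id]
        except KeyError:
            raise AgentIdNotFoundError(f"Agent '{agent_id}' not found") from None
        return agent.destroy_kernel(kernel_id)


router = AgentRouter([FakeAgent("agent-1"), FakeAgent("agent-2")])
result = router.destroy_kernel("agent-2", "kernel-a")
```

Only the agent named in the call is touched; an unknown id fails loudly instead of falling through to some other agent.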

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

@hhoikoo hhoikoo self-assigned this Oct 22, 2025
@github-actions github-actions bot added the size:XL (500~ LoC), comp:manager (Related to Manager component), and comp:agent (Related to Agent component) labels Oct 22, 2025
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 837e919 to 873b8b8 Compare October 22, 2025 02:03
@hhoikoo hhoikoo changed the title from "feat(BA-2750): Add support for array in config generator" to "feat(BA-2753): Spawn multiple agents and route RPC appropriately" Oct 22, 2025
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 873b8b8 to af5155a Compare October 22, 2025 02:51
@hhoikoo hhoikoo requested a review from Copilot October 22, 2025 04:02
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements support for spawning multiple agent instances within a single agent server process and routing RPC calls to the correct agent using an agent_id field. The changes enable a multi-agent runtime architecture where different agents can be configured with distinct resource allocations, port ranges, and configurations.

Key changes:

  • Added agent_id parameter to all agent RPC methods to enable proper routing
  • Modified AgentRPCServer to manage multiple agent instances via a mapping
  • Introduced AggregateKernelRegistry to provide unified kernel access across agents
  • Updated manager components to pass agent_id when making agent RPC calls
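
To illustrate the "unified kernel access across agents" point, here is a minimal sketch of an aggregate read-only view over per-agent registries. It assumes kernel ids are unique across agents and uses plain strings in place of kernel objects; `AggregateRegistry` is a hypothetical name and not the PR's AggregateKernelRegistry implementation.

```python
from collections.abc import Iterator, Mapping


class AggregateRegistry(Mapping[str, str]):
    """Hypothetical read-only view over several per-agent kernel registries."""

    def __init__(self, registries: Mapping[str, Mapping[str, str]]) -> None:
        # agent_id -> {kernel_id: kernel}; held by reference so the view stays live.
        self._registries = registries

    def __getitem__(self, kernel_id: str) -> str:
        for registry in self._registries.values():
            if kernel_id in registry:
                return registry[kernel_id]
        raise KeyError(kernel_id)

    def __iter__(self) -> Iterator[str]:
        for registry in self._registries.values():
            yield from registry

    def __len__(self) -> int:
        return sum(len(registry) for registry in self._registries.values())


per_agent = {
    "agent-1": {"k1": "kernel-on-agent-1"},
    "agent-2": {"k2": "kernel-on-agent-2"},
}
view = AggregateRegistry(per_agent)
```

Because the view holds references rather than copies, kernels added to any per-agent registry later are visible through it without extra bookkeeping.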

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

File | Description
tests/manager/sokovan/scheduler/test_terminate_sessions.py | Added agent_id parameter to destroy_kernel test calls
tests/agent/test_config_server.py | New test file for multi-agent configuration and RPC routing validation
src/ai/backend/manager/sokovan/scheduler/scheduler.py | Updated all agent RPC calls to include agent_id parameter
src/ai/backend/manager/sokovan/scheduler/hooks/base.py | Added agent_id to network destruction RPC calls
src/ai/backend/manager/registry.py | Updated agent RPC client calls to pass agent_id throughout
src/ai/backend/manager/clients/agent/client.py | Modified all client methods to accept and forward agent_id
src/ai/backend/agent/server.py | Refactored to support multiple agents with routing logic and aggregate registry
src/ai/backend/agent/docker/agent.py | Made metadata_server optional and removed per-agent initialization
changes/6320.feature.md | Changelog entry for the feature

@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch 3 times, most recently from 24baee8 to b5f4721 Compare October 23, 2025 07:48
@hhoikoo hhoikoo requested a review from Copilot October 23, 2025 07:49
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.


@hhoikoo hhoikoo force-pushed the feat/BA-2752/multiple-agents-config branch from 82933d0 to 2b706f0 Compare October 23, 2025 10:35
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch 4 times, most recently from b1e3834 to 28d181d Compare October 24, 2025 05:00
@hhoikoo hhoikoo force-pushed the feat/BA-2752/multiple-agents-config branch from 2b706f0 to a4dfa0b Compare October 27, 2025 01:38
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch 3 times, most recently from 5cb52c5 to 9f12687 Compare October 27, 2025 08:28
@hhoikoo hhoikoo force-pushed the feat/BA-2752/multiple-agents-config branch from a4dfa0b to f720f42 Compare November 3, 2025 00:58
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 9f12687 to fdee4b0 Compare November 3, 2025 01:05
@hhoikoo hhoikoo force-pushed the feat/BA-2752/multiple-agents-config branch from f720f42 to daec211 Compare November 4, 2025 06:05
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from fdee4b0 to 90f0702 Compare November 4, 2025 06:10
@hhoikoo hhoikoo force-pushed the feat/BA-2752/multiple-agents-config branch from daec211 to 515dea3 Compare November 4, 2025 06:31
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 90f0702 to e2b1902 Compare November 4, 2025 06:35
Comment on lines 26 to 34
class AgentIdNotFoundError(BackendAIError):
    @classmethod
    def error_code(cls) -> ErrorCode:
        return ErrorCode(
            domain=ErrorDomain.AGENT,
            operation=ErrorOperation.ACCESS,
            error_detail=ErrorDetail.NOT_FOUND,
        )
Collaborator

Please write the exception in a separate file.

Member Author

Done. I followed the style of manager/ module

Comment on lines 91 to 95
if agent_id is None:
    agent_id = self._default_agent_id
Collaborator

If we store _default_agent instead of _default_agent_id, I think we can return early and not have to worry about any impact on the code that follows.

Member Author

Good point - done
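
For readers following the thread, a minimal sketch of the early-return shape suggested above, assuming the default agent is resolved once at construction. `Runtime` here is a hypothetical stand-in with strings in place of agent objects, and the exception is a local stand-in; the merged code may differ.

```python
from typing import Optional


class AgentIdNotFoundError(LookupError):
    """Stand-in for the PR's exception of the same name."""


class Runtime:
    """Hypothetical holder of several agents, with one designated default."""

    def __init__(self, agents: dict[str, str], default_agent_id: str) -> None:
        self._agents = agents
        # Resolve the default agent once, so lookups never substitute an id later.
        self._default_agent = agents[default_agent_id]

    def get_agent(self, agent_id: Optional[str] = None) -> str:
        if agent_id is None:
            # Early return: nothing below ever sees a None id.
            return self._default_agent
        if agent_id not in self._agents:
            raise AgentIdNotFoundError(f"Agent '{agent_id}' not found in this runtime.")
        return self._agents[agent_id]


runtime = Runtime({"agent-1": "first agent", "agent-2": "second agent"}, "agent-1")
```

With this shape the None case is handled in one place, and the remaining code only deals with explicit ids.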

Comment on lines 100 to 111
def get_etcd(self, agent_id: Optional[AgentId]) -> AgentEtcdClientView:
    if agent_id is None:
        agent_id = self._default_agent_id
    if agent_id not in self.agents:
        raise AgentIdNotFoundError(
            f"Agent '{agent_id}' not found in this runtime. "
            f"Available agents: {', '.join(self.agents.keys())}"
        )
    return self.etcd_views[agent_id]
Collaborator

Instead of repeating the same logic, we could call get_agent(agent_id).id() and use the returned id value.

Member Author

There was actually a bug here: the dictionary being checked should have been self.etcd_views, not self.agents. Since the two can differ, it doesn't make sense to reach etcd through agents, especially now that etcd is initialized at a different point from the agents.
Also, agent_id is now non-optional, since no use case needs the default agent's etcd.
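
A minimal sketch of the fix described in this reply, assuming the membership check and the lookup both use the etcd view mapping and that agent_id is required. `EtcdViews` is a hypothetical wrapper with strings in place of AgentEtcdClientView objects, and the exception is a local stand-in; it is not the merged code.

```python
class AgentIdNotFoundError(LookupError):
    """Stand-in for the PR's exception of the same name."""


class EtcdViews:
    """Hypothetical holder of per-agent etcd views."""

    def __init__(self, etcd_views: dict[str, str]) -> None:
        self._etcd_views = etcd_views

    def get_etcd(self, agent_id: str) -> str:
        # Check the same mapping that is returned from; the agents mapping may be
        # populated at a different time than the etcd views.
        if agent_id not in self._etcd_views:
            raise AgentIdNotFoundError(
                f"Agent '{agent_id}' not found in this runtime. "
                f"Available agents: {', '.join(self._etcd_views)}"
            )
        return self._etcd_views[agent_id]


views = EtcdViews({"agent-1": "etcd-view-1", "agent-2": "etcd-view-2"})
```

Checking and indexing the same dictionary means the guard can never pass while the lookup raises KeyError.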

Comment on lines 125 to 142
async def test_update_scaling_group_persists_single_agent(tmp_path) -> None:
    config_file = tmp_path / "agent.toml"
    config_file.write_text(
        """[agent]
backend = "dummy"
scaling-group = "default"
id = "test-agent"
[container]
scratch-type = "hostdir"
[resource]
[etcd]
namespace = "test"
addr = { host = "127.0.0.1", port = 2379 }
"""
    )
Collaborator

Please move the setup these tests need into fixtures, and group the tests into test classes.

Member Author

Done
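
A minimal sketch of what the requested restructuring could look like with pytest, reusing the config text from the snippet above. The fixture name, helper, and test class are hypothetical, and the test body only checks the written file because the scaling-group update API is not shown in this thread.

```python
from pathlib import Path

import pytest

AGENT_TOML = """[agent]
backend = "dummy"
scaling-group = "default"
id = "test-agent"
[container]
scratch-type = "hostdir"
[resource]
[etcd]
namespace = "test"
addr = { host = "127.0.0.1", port = 2379 }
"""


def write_agent_config(directory: Path) -> Path:
    """Write the single-agent config into the directory and return its path."""
    config_file = directory / "agent.toml"
    config_file.write_text(AGENT_TOML)
    return config_file


@pytest.fixture
def single_agent_config(tmp_path: Path) -> Path:
    # Shared setup lives in the fixture instead of being repeated in each test.
    return write_agent_config(tmp_path)


class TestSingleAgentConfig:
    """Tests that share the single-agent config fixture."""

    def test_config_declares_agent_id(self, single_agent_config: Path) -> None:
        assert 'id = "test-agent"' in single_agent_config.read_text()
```

Further tests on the same config can be added as methods of the class and request the same fixture.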

@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from d0beb05 to caef07f Compare November 13, 2025 05:22
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch 2 times, most recently from 01e37e8 to e224698 Compare November 13, 2025 06:26
@hhoikoo hhoikoo force-pushed the refactor/BA-3028 branch 2 times, most recently from 9cd62c9 to a6b40d6 Compare November 13, 2025 07:46
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from e224698 to ed9bd99 Compare November 13, 2025 07:52
Comment on lines 340 to 344

self.runtime = await AgentRuntime.new(
self.local_config,
self.etcd,
await self.runtime.create_agents(
self.stats_monitor,
self.error_monitor,
self.rpc_auth_agent_public_key,
Collaborator

I think it would be better to build this in a classmethod rather than in this method.
It's good for all fields to be ready by the time the object is constructed.

Member Author

Done
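
A minimal sketch of the pattern the reviewer describes: an async classmethod does all the setup first and then calls the constructor once with complete fields. `Runtime`, `create`, and `_start_agent` are hypothetical names with strings in place of agents; this is not the signature of AgentRuntime in the PR.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Runtime:
    """Hypothetical runtime whose fields are all set at construction."""

    agents: dict[str, str]

    @classmethod
    async def create(cls, agent_ids: list[str]) -> "Runtime":
        # Run every async setup step first, then construct once with complete fields.
        agents = {agent_id: await cls._start_agent(agent_id) for agent_id in agent_ids}
        return cls(agents=agents)

    @staticmethod
    async def _start_agent(agent_id: str) -> str:
        await asyncio.sleep(0)  # placeholder for real asynchronous initialization
        return f"started:{agent_id}"


runtime = asyncio.run(Runtime.create(["agent-1", "agent-2"]))
```

Callers never observe a half-initialized object, because there is no second step that fills in agents after construction.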

This change adds support for spawning multiple agents within a single
agent server. It adds an agent_id field to every relevant agent RPC
call and makes the manager send that field, so the agent server can
route each call to the correct agent.
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 0247e8a to 95deceb Compare November 14, 2025 00:55
@HyeockJinKim HyeockJinKim merged commit c458af0 into refactor/BA-3028 Nov 14, 2025
4 of 6 checks passed
@HyeockJinKim HyeockJinKim deleted the feat/BA-2753/multiple-agents branch November 14, 2025 04:49
@github-actions github-actions bot added the comp:common (Related to Common component) label Nov 14, 2025
Labels

comp:agent (Related to Agent component), comp:common (Related to Common component), comp:manager (Related to Manager component), size:XL (500~ LoC)

3 participants