feat(BA-2753): Spawn multiple agents and route RPC appropriately #6320

hhoikoo · 2025-10-22T02:02:12Z

This PR adds support for spawning multiple agents within the same agent server. It adds an agent_id field to all appropriate RPC calls in the agent server and ensures that the manager sends that field, so that the agent server can route each RPC call to the correct agent.

Checklist: (if applicable)

- Milestone metadata specifying the target backport version
- Mention to the original issue
- Installer updates including:
  - Fixtures for db schema changes
  - New mandatory config options
- Update of end-to-end CLI integration tests in ai.backend.test
- API server-client counterparts (e.g., manager API -> client SDK)
- Test case(s) to:
  - Demonstrate the difference of before/after
  - Demonstrate the flow of abstract/conceptual models with a concrete implementation
- Documentation
  - Contents in the docs directory
  - docstrings in public interfaces and type annotations

Copilot

Pull Request Overview

This PR implements support for spawning multiple agent instances within a single agent server process and routing RPC calls to the correct agent using an agent_id field. The changes enable a multi-agent runtime architecture where different agents can be configured with distinct resource allocations, port ranges, and configurations.

Key changes:

- Added agent_id parameter to all agent RPC methods to enable proper routing
- Modified AgentRPCServer to manage multiple agent instances via a mapping
- Introduced AggregateKernelRegistry to provide unified kernel access across agents
- Updated manager components to pass agent_id when making agent RPC calls
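The routing idea in the key changes above can be illustrated with a minimal sketch. It is not the PR's implementation: `ToyAgent`, `ToyRPCServer`, and `UnknownAgentError` are hypothetical stand-ins, and the real `AgentRPCServer` has many more methods and a different error type. The sketch only shows an RPC method taking `agent_id` and dispatching through a mapping.

```python
import asyncio


class UnknownAgentError(KeyError):
    """Hypothetical error raised when no agent matches the requested id."""


class ToyAgent:
    """Stand-in for a real agent; it only records which kernels it destroyed."""

    def __init__(self, agent_id: str) -> None:
        self.agent_id = agent_id
        self.destroyed: list[str] = []

    async def destroy_kernel(self, kernel_id: str) -> str:
        self.destroyed.append(kernel_id)
        return f"{self.agent_id}:{kernel_id}"


class ToyRPCServer:
    """Holds several agents in one process and routes calls by agent_id."""

    def __init__(self, agents: dict[str, ToyAgent]) -> None:
        self._agents = agents

    def _route(self, agent_id: str) -> ToyAgent:
        # Look up the target agent; fail loudly for an unknown id.
        try:
            return self._agents[agent_id]
        except KeyError:
            raise UnknownAgentError(agent_id) from None

    async def destroy_kernel(self, agent_id: str, kernel_id: str) -> str:
        # Every RPC method takes agent_id first and forwards the remaining arguments.
        return await self._route(agent_id).destroy_kernel(kernel_id)
```

On the caller side, the manager would pass the target agent's id with each call, for example `await server.destroy_kernel("a2", "k1")` in this toy setup.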

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `tests/manager/sokovan/scheduler/test_terminate_sessions.py` | Added agent_id parameter to destroy_kernel test calls |
| `tests/agent/test_config_server.py` | New test file for multi-agent configuration and RPC routing validation |
| `src/ai/backend/manager/sokovan/scheduler/scheduler.py` | Updated all agent RPC calls to include agent_id parameter |
| `src/ai/backend/manager/sokovan/scheduler/hooks/base.py` | Added agent_id to network destruction RPC calls |
| `src/ai/backend/manager/registry.py` | Updated agent RPC client calls to pass agent_id throughout |
| `src/ai/backend/manager/clients/agent/client.py` | Modified all client methods to accept and forward agent_id |
| `src/ai/backend/agent/server.py` | Refactored to support multiple agents with routing logic and aggregate registry |
| `src/ai/backend/agent/docker/agent.py` | Made metadata_server optional and removed per-agent initialization |
| `changes/6320.feature.md` | Changelog entry for the feature |


src/ai/backend/agent/docker/agent.py

src/ai/backend/manager/clients/agent/client.py

Copilot

Pull Request Overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.


src/ai/backend/manager/registry.py

src/ai/backend/manager/clients/agent/client.py

src/ai/backend/agent/server.py

src/ai/backend/agent/config/unified.py

src/ai/backend/manager/clients/agent/client.py

HyeockJinKim · 2025-11-13T02:33:08Z

src/ai/backend/agent/runtime.py

+class AgentIdNotFoundError(BackendAIError):
+    @classmethod
+    def error_code(cls) -> ErrorCode:
+        return ErrorCode(
+            domain=ErrorDomain.AGENT,
+            operation=ErrorOperation.ACCESS,
+            error_detail=ErrorDetail.NOT_FOUND,
+        )


Please write the exception in a separate file.

Done. I followed the style of the manager/ module.

HyeockJinKim · 2025-11-13T02:36:46Z

src/ai/backend/agent/runtime.py

+        if agent_id is None:
+            agent_id = self._default_agent_id


If we keep a _default_agent instead of a _default_agent_id, I think we can return early here and not have to worry about whether this affects the code that follows.

Good point - done
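A minimal sketch of the early-return shape discussed in this thread, under stated assumptions: `ToyRuntime` is a hypothetical stand-in for the runtime class, agents are plain objects, and `AgentIdNotFoundError` is simplified to a bare `Exception` subclass (the PR's class derives from `BackendAIError`).

```python
from typing import Optional


class AgentIdNotFoundError(Exception):
    """Simplified stand-in for the PR's error class."""


class ToyRuntime:
    def __init__(self, agents: dict[str, object], default_agent_id: str) -> None:
        self._agents = dict(agents)
        # Store the default agent itself rather than only its id.
        self._default_agent = self._agents[default_agent_id]

    def get_agent(self, agent_id: Optional[str] = None) -> object:
        if agent_id is None:
            # Early return: the lookup below never sees a substituted id.
            return self._default_agent
        try:
            return self._agents[agent_id]
        except KeyError:
            raise AgentIdNotFoundError(
                f"Agent '{agent_id}' not found in this runtime."
            ) from None
```

With this shape, the `None` case is fully handled in one place, and the rest of the method only deals with an explicit id.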

HyeockJinKim · 2025-11-13T02:37:36Z

src/ai/backend/agent/runtime.py

+    def get_etcd(self, agent_id: Optional[AgentId]) -> AgentEtcdClientView:
+        if agent_id is None:
+            agent_id = self._default_agent_id
+        if agent_id not in self.agents:
+            raise AgentIdNotFoundError(
+                f"Agent '{agent_id}' not found in this runtime. "
+                f"Available agents: {', '.join(self.agents.keys())}"
+            )
+        return self.etcd_views[agent_id]


Instead of repeating the same logic, we could call get_agent(agent_id).id() and use that id value.

There was actually a bug here: the dictionary being checked should have been self.etcd_views, not self.agents. Since the two can differ, it doesn't really make sense to access etcd through agents, especially now that etcd is initialized at a different point than the agents are.
Also, agent_id is now non-optional, as there are no use cases where the default agent's etcd is needed.
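The fix described in this reply can be sketched as follows. This is an illustration under assumptions, not the merged code: `ToyRuntime` is hypothetical, etcd views are plain objects, and `AgentIdNotFoundError` is again a simplified stand-in. The membership check and the lookup both use the etcd view mapping, and `agent_id` is required.

```python
class AgentIdNotFoundError(Exception):
    """Simplified stand-in for the PR's error class."""


class ToyRuntime:
    def __init__(self, etcd_views: dict[str, object]) -> None:
        # The etcd views may be populated before any agent exists.
        self._etcd_views = dict(etcd_views)

    def get_etcd(self, agent_id: str) -> object:
        # Check the same mapping that is indexed below, not the agents mapping.
        if agent_id not in self._etcd_views:
            raise AgentIdNotFoundError(
                f"Agent '{agent_id}' not found in this runtime. "
                f"Available agents: {', '.join(self._etcd_views)}"
            )
        return self._etcd_views[agent_id]
```

Checking the mapping that is actually indexed avoids a `KeyError` in the case where an agent is registered but its etcd view is not.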

HyeockJinKim · 2025-11-13T02:59:39Z

tests/agent/test_agent.py

+async def test_update_scaling_group_persists_single_agent(tmp_path) -> None:
+    config_file = tmp_path / "agent.toml"
+    config_file.write_text(
+        """[agent]
+backend = "dummy"
+scaling-group = "default"
+id = "test-agent"
+
+[container]
+scratch-type = "hostdir"
+
+[resource]
+
+[etcd]
+namespace = "test"
+addr = { host = "127.0.0.1", port = 2379 }
+"""
+    )


Please move the parts needed for setup into fixtures, and group the tests into test classes.
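One possible shape for that request, as a sketch only: the config text is a trimmed copy of the one in the diff, while `write_agent_config`, `agent_config_file`, and `TestAgentConfigFile` are hypothetical names. The assertions are placeholders and do not exercise the scaling-group update that the original test covers.

```python
from pathlib import Path

import pytest

AGENT_TOML = """[agent]
backend = "dummy"
scaling-group = "default"
id = "test-agent"
"""


def write_agent_config(directory: Path) -> Path:
    """Write a minimal agent config file and return its path."""
    config_file = directory / "agent.toml"
    config_file.write_text(AGENT_TOML)
    return config_file


@pytest.fixture
def agent_config_file(tmp_path: Path) -> Path:
    # Shared setup: each test receives a fresh config file in its own tmp dir.
    return write_agent_config(tmp_path)


class TestAgentConfigFile:
    """Related tests grouped in one class, all using the same fixture."""

    def test_contains_agent_id(self, agent_config_file: Path) -> None:
        assert 'id = "test-agent"' in agent_config_file.read_text()

    def test_contains_scaling_group(self, agent_config_file: Path) -> None:
        assert 'scaling-group = "default"' in agent_config_file.read_text()
```

Keeping the file-writing logic in a plain helper means the fixture stays a one-liner and the helper can be reused outside pytest.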

HyeockJinKim · 2025-11-14T00:24:28Z

src/ai/backend/agent/server.py


-        self.runtime = await AgentRuntime.new(
-            self.local_config,
-            self.etcd,
+        await self.runtime.create_agents(
            self.stats_monitor,
            self.error_monitor,
            self.rpc_auth_agent_public_key,


I think it would be better to construct it through a classmethod rather than through this method.
It's good to have all fields prepared at construction time.
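The suggestion can be sketched as an async factory classmethod. This is a hedged illustration: `ToyRuntime`, `create`, and `_spawn` are hypothetical, and agents are represented as strings. The asynchronous setup happens inside the factory, so `__init__` receives fully prepared fields.

```python
import asyncio


class ToyRuntime:
    def __init__(self, agents: dict[str, str]) -> None:
        # Every field is assigned here; there is no half-initialized state.
        self._agents = agents

    @classmethod
    async def create(cls, agent_ids: list[str]) -> "ToyRuntime":
        # Do all asynchronous setup first, then construct in one step.
        agents: dict[str, str] = {}
        for agent_id in agent_ids:
            agents[agent_id] = await cls._spawn(agent_id)
        return cls(agents)

    @staticmethod
    async def _spawn(agent_id: str) -> str:
        # Placeholder for real asynchronous agent startup.
        await asyncio.sleep(0)
        return f"agent<{agent_id}>"

    def agent_ids(self) -> list[str]:
        return list(self._agents)
```

A caller would then write `runtime = await ToyRuntime.create(["a", "b"])` instead of constructing the object and calling a separate initialization method afterwards.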


hhoikoo self-assigned this Oct 22, 2025

github-actions bot added the size:XL (500~ LoC), comp:manager (Related to Manager component), and comp:agent (Related to Agent component) labels Oct 22, 2025

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 837e919 to 873b8b8 Compare October 22, 2025 02:03

hhoikoo changed the title ~~feat(BA-2750): Add support for array in config generator~~ feat(BA-2753): Spawn multiple agents and route RPC appropriately Oct 22, 2025

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 873b8b8 to af5155a Compare October 22, 2025 02:51

hhoikoo requested a review from Copilot October 22, 2025 04:02

Copilot AI reviewed Oct 22, 2025


src/ai/backend/agent/docker/agent.py (outdated, resolved)

src/ai/backend/manager/clients/agent/client.py (outdated, resolved)

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch 3 times, most recently from 24baee8 to b5f4721 Compare October 23, 2025 07:48

hhoikoo requested a review from Copilot October 23, 2025 07:49

Copilot AI reviewed Oct 23, 2025


HyeockJinKim reviewed Oct 23, 2025


src/ai/backend/agent/config/unified.py (outdated, resolved)

src/ai/backend/manager/clients/agent/client.py (outdated, resolved)

hhoikoo force-pushed the feat/BA-2752/multiple-agents-config branch from 82933d0 to 2b706f0 Compare October 23, 2025 10:35

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch 4 times, most recently from b1e3834 to 28d181d Compare October 24, 2025 05:00

hhoikoo force-pushed the feat/BA-2752/multiple-agents-config branch from 2b706f0 to a4dfa0b Compare October 27, 2025 01:38

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch 3 times, most recently from 5cb52c5 to 9f12687 Compare October 27, 2025 08:28

hhoikoo force-pushed the feat/BA-2752/multiple-agents-config branch from a4dfa0b to f720f42 Compare November 3, 2025 00:58

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 9f12687 to fdee4b0 Compare November 3, 2025 01:05

hhoikoo force-pushed the feat/BA-2752/multiple-agents-config branch from f720f42 to daec211 Compare November 4, 2025 06:05

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from fdee4b0 to 90f0702 Compare November 4, 2025 06:10

hhoikoo force-pushed the feat/BA-2752/multiple-agents-config branch from daec211 to 515dea3 Compare November 4, 2025 06:31

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 90f0702 to e2b1902 Compare November 4, 2025 06:35

HyeockJinKim reviewed Nov 13, 2025


hhoikoo force-pushed the refactor/BA-3028 branch from 386d3c6 to 09b2462 Compare November 13, 2025 04:56

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from d0beb05 to caef07f Compare November 13, 2025 05:22

hhoikoo force-pushed the refactor/BA-3028 branch from 09b2462 to f442d87 Compare November 13, 2025 06:02

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch 2 times, most recently from 01e37e8 to e224698 Compare November 13, 2025 06:26

hhoikoo force-pushed the refactor/BA-3028 branch 2 times, most recently from 9cd62c9 to a6b40d6 Compare November 13, 2025 07:46

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from e224698 to ed9bd99 Compare November 13, 2025 07:52

HyeockJinKim reviewed Nov 14, 2025


hhoikoo force-pushed the refactor/BA-3028 branch from a6b40d6 to 324e900 Compare November 14, 2025 00:53

hhoikoo added 8 commits November 14, 2025 09:54

refactor(BA-2753): Move MetadataServer out of DockerAgent

61a934a

refactor(BA-2753): Make all fields in AgentRuntime private

f93e3bc

refactor(BA-2753): Respond to feedback

e106bbe

refactor(BA-2753): Move error classes to its own module

3145b0e

test(BA-2753): Remove malformed test

0576837

test(BA-2753): Add tests

21ae7e2

test(BA-2753): Organize tests into classes

95deceb

hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 0247e8a to 95deceb Compare November 14, 2025 00:55

hhoikoo added 2 commits November 14, 2025 10:43

feat(BA-2753): Add static method for constructing AgentRuntime

4f91c5d

fix(BA-2753): Actually initialize MetadataServer

3d601af

HyeockJinKim approved these changes Nov 14, 2025


feat(BA-3024): Add custom resource alloc for agents in config (#6724)

4aa0fce

HyeockJinKim merged commit c458af0 into refactor/BA-3028 Nov 14, 2025
4 of 6 checks passed

HyeockJinKim deleted the feat/BA-2753/multiple-agents branch November 14, 2025 04:49

github-actions bot added the comp:common (Related to Common component) label Nov 14, 2025


3 participants
