-
Notifications
You must be signed in to change notification settings - Fork 163
feat(BA-2753): Spawn multiple agents and route RPC appropriately #6320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
837e919 to
873b8b8
Compare
873b8b8 to
af5155a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements support for spawning multiple agent instances within a single agent server process and routing RPC calls to the correct agent using an agent_id field. The changes enable a multi-agent runtime architecture where different agents can be configured with distinct resource allocations, port ranges, and configurations.
Key changes:
- Added
agent_idparameter to all agent RPC methods to enable proper routing - Modified
AgentRPCServerto manage multiple agent instances via a mapping - Introduced
AggregateKernelRegistryto provide unified kernel access across agents - Updated manager components to pass
agent_idwhen making agent RPC calls
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
tests/manager/sokovan/scheduler/test_terminate_sessions.py |
Added agent_id parameter to destroy_kernel test calls |
tests/agent/test_config_server.py |
New test file for multi-agent configuration and RPC routing validation |
src/ai/backend/manager/sokovan/scheduler/scheduler.py |
Updated all agent RPC calls to include agent_id parameter |
src/ai/backend/manager/sokovan/scheduler/hooks/base.py |
Added agent_id to network destruction RPC calls |
src/ai/backend/manager/registry.py |
Updated agent RPC client calls to pass agent_id throughout |
src/ai/backend/manager/clients/agent/client.py |
Modified all client methods to accept and forward agent_id |
src/ai/backend/agent/server.py |
Refactored to support multiple agents with routing logic and aggregate registry |
src/ai/backend/agent/docker/agent.py |
Made metadata_server optional and removed per-agent initialization |
changes/6320.feature.md |
Changelog entry for the feature |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
24baee8 to
b5f4721
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
82933d0 to
2b706f0
Compare
b1e3834 to
28d181d
Compare
2b706f0 to
a4dfa0b
Compare
5cb52c5 to
9f12687
Compare
a4dfa0b to
f720f42
Compare
9f12687 to
fdee4b0
Compare
f720f42 to
daec211
Compare
fdee4b0 to
90f0702
Compare
daec211 to
515dea3
Compare
90f0702 to
e2b1902
Compare
src/ai/backend/agent/runtime.py
Outdated
| class AgentIdNotFoundError(BackendAIError): | ||
| @classmethod | ||
| def error_code(cls) -> ErrorCode: | ||
| return ErrorCode( | ||
| domain=ErrorDomain.AGENT, | ||
| operation=ErrorOperation.ACCESS, | ||
| error_detail=ErrorDetail.NOT_FOUND, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please write the exception in a separate file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I followed the style of manager/ module
src/ai/backend/agent/runtime.py
Outdated
| if agent_id is None: | ||
| agent_id = self._default_agent_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we handle it with _default_agent instead of _default_agent_id, I think we can early return and not have to worry about whether there will be an impact on the code afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - done
src/ai/backend/agent/runtime.py
Outdated
| def get_etcd(self, agent_id: Optional[AgentId]) -> AgentEtcdClientView: | ||
| if agent_id is None: | ||
| agent_id = self._default_agent_id | ||
| if agent_id not in self.agents: | ||
| raise AgentIdNotFoundError( | ||
| f"Agent '{agent_id}' not found in this runtime. " | ||
| f"Available agents: {', '.join(self.agents.keys())}" | ||
| ) | ||
| return self.etcd_views[agent_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of repeating the same logic, we call get_agent(agent_id).id() to use the id value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was actually a bug here (the dictionary being checked should not have been self.agents, but rather self.etcd_views) and if that's different, it doesn't really make sense to get access to etcd through agents, especially since now the point at which etcd is initialized is different to when agents are initialized.
Also agent_id is now non-optional as there are no use cases where default agent's etcd is needed.
tests/agent/test_agent.py
Outdated
| async def test_update_scaling_group_persists_single_agent(tmp_path) -> None: | ||
| config_file = tmp_path / "agent.toml" | ||
| config_file.write_text( | ||
| """[agent] | ||
| backend = "dummy" | ||
| scaling-group = "default" | ||
| id = "test-agent" | ||
| [container] | ||
| scratch-type = "hostdir" | ||
| [resource] | ||
| [etcd] | ||
| namespace = "test" | ||
| addr = { host = "127.0.0.1", port = 2379 } | ||
| """ | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please separate the necessary parts for setup into fixtures, and group the test codes based on TestClass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
386d3c6 to
09b2462
Compare
d0beb05 to
caef07f
Compare
09b2462 to
f442d87
Compare
01e37e8 to
e224698
Compare
9cd62c9 to
a6b40d6
Compare
e224698 to
ed9bd99
Compare
|
|
||
| self.runtime = await AgentRuntime.new( | ||
| self.local_config, | ||
| self.etcd, | ||
| await self.runtime.create_agents( | ||
| self.stats_monitor, | ||
| self.error_monitor, | ||
| self.rpc_auth_agent_public_key, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking it might be better to create and generate it with classmethod rather than this method.
It's good for all fields to be prepared at the constructor point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
a6b40d6 to
324e900
Compare
This change adds support for actually spawning multiple agents within the same agent server and adding agent_id field for all appropriate RPC calls in the agent server, then ensuring that the manager sends that info such that the agent server can correctly route the RPC calls to the correct agent.
0247e8a to
95deceb
Compare
resolves #6314 (BA-2753)
This PR adds support for actually spawning multiple agents within the same agent server and adding agent_id field for all appropriate RPC calls in the agent server, then ensuring that the manager sends that info such that the agent server can correctly route the RPC calls to the correct agent.
Checklist: (if applicable)
ai.backend.testdocsdirectory