feat(BA-2851): Add resource isolation options for multi-agent #6498
base: feat/BA-3024/multi-agent-resources-config
Pull Request Overview
This PR introduces resource isolation options for multi-agent setups, enabling multiple agents to run on the same physical host with controlled resource allocation. The implementation adds three allocation modes: SHARED (default, backward compatible), AUTO_SPLIT (automatic equal division), and MANUAL (explicit per-agent configuration).
Key changes:
- Introduces `ResourcePartitioner` class to manage resource allocation across agents
- Adds `ResourceAllocationMode` enum with SHARED, AUTO_SPLIT, and MANUAL modes
- Implements validation logic to ensure consistent manual allocations across agents
- Updates agent initialization to use resource partitioning
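The three modes can be sketched roughly as follows. Only the mode names come from this PR; the enum string values, the `partition_slots` helper, and the slot names are hypothetical and are not the actual `ResourcePartitioner` API.

```python
from decimal import Decimal
from enum import Enum


class ResourceAllocationMode(Enum):
    # Mode names are from the PR; the string values are assumptions.
    SHARED = "shared"
    AUTO_SPLIT = "auto-split"
    MANUAL = "manual"


def partition_slots(mode, total, num_agents, manual=None):
    """Return one slot mapping per agent (hypothetical helper)."""
    if mode is ResourceAllocationMode.SHARED:
        # Every agent sees the full pool, as before this change.
        return [dict(total) for _ in range(num_agents)]
    if mode is ResourceAllocationMode.AUTO_SPLIT:
        # Equal division of each slot among the agents.
        return [
            {name: amount / num_agents for name, amount in total.items()}
            for _ in range(num_agents)
        ]
    # MANUAL: the caller supplies exactly one mapping per agent.
    if manual is None or len(manual) != num_agents:
        raise ValueError("MANUAL mode needs one allocation per agent")
    return [dict(alloc) for alloc in manual]
```

This sketch divides every slot fractionally; how the real implementation splits discrete devices that do not divide evenly is not stated in this PR description.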
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 23 comments.
| File | Description |
|---|---|
| src/ai/backend/agent/resources.py | Adds ResourcePartitioner class and changes abstract methods to raise NotImplementedError |
| src/ai/backend/agent/config/unified.py | Defines allocation modes, new config fields (allocated_cpu/mem/disk/devices), and validation logic |
| src/ai/backend/agent/agent.py | Integrates ResourcePartitioner into agent initialization and updates slot calculations |
| src/ai/backend/agent/server.py | Creates ResourcePartitioner instances per agent and adds resource reconciliation |
| src/ai/backend/agent/docker/agent.py | Adds resource_partitioner parameter to constructor |
| src/ai/backend/agent/kubernetes/agent.py | Adds resource_partitioner parameter to constructor |
| tests/agent/test_resource_allocation.py | Comprehensive unit tests for all three allocation modes |
| tests/agent/test_config_validation.py | Tests for config validation of allocation modes and device consistency |
| tests/agent/docker/test_agent.py | Updates test to pass ResourcePartitioner to agent |
| changes/6498.feature.md | Changelog entry |
This change implements configuration for partitioning resources. SHARED mode lets every agent see the full resources; this matches the previous behavior and is useful for stress testing. AUTO_SPLIT divides resources equally among agents. MANUAL mode lets users specify exact per-agent allocations for all resources. Single-agent deployments remain unaffected and retain access to all available hardware resources.
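As an illustration of what validating MANUAL allocations might involve, here is a minimal sketch. The `allocated_cpu` and `allocated_mem` field names come from the config changes listed in this PR; the `validate_manual_allocations` function and the specific rules it checks (every agent sets every field, and the requested sum fits the host capacity) are assumptions, not the PR's actual validation logic.

```python
def validate_manual_allocations(agents, capacity):
    """Raise ValueError if the manual per-agent settings are unusable.

    `agents` maps an agent name to its config dict; `capacity` maps each
    field name to the host-wide amount. Both shapes are hypothetical.
    """
    fields = ("allocated_cpu", "allocated_mem")
    for name, cfg in agents.items():
        # Assumed rule: MANUAL mode requires every field on every agent.
        missing = [f for f in fields if cfg.get(f) is None]
        if missing:
            raise ValueError(f"agent {name!r} is missing {missing}")
    for f in fields:
        # Assumed rule: the agents together must not exceed the host.
        requested = sum(cfg[f] for cfg in agents.values())
        if requested > capacity[f]:
            raise ValueError(f"{f}: requested {requested}, capacity {capacity[f]}")
```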
Now that AgentRuntime handles device allocations, individual agents should no longer have potentially different plugin configurations. In fact, it was never appropriate for different agents on the same physical server to load different plugins.
This change modifies the semantics of ResourcePartitioner so that it takes ownership of the devices and injects the partitioned devices into individual agents after initialization.
There was an issue with AgentRuntime: the initialization code for the Agent Server required the etcd views of individual agents. This change fixes it by splitting agent runtime initialization into two phases: first initializing the non-async parameters (including the etcd views), then initializing the agents themselves.
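The two-phase pattern described above can be sketched as follows. `AgentRuntimeSketch`, its method names, and the string stand-ins for etcd views are all hypothetical; the real AgentRuntime is not shown in this PR description.

```python
import asyncio


class AgentRuntimeSketch:
    """Hypothetical stand-in for a runtime with two-phase initialization."""

    def __init__(self, agent_ids):
        # Phase 1: synchronous setup only; nothing here awaits.
        # The strings stand in for per-agent etcd views.
        self.etcd_views = {aid: f"etcd-view:{aid}" for aid in agent_ids}
        self.agents = {}

    async def create_agents(self):
        # Phase 2: async agent construction, which can rely on the views
        # that phase 1 already prepared.
        for aid, view in self.etcd_views.items():
            self.agents[aid] = await self._create_agent(aid, view)

    async def _create_agent(self, aid, view):
        await asyncio.sleep(0)  # placeholder for real async startup work
        return {"id": aid, "etcd_view": view}
```

With this shape, server code that needs the etcd views can read them right after construction, before any agent has been started.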
There was an issue where agent config generation ran too early, before the Redis config injection had been applied. This change fixes it.
This change fixes a bug in resource splitting where reserved resources were accidentally included in the total allocated to each agent. The cause was the way total slots were handled: reserved resources were calculated from the perspective of a single agent without properly accounting for the server's reserved resources. The fix inverts the condition so that reserved resources are deducted only where needed.
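One possible reading of the fix, as a minimal sketch: deduct the server-wide reservation once, then split what remains. The `split_after_reservation` helper and the slot names are hypothetical, and the PR does not spell out exactly where the real code performs the deduction.

```python
from decimal import Decimal


def split_after_reservation(total, reserved, num_agents):
    """Deduct reserved amounts once, then divide the rest equally."""
    usable = {
        name: amount - reserved.get(name, Decimal(0))
        for name, amount in total.items()
    }
    if any(amount < 0 for amount in usable.values()):
        raise ValueError("reserved resources exceed the host total")
    # Each agent receives a share of the usable pool only, so reserved
    # resources never appear in a per-agent total.
    return [
        {name: amount / num_agents for name, amount in usable.items()}
        for _ in range(num_agents)
    ]
```

For example, a host with 16 CPU cores and 2 reserved would give each of two agents 7 cores under this sketch, rather than 8.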
resolves #6432 (BA-2851)
This change adds configuration for partitioning resources rather than every agent always seeing the full resource pool. This prevents unintended over-allocation that could crash kernels.
Single-agent deployments remain unaffected and retain access to all available hardware resources.