@hhoikoo
Member

@hhoikoo hhoikoo commented Oct 31, 2025

resolves #6432 (BA-2851)

This change adds configuration for partitioning resources among agents, instead of letting every agent always see the full resource pool. This prevents unintended over-allocation that could crash kernels.

  • SHARED: allows all agents to see full resources (useful for stress testing). This is the same behavior as before.
  • AUTO_SPLIT: automatically divides resources equally among agents.
  • MANUAL: lets users specify exact per-agent allocations for all resources.

Single-agent deployments remain unaffected and retain access to all available hardware resources.
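
To illustrate the three modes described above, here is a minimal sketch of how the CPU core count seen by one agent could be computed. It is not the code in this PR: the function name `visible_cpu_cores`, the enum string values, the list-based `manual` argument, and the handling of a remainder in AUTO_SPLIT are all assumptions made for illustration.

```python
from enum import Enum


class ResourceAllocationMode(Enum):
    # Member names follow the PR description; the string values are assumed.
    SHARED = "shared"
    AUTO_SPLIT = "auto-split"
    MANUAL = "manual"


def visible_cpu_cores(mode, total_cores, agent_index, num_agents, manual=None):
    """Return the CPU core count one agent sees (hypothetical helper)."""
    # A single agent, or SHARED mode, sees the full pool.
    if num_agents == 1 or mode is ResourceAllocationMode.SHARED:
        return total_cores
    if mode is ResourceAllocationMode.AUTO_SPLIT:
        # Equal division; giving leftover cores to the first agents
        # is an assumption of this sketch.
        base, extra = divmod(total_cores, num_agents)
        return base + (1 if agent_index < extra else 0)
    if mode is ResourceAllocationMode.MANUAL:
        if manual is None:
            raise ValueError("MANUAL mode needs explicit per-agent values")
        return manual[agent_index]
    raise ValueError(f"unknown mode: {mode!r}")
```

With 10 cores and 3 agents, this sketch gives 4, 3, and 3 cores in AUTO_SPLIT mode, while SHARED reports 10 cores to every agent.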

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

@github-actions github-actions bot added size:XL 500~ LoC comp:agent Related to Agent component labels Oct 31, 2025
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from e6c1f4b to d84258e Compare October 31, 2025 01:30
@hhoikoo hhoikoo requested a review from Copilot October 31, 2025 01:30
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces resource isolation options for multi-agent setups, enabling multiple agents to run on the same physical host with controlled resource allocation. The implementation adds three allocation modes: SHARED (default, backward compatible), AUTO_SPLIT (automatic equal division), and MANUAL (explicit per-agent configuration).

Key changes:

  • Introduces ResourcePartitioner class to manage resource allocation across agents
  • Adds ResourceAllocationMode enum with SHARED, AUTO_SPLIT, and MANUAL modes
  • Implements validation logic to ensure consistent manual allocations across agents
  • Updates agent initialization to use resource partitioning
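
The validation of manual allocations mentioned in the list above could look roughly like the following. This is a hedged sketch only: the function `validate_manual_allocations`, the dictionary shape, and the three specific checks (every agent sets a CPU value, all agents name the same devices, the sum fits the host) are assumptions, not the rules implemented in `src/ai/backend/agent/config/unified.py`.

```python
def validate_manual_allocations(allocations, total_cpu):
    """Return a list of problems found in MANUAL-mode settings.

    `allocations` maps an agent id to a dict with a "cpu" value and a
    "devices" mapping (hypothetical shape chosen for this sketch).
    """
    errors = []

    # Assumed rule: every agent must state its CPU allocation explicitly.
    missing = sorted(aid for aid, a in allocations.items() if a.get("cpu") is None)
    if missing:
        errors.append(f"agents missing a CPU allocation: {missing}")

    # Assumed rule: all agents must name the same set of devices.
    device_sets = {frozenset(a.get("devices", {})) for a in allocations.values()}
    if len(device_sets) > 1:
        errors.append("agents declare different sets of devices")

    # Assumed rule: the sum of allocations must fit the physical host.
    requested = sum(a.get("cpu") or 0 for a in allocations.values())
    if requested > total_cpu:
        errors.append(f"requested {requested} CPU cores but only {total_cpu} exist")

    return errors
```

Returning a list of messages instead of raising on the first failure lets an operator see every misconfiguration at once.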

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 23 comments.

| File | Description |
| --- | --- |
| src/ai/backend/agent/resources.py | Adds ResourcePartitioner class and changes abstract methods to raise NotImplementedError |
| src/ai/backend/agent/config/unified.py | Defines allocation modes, new config fields (allocated_cpu/mem/disk/devices), and validation logic |
| src/ai/backend/agent/agent.py | Integrates ResourcePartitioner into agent initialization and updates slot calculations |
| src/ai/backend/agent/server.py | Creates ResourcePartitioner instances per agent and adds resource reconciliation |
| src/ai/backend/agent/docker/agent.py | Adds resource_partitioner parameter to constructor |
| src/ai/backend/agent/kubernetes/agent.py | Adds resource_partitioner parameter to constructor |
| tests/agent/test_resource_allocation.py | Comprehensive unit tests for all three allocation modes |
| tests/agent/test_config_validation.py | Tests for config validation of allocation modes and device consistency |
| tests/agent/docker/test_agent.py | Updates test to pass ResourcePartitioner to agent |
| changes/6498.feature.md | Changelog entry |

@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from d84258e to c5114a9 Compare October 31, 2025 03:56
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 9f12687 to fdee4b0 Compare November 3, 2025 01:05
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 3 times, most recently from 310d847 to 3faac0f Compare November 4, 2025 06:02
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from fdee4b0 to 90f0702 Compare November 4, 2025 06:10
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 2 times, most recently from 36824ac to 279e71b Compare November 4, 2025 06:30
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 90f0702 to e2b1902 Compare November 4, 2025 06:35
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 3 times, most recently from 280831f to db07080 Compare November 4, 2025 10:32
@github-actions github-actions bot added the comp:manager Related to Manager component label Nov 4, 2025
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 3 times, most recently from ce120ef to 13c7be6 Compare November 6, 2025 01:01
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from e2b1902 to 9c34302 Compare November 6, 2025 01:08
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 2 times, most recently from 80ecc2c to 7462fd8 Compare November 6, 2025 01:10
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 9c34302 to 04a0f3a Compare November 6, 2025 01:47
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from 7462fd8 to e17be1c Compare November 6, 2025 01:52
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 04a0f3a to 92c3bd7 Compare November 6, 2025 01:58
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from e17be1c to d936ce3 Compare November 6, 2025 01:59
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 92c3bd7 to f0f7510 Compare November 6, 2025 02:07
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch 2 times, most recently from 1271077 to 1169ebb Compare November 10, 2025 10:01
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 2 times, most recently from 4544fbb to b3ad26a Compare November 10, 2025 10:26
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch 3 times, most recently from d5593fd to e808760 Compare November 11, 2025 02:42
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 2 times, most recently from 80bf8ad to cf33c40 Compare November 11, 2025 04:06
@hhoikoo hhoikoo changed the base branch from feat/BA-2753/multiple-agents to feat/BA-3024/multi-agent-resources-config November 11, 2025 04:06
@hhoikoo hhoikoo force-pushed the feat/BA-3024/multi-agent-resources-config branch from 103dbcc to 6d2e3d2 Compare November 11, 2025 04:48
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 2 times, most recently from 3b906c6 to 0f2e48a Compare November 11, 2025 05:55
@hhoikoo hhoikoo force-pushed the feat/BA-3024/multi-agent-resources-config branch 2 times, most recently from 79d9b6a to a6f3c04 Compare November 12, 2025 02:18
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from 0f2e48a to 3c536c3 Compare November 12, 2025 04:13
@hhoikoo hhoikoo force-pushed the feat/BA-3024/multi-agent-resources-config branch from a6f3c04 to 73bf45c Compare November 12, 2025 04:26
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from 3c536c3 to 4f27da5 Compare November 12, 2025 04:31
@hhoikoo hhoikoo force-pushed the feat/BA-3024/multi-agent-resources-config branch from 73bf45c to 66b563f Compare November 12, 2025 05:06
This change implements configuration for partitioning resources.

SHARED mode allows all agents to see full resources (useful for
stress testing). This is the same behavior as before.
AUTO_SPLIT automatically divides resources equally among agents.
MANUAL mode lets users specify exact per-agent allocations for all
resources.

Single-agent deployments remain unaffected and retain access to all
available hardware resources.
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from 4f27da5 to de52206 Compare November 12, 2025 05:14
Now that AgentRuntime handles device allocations, individual agents
should no longer have potentially different plugin configurations. In
fact, it was never appropriate for different agents on the same physical
server to load different plugins.
This change modifies the semantics of ResourcePartitioner so that it
takes ownership of the devices and injects partitioned devices into
individual agents after initialization.
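
The ownership change described in this commit message can be pictured with the sketch below. It is only an illustration under stated assumptions: `AgentSketch` and `DevicePartitionerSketch` are hypothetical stand-ins (not the real Agent or ResourcePartitioner classes), devices are plain strings, and the round-robin split is an assumed policy.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSketch:
    # Hypothetical stand-in for an agent; it starts with no devices.
    agent_id: str
    devices: list = field(default_factory=list)


class DevicePartitionerSketch:
    """Owns the full device list and hands out slices to agents."""

    def __init__(self, devices):
        # The partitioner, not the agents, holds the discovered devices.
        self._devices = list(devices)

    def inject(self, agents):
        # Called after the agents exist: each agent receives its share.
        # Round-robin assignment is an assumption of this sketch.
        for index, agent in enumerate(agents):
            agent.devices = self._devices[index::len(agents)]


agents = [AgentSketch("agent-1"), AgentSketch("agent-2")]
DevicePartitionerSketch(["gpu0", "gpu1", "gpu2", "gpu3"]).inject(agents)
```

Because devices are discovered once and then distributed, every agent on the host necessarily works from the same plugin set.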
@github-actions github-actions bot added the comp:common Related to Common component label Nov 12, 2025
@hhoikoo hhoikoo removed the comp:manager Related to Manager component label Nov 12, 2025
There was an issue with AgentRuntime, where the initialization code for
Agent Server required use of etcd views of individual agents. This
change fixes this by splitting the initialization of agent runtime into
two phases: initializing non-async parameters (including etcd views),
and initializing agents themselves.
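
A minimal sketch of the two-phase pattern this commit message describes, assuming a hypothetical `AgentRuntimeSketch` class. The etcd views and agents are placeholder strings and dicts, not the real objects.

```python
import asyncio


class AgentRuntimeSketch:
    """Hypothetical runtime showing synchronous setup, then async agent init."""

    def __init__(self, agent_ids):
        # Phase 1: non-async parameters, including per-agent etcd views,
        # are ready as soon as the constructor returns.
        self.etcd_views = {aid: f"etcd-view:{aid}" for aid in agent_ids}
        self.agents = {}

    async def init_agents(self):
        # Phase 2: create the agents, which needs a running event loop.
        for aid, view in self.etcd_views.items():
            self.agents[aid] = await self._create_agent(aid, view)

    async def _create_agent(self, aid, view):
        await asyncio.sleep(0)  # placeholder for real asynchronous setup
        return {"id": aid, "etcd_view": view}


runtime = AgentRuntimeSketch(["agent-1", "agent-2"])
views_ready_before_agents = bool(runtime.etcd_views) and not runtime.agents
asyncio.run(runtime.init_agents())
```

Server initialization code that needs the etcd views can run between the two phases, before any agent has been created.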
There was an issue where agent config generation ran too early, before
the redis config injection had been applied. This change fixes it.
This change fixes a bug in resource splitting, where reserved resources
were accidentally included in the total allocated to each agent. The
cause was in how total slots were handled: reserved resources were
calculated from the perspective of a single agent without properly
accounting for server-level reserved resources. This change fixes the
issue by inverting the condition, so that reserved resources are
deducted only where needed.
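
One way to read the arithmetic behind this fix, as a sketch only: the helper `split_available` and the numbers are hypothetical, and the real slot calculation in the agent may differ.

```python
def split_available(total, reserved, num_agents):
    """Split what remains after server-level reservations (hypothetical helper)."""
    # Deduct the reserved amount once, at the server level, before splitting.
    available = total - reserved
    if available < 0:
        raise ValueError("reserved amount exceeds total")
    # Integer division; any remainder stays unassigned in this sketch.
    return [available // num_agents] * num_agents


# 64 GiB of memory, 8 GiB reserved for the server, two agents.
buggy_shares = [64 // 2] * 2             # [32, 32]: reserved memory is handed out
fixed_shares = split_available(64, 8, 2)  # [28, 28]: reserved memory is kept back
```

In the buggy variant the two shares add up to the full 64 GiB, so the 8 GiB reservation is silently given to the agents.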

Labels

  • comp:agent: Related to Agent component
  • comp:common: Related to Common component
  • size:XL: 500~ LoC

3 participants