Skip to content

[BREAKING][single_controller] refactor: remove Ray leakage from base interface#5463

Draft
JenniferWang wants to merge 2 commits intoverl-project:mainfrom
JenniferWang:single-controller-no-rename
Draft

[BREAKING][single_controller] refactor: remove Ray leakage from base interface#5463
JenniferWang wants to merge 2 commits intoverl-project:mainfrom
JenniferWang:single-controller-no-rename

Conversation

@JenniferWang
Copy link

@JenniferWang JenniferWang commented Mar 2, 2026

What does this PR do?

Removes Ray-specific concepts from verl/single_controller/base/ so the base interface is backend-agnostic. Previously Worker called ray.util.get_node_ip_address(), ResourcePoolManager was a Ray-coupled dataclass, and decorator.py did inline import ray throughout. This PR establishes a clean abstraction boundary and introduces an explicit backend extension model. Additionally, the ray_trainer.py is now no longer tied to Ray -- it is backend injectable.

Test

CPU tests:

uv run pytest -s -x tests/single_controller/test_ray_class_with_init_args_on_cpu.py

Main GRPO training can run e2e (verl.trainer.main_ppo)

Design & Code Changes

Backend Extension Pattern

User-defined worker classes only inherit from the base Worker class which is backend agnostic.

class MyActorWorker(Worker):
    ...

Backend-specific logic is handled via a Mixin injected automatically by the backend's worker group. When a caller uses RayWorkerGroup, RayClassWithInitArgs.__init__ dynamically injects RayWorkerMixin into any plain Worker subclass before wrapping it with ray.remote. RayWorkerMixin provides the Ray-specific implementations of _get_node_ip() and the NOSET device setup logic that previously lived in the base Worker class.

base/worker.py: WorkerHelper._get_node_ip made abstract; Ray-specific NOSET device logic removed from _setup_env_cuda_visible_devices.

base/worker_group.py: WorkerGroup gains explicit extension points (cls_with_init_args, create_colocated_worker_cls, get, worker_group_kwargs, backend). ResourcePoolManager becomes a pure ABC with three abstract methods — no fields, no Ray imports.

base/decorator.py: parallel_put gated behind worker_group.backend == "ray".

ray/base.py: RayWorkerMixin introduced and auto-injected by RayClassWithInitArgs. RayResourcePoolManager implements the ABC. RayWorkerGroup implements all WorkerGroup extension points.

ray_trainer.py: RayPPOTrainer renamed to PPOTrainer; RayPPOTrainer = PPOTrainer kept as a backwards-compatibility alias. The trainer is effectively backend injectable.

API Changes

Before — user classes had to be wrapped at the call site:

role_worker_mapping = {
    Role.ActorRollout: ray.remote(ActorRolloutRefWorker),
    Role.Critic: ray.remote(CriticWorker),
}
ray_cls = RayClassWithInitArgs(cls=ray.remote(MyWorker), ...)

After — pass plain classes; the backend handles wrapping:

role_worker_mapping = {
    Role.ActorRollout: ActorRolloutRefWorker,
    Role.Critic: CriticWorker,
}
ray_cls = RayClassWithInitArgs(cls=MyWorker, ...)

BREAKING: RayClassWithInitArgs now raises ValueError if passed an already-@ray.remote-decorated class. This is to simplify the codebase and there's no safe way to silently stripping the decoration and preserve the options (e.g. num_gpus, max_restarts) set on it.

Deprecation shims preserved:

  • PPOTrainer is the new name; RayPPOTrainer = PPOTrainer kept as alias
  • RayWorkerGroup.__init__ accepts deprecated ray_cls_with_init kwarg
  • ResourcePoolManager alias in ray/__init__.py -> RayResourcePoolManager

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant
Copy link

CLAassistant commented Mar 2, 2026

CLA assistant check
All committers have signed the CLA.

@JenniferWang JenniferWang marked this pull request as draft March 2, 2026 21:17
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a significant and well-executed refactoring to remove Ray-specific logic from the base single-controller interfaces, making them backend-agnostic. The introduction of a RayWorkerMixin and an explicit backend extension model in WorkerGroup is a clean approach to separating concerns. The changes are extensive and consistently applied across the codebase.

My review has identified a couple of issues: one critical issue regarding a breaking API change that will cause a TypeError, and one high-severity issue related to confusing deprecation handling in a core component. Addressing these will ensure the refactoring is robust and maintainable.

Note: Security Review is unavailable for this PR.

@JenniferWang JenniferWang force-pushed the single-controller-no-rename branch from 89b587e to 0c3e73f Compare March 2, 2026 21:58
@JenniferWang
Copy link
Author

/gemini review

@JenniferWang JenniferWang marked this pull request as ready for review March 2, 2026 22:05
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a significant and well-executed refactoring to decouple the base single-controller interface from Ray-specific implementations, making it backend-agnostic. The introduction of a mixin-based extension model, abstract base classes for core components, and the migration of Ray-specific logic are excellent design choices. The changes are consistent and the new tests for the mixin injection are thorough. I've identified two main issues: a critical bug in the dynamic class creation that could break serialization, and a leaky abstraction in the base decorator that undermines the goal of being fully backend-agnostic. Addressing these will make the new abstraction more robust and maintainable.

…single controller interface

### What does this PR do?

Removes Ray-specific concepts from `verl/single_controller/base/` so the base
interface is backend-agnostic. Previously `Worker` called
`ray.util.get_node_ip_address()`, `ResourcePoolManager` was a Ray-coupled
dataclass, and `decorator.py` did inline `import ray` throughout. This PR
establishes a clean abstraction boundary and introduces an explicit backend
extension model.

### Test

GPU tests (single node, ran on dev server):
```
uv run pytest -s -x \
  tests/single_controller/test_colocated_workers.py \
  tests/single_controller/test_colocated_workers_fused.py \
  tests/single_controller/test_data_transfer.py \
  tests/single_controller/test_device_mesh_register.py \
  tests/single_controller/test_driverfunc_to_worker.py \
  tests/single_controller/test_high_level_scheduling_api.py \
  tests/single_controller/test_nested_worker.py \
  tests/single_controller/test_ray_collectives.py \
  tests/single_controller/test_split_resource_pool.py \
  tests/single_controller/test_worker_group_basics.py \
  tests/single_controller/test_worker_group_torch.py
```

CPU tests:
```
uv run pytest -s -x tests/single_controller/test_ray_class_with_init_args_on_cpu.py
```

All passed.

### API and Usage Example

**Before** — user classes had to be wrapped at the call site:

```python
role_worker_mapping = {
    Role.ActorRollout: ray.remote(ActorRolloutRefWorker),
    Role.Critic: ray.remote(CriticWorker),
}
ray_cls = RayClassWithInitArgs(cls=ray.remote(MyWorker), ...)
```

**After** — pass plain classes; the backend handles wrapping:

```python
role_worker_mapping = {
    Role.ActorRollout: ActorRolloutRefWorker,
    Role.Critic: CriticWorker,
}
ray_cls = RayClassWithInitArgs(cls=MyWorker, ...)
```

**BREAKING**: `RayClassWithInitArgs` now raises `ValueError` if passed an
already-`@ray.remote`-decorated class. Silently stripping the decoration
would discard any options (e.g. `num_gpus`, `max_restarts`) set on it.

Deprecation shims preserved:
- `PPOTrainer` is the new name; `RayPPOTrainer = PPOTrainer` kept as alias
- `RayWorkerGroup.__init__` accepts deprecated `ray_cls_with_init` kwarg
- `ResourcePoolManager` alias in `ray/__init__.py` -> `RayResourcePoolManager`

### Design & Code Changes

**Pattern introduced**

User-defined worker classes now only inherit from the base `Worker`:

```python
class MyActorWorker(Worker):
    ...
```

Backend-specific logic is handled via a Mixin injected automatically by the
backend's worker group. When a caller uses `RayWorkerGroup`,
`RayClassWithInitArgs.__init__` dynamically injects `RayWorkerMixin` into any
plain `Worker` subclass before wrapping it with `ray.remote`. `RayWorkerMixin`
provides the Ray-specific implementations of `_get_node_ip()` and the NOSET
device setup logic that previously lived in the base `Worker` class.

**`base/worker.py`**: `WorkerHelper._get_node_ip` made abstract; Ray-specific
NOSET device logic removed from `_setup_env_cuda_visible_devices`.

**`base/worker_group.py`**: `WorkerGroup` gains explicit extension points
(`cls_with_init_args`, `create_colocated_worker_cls`, `get`,
`worker_group_kwargs`, `backend`). `ResourcePoolManager` becomes a pure ABC
with three abstract methods — no fields, no Ray imports.

**`base/decorator.py`**: `parallel_put` gated behind
`worker_group.backend == "ray"`.

**`ray/base.py`**: `RayWorkerMixin` introduced and auto-injected by
`RayClassWithInitArgs`. `RayResourcePoolManager` implements the ABC.
`RayWorkerGroup` implements all `WorkerGroup` extension points.

**`ray_trainer.py`**: `RayPPOTrainer` renamed to `PPOTrainer`;
`RayPPOTrainer = PPOTrainer` kept as a backwards-compatibility alias.
@JenniferWang JenniferWang force-pushed the single-controller-no-rename branch from 38f0100 to f1a54c4 Compare March 2, 2026 22:12
@JenniferWang JenniferWang marked this pull request as draft March 6, 2026 02:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants