[BREAKING][single_controller] refactor: remove Ray leakage from base interface#5463
[BREAKING][single_controller] refactor: remove Ray leakage from base interface#5463JenniferWang wants to merge 2 commits intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request is a significant and well-executed refactoring to remove Ray-specific logic from the base single-controller interfaces, making them backend-agnostic. The introduction of a RayWorkerMixin and an explicit backend extension model in WorkerGroup is a clean approach to separating concerns. The changes are extensive and consistently applied across the codebase.
My review has identified a couple of issues: one critical issue regarding a breaking API change that will cause a TypeError, and one high-severity issue related to confusing deprecation handling in a core component. Addressing these will ensure the refactoring is robust and maintainable.
Note: Security Review is unavailable for this PR.
89b587e to
0c3e73f
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request is a significant and well-executed refactoring to decouple the base single-controller interface from Ray-specific implementations, making it backend-agnostic. The introduction of a mixin-based extension model, abstract base classes for core components, and the migration of Ray-specific logic are excellent design choices. The changes are consistent and the new tests for the mixin injection are thorough. I've identified two main issues: a critical bug in the dynamic class creation that could break serialization, and a leaky abstraction in the base decorator that undermines the goal of being fully backend-agnostic. Addressing these will make the new abstraction more robust and maintainable.
…single controller interface
### What does this PR do?
Removes Ray-specific concepts from `verl/single_controller/base/` so the base
interface is backend-agnostic. Previously `Worker` called
`ray.util.get_node_ip_address()`, `ResourcePoolManager` was a Ray-coupled
dataclass, and `decorator.py` did inline `import ray` throughout. This PR
establishes a clean abstraction boundary and introduces an explicit backend
extension model.
### Test
GPU tests (single node, ran on dev server):
```
uv run pytest -s -x \
tests/single_controller/test_colocated_workers.py \
tests/single_controller/test_colocated_workers_fused.py \
tests/single_controller/test_data_transfer.py \
tests/single_controller/test_device_mesh_register.py \
tests/single_controller/test_driverfunc_to_worker.py \
tests/single_controller/test_high_level_scheduling_api.py \
tests/single_controller/test_nested_worker.py \
tests/single_controller/test_ray_collectives.py \
tests/single_controller/test_split_resource_pool.py \
tests/single_controller/test_worker_group_basics.py \
tests/single_controller/test_worker_group_torch.py
```
CPU tests:
```
uv run pytest -s -x tests/single_controller/test_ray_class_with_init_args_on_cpu.py
```
All passed.
### API and Usage Example
**Before** — user classes had to be wrapped at the call site:
```python
role_worker_mapping = {
Role.ActorRollout: ray.remote(ActorRolloutRefWorker),
Role.Critic: ray.remote(CriticWorker),
}
ray_cls = RayClassWithInitArgs(cls=ray.remote(MyWorker), ...)
```
**After** — pass plain classes; the backend handles wrapping:
```python
role_worker_mapping = {
Role.ActorRollout: ActorRolloutRefWorker,
Role.Critic: CriticWorker,
}
ray_cls = RayClassWithInitArgs(cls=MyWorker, ...)
```
**BREAKING**: `RayClassWithInitArgs` now raises `ValueError` if passed an
already-`@ray.remote`-decorated class. Silently stripping the decoration
would discard any options (e.g. `num_gpus`, `max_restarts`) set on it.
Deprecation shims preserved:
- `PPOTrainer` is the new name; `RayPPOTrainer = PPOTrainer` kept as alias
- `RayWorkerGroup.__init__` accepts deprecated `ray_cls_with_init` kwarg
- `ResourcePoolManager` alias in `ray/__init__.py` -> `RayResourcePoolManager`
### Design & Code Changes
**Pattern introduced**
User-defined worker classes now only inherit from the base `Worker`:
```python
class MyActorWorker(Worker):
...
```
Backend-specific logic is handled via a Mixin injected automatically by the
backend's worker group. When a caller uses `RayWorkerGroup`,
`RayClassWithInitArgs.__init__` dynamically injects `RayWorkerMixin` into any
plain `Worker` subclass before wrapping it with `ray.remote`. `RayWorkerMixin`
provides the Ray-specific implementations of `_get_node_ip()` and the NOSET
device setup logic that previously lived in the base `Worker` class.
**`base/worker.py`**: `WorkerHelper._get_node_ip` made abstract; Ray-specific
NOSET device logic removed from `_setup_env_cuda_visible_devices`.
**`base/worker_group.py`**: `WorkerGroup` gains explicit extension points
(`cls_with_init_args`, `create_colocated_worker_cls`, `get`,
`worker_group_kwargs`, `backend`). `ResourcePoolManager` becomes a pure ABC
with three abstract methods — no fields, no Ray imports.
**`base/decorator.py`**: `parallel_put` gated behind
`worker_group.backend == "ray"`.
**`ray/base.py`**: `RayWorkerMixin` introduced and auto-injected by
`RayClassWithInitArgs`. `RayResourcePoolManager` implements the ABC.
`RayWorkerGroup` implements all `WorkerGroup` extension points.
**`ray_trainer.py`**: `RayPPOTrainer` renamed to `PPOTrainer`;
`RayPPOTrainer = PPOTrainer` kept as a backwards-compatibility alias.
38f0100 to
f1a54c4
Compare
What does this PR do?
Removes Ray-specific concepts from
verl/single_controller/base/so the base interface is backend-agnostic. PreviouslyWorkercalledray.util.get_node_ip_address(),ResourcePoolManagerwas a Ray-coupled dataclass, anddecorator.pydid inlineimport raythroughout. This PR establishes a clean abstraction boundary and introduces an explicit backend extension model. Additionally, theray_trainer.pyis now no longer tied to Ray -- it is backend injectable.Test
CPU tests:
Main GRPO training can run e2e (
verl.trainer.main_ppo)Design & Code Changes
Backend Extension Pattern
User-defined worker classes only inherit from the base
Workerclass which is backend agnostic.Backend-specific logic is handled via a Mixin injected automatically by the backend's worker group. When a caller uses
RayWorkerGroup,RayClassWithInitArgs.__init__dynamically injectsRayWorkerMixininto any plainWorkersubclass before wrapping it withray.remote.RayWorkerMixinprovides the Ray-specific implementations of_get_node_ip()and the NOSET device setup logic that previously lived in the baseWorkerclass.base/worker.py:WorkerHelper._get_node_ipmade abstract; Ray-specific NOSET device logic removed from_setup_env_cuda_visible_devices.base/worker_group.py:WorkerGroupgains explicit extension points (cls_with_init_args,create_colocated_worker_cls,get,worker_group_kwargs,backend).ResourcePoolManagerbecomes a pure ABC with three abstract methods — no fields, no Ray imports.base/decorator.py:parallel_putgated behindworker_group.backend == "ray".ray/base.py:RayWorkerMixinintroduced and auto-injected byRayClassWithInitArgs.RayResourcePoolManagerimplements the ABC.RayWorkerGroupimplements allWorkerGroupextension points.ray_trainer.py:RayPPOTrainerrenamed toPPOTrainer;RayPPOTrainer = PPOTrainerkept as a backwards-compatibility alias. The trainer is effectively backend injectable.API Changes
Before — user classes had to be wrapped at the call site:
After — pass plain classes; the backend handles wrapping:
BREAKING:
RayClassWithInitArgsnow raisesValueErrorif passed an already-@ray.remote-decorated class. This is to simplify the codebase and there's no safe way to silently stripping the decoration and preserve the options (e.g.num_gpus,max_restarts) set on it.Deprecation shims preserved:
PPOTraineris the new name;RayPPOTrainer = PPOTrainerkept as aliasRayWorkerGroup.__init__accepts deprecatedray_cls_with_initkwargResourcePoolManageralias inray/__init__.py->RayResourcePoolManagerChecklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.