[BREAKING][single_controller] refactor: remove Ray leakage from base interface by JenniferWang · Pull Request #5463 · verl-project/verl

JenniferWang · 2026-03-02T21:17:27Z

What does this PR do?

Removes Ray-specific concepts from verl/single_controller/base/ so the base interface is backend-agnostic. Previously Worker called ray.util.get_node_ip_address(), ResourcePoolManager was a Ray-coupled dataclass, and decorator.py did inline import ray throughout. This PR establishes a clean abstraction boundary and introduces an explicit backend extension model. Additionally, the ray_trainer.py is now no longer tied to Ray -- it is backend injectable.

Test

CPU tests:

uv run pytest -s -x tests/single_controller/test_ray_class_with_init_args_on_cpu.py

Main GRPO training can run e2e (verl.trainer.main_ppo)

Design & Code Changes

Backend Extension Pattern

User-defined worker classes only inherit from the base Worker class which is backend agnostic.

class MyActorWorker(Worker):
    ...

Backend-specific logic is handled via a Mixin injected automatically by the backend's worker group. When a caller uses RayWorkerGroup, RayClassWithInitArgs.__init__ dynamically injects RayWorkerMixin into any plain Worker subclass before wrapping it with ray.remote. RayWorkerMixin provides the Ray-specific implementations of _get_node_ip() and the NOSET device setup logic that previously lived in the base Worker class.

base/worker.py: WorkerHelper._get_node_ip made abstract; Ray-specific NOSET device logic removed from _setup_env_cuda_visible_devices.

base/worker_group.py: WorkerGroup gains explicit extension points (cls_with_init_args, create_colocated_worker_cls, get, worker_group_kwargs, backend). ResourcePoolManager becomes a pure ABC with three abstract methods — no fields, no Ray imports.

base/decorator.py: parallel_put gated behind worker_group.backend == "ray".

ray/base.py: RayWorkerMixin introduced and auto-injected by RayClassWithInitArgs. RayResourcePoolManager implements the ABC. RayWorkerGroup implements all WorkerGroup extension points.

ray_trainer.py: RayPPOTrainer renamed to PPOTrainer; RayPPOTrainer = PPOTrainer kept as a backwards-compatibility alias. The trainer is effectively backend injectable.

API Changes

Before — user classes had to be wrapped at the call site:

role_worker_mapping = {
    Role.ActorRollout: ray.remote(ActorRolloutRefWorker),
    Role.Critic: ray.remote(CriticWorker),
}
ray_cls = RayClassWithInitArgs(cls=ray.remote(MyWorker), ...)

After — pass plain classes; the backend handles wrapping:

role_worker_mapping = {
    Role.ActorRollout: ActorRolloutRefWorker,
    Role.Critic: CriticWorker,
}
ray_cls = RayClassWithInitArgs(cls=MyWorker, ...)

BREAKING: RayClassWithInitArgs now raises ValueError if passed an already-@ray.remote-decorated class. This is to simplify the codebase and there's no safe way to silently stripping the decoration and preserve the options (e.g. num_gpus, max_restarts) set on it.

Deprecation shims preserved:

PPOTrainer is the new name; RayPPOTrainer = PPOTrainer kept as alias
RayWorkerGroup.__init__ accepts deprecated ray_cls_with_init kwarg
ResourcePoolManager alias in ray/__init__.py -> RayResourcePoolManager

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.

CLAassistant · 2026-03-02T21:17:38Z

All committers have signed the CLA.

gemini-code-assist

Code Review

This pull request is a significant and well-executed refactoring to remove Ray-specific logic from the base single-controller interfaces, making them backend-agnostic. The introduction of a RayWorkerMixin and an explicit backend extension model in WorkerGroup is a clean approach to separating concerns. The changes are extensive and consistently applied across the codebase.

My review has identified a couple of issues: one critical issue regarding a breaking API change that will cause a TypeError, and one high-severity issue related to confusing deprecation handling in a core component. Addressing these will ensure the refactoring is robust and maintainable.

_{Note: Security Review is unavailable for this PR.}

verl/trainer/ppo/ray_trainer.py

verl/single_controller/ray/base.py

JenniferWang · 2026-03-02T22:03:16Z

/gemini review

gemini-code-assist

Code Review

This pull request is a significant and well-executed refactoring to decouple the base single-controller interface from Ray-specific implementations, making it backend-agnostic. The introduction of a mixin-based extension model, abstract base classes for core components, and the migration of Ray-specific logic are excellent design choices. The changes are consistent and the new tests for the mixin injection are thorough. I've identified two main issues: a critical bug in the dynamic class creation that could break serialization, and a leaky abstraction in the base decorator that undermines the goal of being fully backend-agnostic. Addressing these will make the new abstraction more robust and maintainable.

verl/single_controller/ray/base.py

verl/single_controller/base/decorator.py

…single controller interface ### What does this PR do? Removes Ray-specific concepts from `verl/single_controller/base/` so the base interface is backend-agnostic. Previously `Worker` called `ray.util.get_node_ip_address()`, `ResourcePoolManager` was a Ray-coupled dataclass, and `decorator.py` did inline `import ray` throughout. This PR establishes a clean abstraction boundary and introduces an explicit backend extension model. ### Test GPU tests (single node, ran on dev server): ``` uv run pytest -s -x \ tests/single_controller/test_colocated_workers.py \ tests/single_controller/test_colocated_workers_fused.py \ tests/single_controller/test_data_transfer.py \ tests/single_controller/test_device_mesh_register.py \ tests/single_controller/test_driverfunc_to_worker.py \ tests/single_controller/test_high_level_scheduling_api.py \ tests/single_controller/test_nested_worker.py \ tests/single_controller/test_ray_collectives.py \ tests/single_controller/test_split_resource_pool.py \ tests/single_controller/test_worker_group_basics.py \ tests/single_controller/test_worker_group_torch.py ``` CPU tests: ``` uv run pytest -s -x tests/single_controller/test_ray_class_with_init_args_on_cpu.py ``` All passed. ### API and Usage Example **Before** — user classes had to be wrapped at the call site: ```python role_worker_mapping = { Role.ActorRollout: ray.remote(ActorRolloutRefWorker), Role.Critic: ray.remote(CriticWorker), } ray_cls = RayClassWithInitArgs(cls=ray.remote(MyWorker), ...) ``` **After** — pass plain classes; the backend handles wrapping: ```python role_worker_mapping = { Role.ActorRollout: ActorRolloutRefWorker, Role.Critic: CriticWorker, } ray_cls = RayClassWithInitArgs(cls=MyWorker, ...) ``` **BREAKING**: `RayClassWithInitArgs` now raises `ValueError` if passed an already-`@ray.remote`-decorated class. Silently stripping the decoration would discard any options (e.g. `num_gpus`, `max_restarts`) set on it. Deprecation shims preserved: - `PPOTrainer` is the new name; `RayPPOTrainer = PPOTrainer` kept as alias - `RayWorkerGroup.__init__` accepts deprecated `ray_cls_with_init` kwarg - `ResourcePoolManager` alias in `ray/__init__.py` -> `RayResourcePoolManager` ### Design & Code Changes **Pattern introduced** User-defined worker classes now only inherit from the base `Worker`: ```python class MyActorWorker(Worker): ... ``` Backend-specific logic is handled via a Mixin injected automatically by the backend's worker group. When a caller uses `RayWorkerGroup`, `RayClassWithInitArgs.__init__` dynamically injects `RayWorkerMixin` into any plain `Worker` subclass before wrapping it with `ray.remote`. `RayWorkerMixin` provides the Ray-specific implementations of `_get_node_ip()` and the NOSET device setup logic that previously lived in the base `Worker` class. **`base/worker.py`**: `WorkerHelper._get_node_ip` made abstract; Ray-specific NOSET device logic removed from `_setup_env_cuda_visible_devices`. **`base/worker_group.py`**: `WorkerGroup` gains explicit extension points (`cls_with_init_args`, `create_colocated_worker_cls`, `get`, `worker_group_kwargs`, `backend`). `ResourcePoolManager` becomes a pure ABC with three abstract methods — no fields, no Ray imports. **`base/decorator.py`**: `parallel_put` gated behind `worker_group.backend == "ray"`. **`ray/base.py`**: `RayWorkerMixin` introduced and auto-injected by `RayClassWithInitArgs`. `RayResourcePoolManager` implements the ABC. `RayWorkerGroup` implements all `WorkerGroup` extension points. **`ray_trainer.py`**: `RayPPOTrainer` renamed to `PPOTrainer`; `RayPPOTrainer = PPOTrainer` kept as a backwards-compatibility alias.

JenniferWang requested review from ArronHZG, PeterSH6, eric-haibin-lin, tongyx361, vermouth1992, wuxibin89 and zw0610 as code owners March 2, 2026 21:17

JenniferWang marked this pull request as draft March 2, 2026 21:17

gemini-code-assist bot reviewed Mar 2, 2026

View reviewed changes

verl/trainer/ppo/ray_trainer.py Show resolved Hide resolved

verl/single_controller/ray/base.py Outdated Show resolved Hide resolved

JenniferWang force-pushed the single-controller-no-rename branch from 89b587e to 0c3e73f Compare March 2, 2026 21:58

JenniferWang marked this pull request as ready for review March 2, 2026 22:05

gemini-code-assist bot reviewed Mar 2, 2026

View reviewed changes

verl/single_controller/ray/base.py Outdated Show resolved Hide resolved

verl/single_controller/base/decorator.py Show resolved Hide resolved

JenniferWang force-pushed the single-controller-no-rename branch from 38f0100 to f1a54c4 Compare March 2, 2026 22:12

trigger review

87b1806

JenniferWang marked this pull request as draft March 6, 2026 02:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BREAKING][single_controller] refactor: remove Ray leakage from base interface#5463

[BREAKING][single_controller] refactor: remove Ray leakage from base interface#5463
JenniferWang wants to merge 2 commits intoverl-project:mainfrom
JenniferWang:single-controller-no-rename

JenniferWang commented Mar 2, 2026 •

edited

Loading

Uh oh!

CLAassistant commented Mar 2, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

JenniferWang commented Mar 2, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

JenniferWang commented Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Test

Design & Code Changes

API Changes

Checklist Before Submitting

Uh oh!

CLAassistant commented Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

JenniferWang commented Mar 2, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JenniferWang commented Mar 2, 2026 •

edited

Loading

CLAassistant commented Mar 2, 2026 •

edited

Loading