[Fix][Feat] Fix worker sorting with external pg bundles & Support persistent buffer for update_params #4397
Conversation
Pull request overview
This PR improves RLHF/distributed weight-update workflows by (1) preserving Ray worker/bundle ordering when users provide external placement group bundle indices and (2) reducing IPC cloning overhead for frequent update_params calls via a persistent IPC tensor/event, supported by a preallocated flattened-tensor bucket path.
Changes:
- Ray: skip IP-based worker sorting and preserve user-provided bundle index order when `LMDEPLOY_RAY_EXTERNAL_PG_BUNDLES` is set.
- PyTorch agent: add persistent IPC tensor/event handling to avoid per-update cloning when possible.
- Utils: extend `FlattenedTensorBucket` and its serialization to support optional/preallocated flattened buffers (zero/low-copy paths).
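For readers unfamiliar with the flattened-bucket idea, here is a minimal, self-contained sketch of the concept; plain Python lists stand in for tensors, and the names `FlattenedMeta`, `flatten_named`, and `reconstruct` are illustrative, not lmdeploy's actual API:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class FlattenedMeta:
    """Where one named tensor lives inside the flat buffer."""
    name: str
    start: int
    numel: int

def flatten_named(named: List[Tuple[str, List[float]]]) -> Tuple[List[float], List[FlattenedMeta]]:
    """Concatenate tensors into one flat buffer and record per-tensor offsets."""
    flat: List[float] = []
    metas: List[FlattenedMeta] = []
    for name, values in named:
        metas.append(FlattenedMeta(name, start=len(flat), numel=len(values)))
        flat.extend(values)
    return flat, metas

def reconstruct(flat: List[float], metas: List[FlattenedMeta]) -> Dict[str, List[float]]:
    """Slice each tensor back out of the flat buffer (views, in the real code)."""
    return {m.name: flat[m.start:m.start + m.numel] for m in metas}

flat, metas = flatten_named([('w', [1.0, 2.0]), ('b', [3.0])])
assert reconstruct(flat, metas) == {'w': [1.0, 2.0], 'b': [3.0]}
```

Shipping only the metadata plus one persistent flat buffer (shared via CUDA IPC) is what lets frequent `update_params` calls skip per-call cloning.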
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `lmdeploy/utils.py` | Makes flattened bucket serialization more flexible and adds a preallocated buffer path for concatenation. |
| `lmdeploy/pytorch/engine/model_agent/agent.py` | Adds persistent IPC tensor/event support in `update_params` and cleanup during sleep/finalization. |
| `lmdeploy/pytorch/engine/executor/ray_executor.py` | Adjusts worker sorting and bundle index selection for externally provided Ray PG bundle indices. |
Comments suppressed due to low confidence (1)
lmdeploy/pytorch/engine/model_agent/agent.py:1183
- The comment says `request.serialized_named_tensors is now a dict ...`, but the code still treats `request.serialized_named_tensors` as base64-encoded bytes, and the deserialized object (`weights`) is the dict. Please reword this to avoid misleading API users (especially since `UpdateParamsRequest.serialized_named_tensors` is typed as `Union[str, List[str], Dict]`).
```python
# request.serialized_named_tensors is now a dict with following keys:
# - metadata: List[FlattenedTensorMetadata]
# - flattened_tensor: the flattened tensor for weights, optional
# - event_ipc_handle: the ipc handle of the event
#   that used to sync stream across processes, optional
serialized_data = request.serialized_named_tensors
if isinstance(serialized_data, list):
    serialized_data = serialized_data[self.dist_ctx.tp_group.rank]
model = self.patched_model.get_model()
weights = ForkingPickler.loads(pybase64.b64decode(serialized_data))
```
lmdeploy/utils.py (Outdated)
```python
assert flattened_tensor.numel() >= current_idx, \
    'Provided flattened tensor numel is smaller than ' + \
    f'required numel: {flattened_tensor.numel()} < {current_idx}'
assert sum([t.numel() for t in flattened_tensor_list]) == current_idx
```
These new branches add non-trivial behavior (optional flattened_tensor in serialization and the preallocated-buffer out= cat path), but there are no unit tests covering them. Since this module already has tests (tests/test_lmdeploy/test_utils.py), adding focused tests around this initialization path would help prevent regressions.
Suggested change:

```python
assert sum([t.numel() for t in flattened_tensor_list]) == current_idx
# Validate that the provided preallocated buffer is compatible with inputs.
# All named_tensors are already verified to share the same dtype.
first_tensor = named_tensors[0][1] if named_tensors else None
if first_tensor is not None:
    if flattened_tensor.dtype != first_tensor.dtype:
        raise ValueError(
            f'flattened_tensor dtype {flattened_tensor.dtype} does not match '
            f'input tensors dtype {first_tensor.dtype}')
    if flattened_tensor.device != first_tensor.device:
        raise ValueError(
            f'flattened_tensor device {flattened_tensor.device} does not match '
            f'input tensors device {first_tensor.device}')
    if not flattened_tensor.is_contiguous():
        raise ValueError('flattened_tensor must be contiguous when used as an output buffer')
```
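A focused test along the lines the review suggests could look like this sketch. A minimal stand-in type replaces `torch.Tensor` so the validation logic is shown in isolation; `FakeTensor` and `check_prealloc_buffer` are illustrative names, not part of lmdeploy or its test suite:

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    """Minimal stand-in for torch.Tensor: just the attributes the checks need."""
    numel_: int
    dtype: str = 'float16'
    device: str = 'cuda:0'
    contiguous: bool = True

    def numel(self) -> int:
        return self.numel_

def check_prealloc_buffer(buf: FakeTensor, ref: FakeTensor, required_numel: int) -> None:
    """Mirror of the suggested ValueError-based checks for a preallocated buffer."""
    if buf.numel() < required_numel:
        raise ValueError(f'buffer too small: {buf.numel()} < {required_numel}')
    if buf.dtype != ref.dtype:
        raise ValueError(f'dtype mismatch: {buf.dtype} != {ref.dtype}')
    if buf.device != ref.device:
        raise ValueError(f'device mismatch: {buf.device} != {ref.device}')
    if not buf.contiguous:
        raise ValueError('buffer must be contiguous')

ref = FakeTensor(numel_=4)
check_prealloc_buffer(FakeTensor(numel_=8), ref, required_numel=4)  # compatible: no error

try:
    check_prealloc_buffer(FakeTensor(numel_=2), ref, required_numel=4)
except ValueError as err:
    assert 'too small' in str(err)
```

In a real test against `FlattenedTensorBucket`, the same cases (undersized, wrong-dtype, wrong-device, non-contiguous buffers) would each be asserted to raise with a clear message.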
```python
if not _envs.ray_external_pg_bundles:
    for bundle_id, bundle in enumerate(placement_group.bundle_specs):
        if bundle.get(device_str, 0):
            bundle_indices.append(bundle_id)
else:
    # use external specified bundle indices,keep the order as well
    bundle_indices = _envs.ray_external_pg_bundles.copy()
attn_tp = self.dist_config.attn_tp
bundle_indices = bundle_indices[:attn_tp]
```
When _envs.ray_external_pg_bundles is set, bundle_indices is taken verbatim with no validation. It would be safer to validate indices are in-range for placement_group.bundle_specs, that each selected bundle actually has the required device_str resource, and that len(bundle_indices) >= attn_tp (or raise a clear error). Otherwise Ray may fail later with a less actionable scheduling error or create fewer workers than required.
```python
# use external specified bundle indices,keep the order as well
bundle_indices = _envs.ray_external_pg_bundles.copy()
```
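One way to add the validation suggested above, sketched with plain Python data (`bundle_specs` as a list of resource dicts; `validate_external_bundles` is a hypothetical helper, not existing lmdeploy code):

```python
from typing import Dict, List

def validate_external_bundles(bundle_indices: List[int],
                              bundle_specs: List[Dict[str, float]],
                              device_str: str,
                              attn_tp: int) -> List[int]:
    """Check user-supplied placement-group bundle indices before handing them to Ray."""
    for idx in bundle_indices:
        if not 0 <= idx < len(bundle_specs):
            raise ValueError(f'bundle index {idx} out of range [0, {len(bundle_specs)})')
        if not bundle_specs[idx].get(device_str, 0):
            raise ValueError(f'bundle {idx} has no "{device_str}" resource')
    if len(bundle_indices) < attn_tp:
        raise ValueError(f'need at least attn_tp={attn_tp} bundle indices, '
                         f'got {len(bundle_indices)}')
    # Preserve the user-provided order, truncated to attn_tp entries.
    return bundle_indices[:attn_tp]

specs = [{'GPU': 1}, {'GPU': 1}, {'CPU': 1}]
assert validate_external_bundles([1, 0], specs, 'GPU', attn_tp=2) == [1, 0]
```

Failing early here turns an opaque Ray scheduling error into a message that names the bad index or the missing resource.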
PR description mentions an env var RAY_EXTERNAL_PG_BUNDLES, but the code reads LMDEPLOY_RAY_EXTERNAL_PG_BUNDLES via _envs.ray_external_pg_bundles. Please align the PR description / documentation (or add an alias) to avoid users setting the wrong variable and not getting the intended behavior.
@copilot I updated the PR description, please review it again.
```python
if 'flattened_tensor' in weights:
    self._update_params_ipc_tensor = _construct(weights['flattened_tensor'],
                                                require_clone=weights.get('require_clone', True))
if 'event_ipc_handle' in weights and hasattr(torch.cuda.Event, 'from_ipc_handle'):
    self._update_params_ipc_event = torch.cuda.Event.from_ipc_handle(
        device=torch.cuda.current_device(),
        handle=weights['event_ipc_handle'],
    )
flattened_tensor: torch.Tensor = self._update_params_ipc_tensor
if self._update_params_ipc_event is not None:
    self._update_params_ipc_event.wait()
bucket = FlattenedTensorBucket(flattened_tensor=flattened_tensor, metadata=metadata)
weights = bucket.reconstruct_tensors()
```
When request.load_format == 'flattened_bucket' and metadata is non-empty, this code will pass flattened_tensor=None into FlattenedTensorBucket if (a) the request omits flattened_tensor and (b) _update_params_ipc_tensor has not been initialized from a prior call. That will raise in FlattenedTensorBucket.__init__ and makes the “flattened_tensor optional” behavior fragile; consider explicitly erroring with a clear message unless a cached _update_params_ipc_tensor already exists (or require flattened_tensor on the first call).
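A fail-fast resolution step along the lines of this comment might look like the following sketch; `resolve_flattened_tensor` is a hypothetical helper, and plain objects stand in for `torch.Tensor`:

```python
from typing import Optional

def resolve_flattened_tensor(request_tensor: Optional[object],
                             cached_ipc_tensor: Optional[object]) -> object:
    """Prefer the tensor shipped in the request; otherwise fall back to the
    persistent IPC tensor cached by a previous update_params call."""
    if request_tensor is not None:
        return request_tensor
    if cached_ipc_tensor is not None:
        return cached_ipc_tensor
    raise RuntimeError(
        'flattened_bucket update received no flattened_tensor and no persistent '
        'IPC tensor is cached; the first update_params call must include '
        'flattened_tensor so the shared buffer can be established')

assert resolve_flattened_tensor('from_request', None) == 'from_request'
assert resolve_flattened_tensor(None, 'cached') == 'cached'
```

This turns the implicit `None` path into an explicit error that tells the caller exactly which precondition was violated.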
```python
if 'event_ipc_handle' in weights and hasattr(torch.cuda.Event, 'from_ipc_handle'):
    self._update_params_ipc_event = torch.cuda.Event.from_ipc_handle(
        device=torch.cuda.current_device(),
        handle=weights['event_ipc_handle'],
    )
flattened_tensor: torch.Tensor = self._update_params_ipc_tensor
if self._update_params_ipc_event is not None:
    self._update_params_ipc_event.wait()
```
If the producer supplies event_ipc_handle but the local PyTorch build lacks torch.cuda.Event.from_ipc_handle, this silently skips cross-process stream synchronization and proceeds to read from the IPC tensor. That can lead to stale/partially-written weights; consider failing fast (or forcing require_clone=True / torch.cuda.synchronize() fallback) when an event handle is provided but cannot be imported.
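The fallback policy suggested here can be sketched as a small decision function (all names are illustrative, and the "safe fallback" options follow the review's own suggestions rather than a verified lmdeploy behavior):

```python
from enum import Enum

class SyncStrategy(Enum):
    IPC_EVENT = 'ipc_event'  # wait on the producer's imported CUDA event
    FULL_SYNC = 'full_sync'  # conservative fallback suggested in the review
    NONE = 'none'            # no handle supplied, nothing to wait for

def pick_sync_strategy(has_event_handle: bool,
                       event_import_supported: bool,
                       allow_fallback: bool = True) -> SyncStrategy:
    """Decide how to synchronize before reading the shared IPC tensor."""
    if not has_event_handle:
        return SyncStrategy.NONE
    if event_import_supported:
        return SyncStrategy.IPC_EVENT
    if allow_fallback:
        # Cannot import the event on this PyTorch build: fall back to a
        # heavier synchronization (or force require_clone=True) per the review.
        return SyncStrategy.FULL_SYNC
    raise RuntimeError('event_ipc_handle provided but torch.cuda.Event.from_ipc_handle '
                       'is unavailable in this PyTorch build')

assert pick_sync_strategy(True, False) is SyncStrategy.FULL_SYNC
```

The key point is that "handle provided but unusable" is handled explicitly instead of silently degenerating into no synchronization at all.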
lmdeploy/utils.py (Outdated)
```python
assert len(flattened_tensor.shape) == 1, 'flattened_tensor must be 1-D tensor'
assert flattened_tensor.numel() >= current_idx, \
    'Provided flattened tensor numel is smaller than ' + \
    f'required numel: {flattened_tensor.numel()} < {current_idx}'
assert sum([t.numel() for t in flattened_tensor_list]) == current_idx
```
In the preallocated-buffer path, input validation uses assert and only checks 1-D and numel(). Since this is user-provided input (and asserts can be stripped with -O), it would be safer to raise ValueError with a clear message and also validate dtype/device (and ideally contiguity) match the source tensors to avoid cryptic torch.cat(..., out=...) runtime errors.
Suggested change (replacing the assert-based checks quoted above):

```python
# Validate user-provided preallocated buffer.
if flattened_tensor.dim() != 1:
    raise ValueError(
        f'flattened_tensor must be a 1-D tensor, but got shape {tuple(flattened_tensor.shape)}')
if flattened_tensor.numel() < current_idx:
    raise ValueError(
        'Provided flattened tensor numel is smaller than required numel: '
        f'{flattened_tensor.numel()} < {current_idx}')
# Ensure dtype and device are compatible with source tensors.
reference_tensor = named_tensors[0][1]
if flattened_tensor.dtype != reference_tensor.dtype:
    raise ValueError(
        f'flattened_tensor dtype {flattened_tensor.dtype} does not match source '
        f'tensors dtype {reference_tensor.dtype}')
if flattened_tensor.device != reference_tensor.device:
    raise ValueError(
        f'flattened_tensor device {flattened_tensor.device} does not match source '
        f'tensors device {reference_tensor.device}')
if not flattened_tensor.is_contiguous():
    raise ValueError('flattened_tensor must be contiguous')
total_numel = sum(t.numel() for t in flattened_tensor_list)
if total_numel != current_idx:
    raise ValueError(
        'Mismatch between computed and expected flattened size: '
        f'{total_numel} != {current_idx}')
```
…e require_clone when IPC event unavailable
Motivation
This PR addresses two issues related to distributed training parameter update in RLHF scenarios:
Worker Sorting Issue: When using external placement group bundle indices (e.g., in specific Ray cluster configurations), workers might be incorrectly sorted by IP address, which broke the intended worker-to-bundle mapping.

Memory Copy Overhead: The `update_params` interface used for RL training required cloning tensors for IPC (Inter-Process Communication) on every parameter update, causing unnecessary memory overhead and synchronization costs.

Modification
ray_executor.py:
- Update `_sort_workers` to skip IP-based sorting when external bundle indices are specified via the `LMDEPLOY_RAY_EXTERNAL_PG_BUNDLES` env var
- Adjust the `_valid_bundle_id` method

agent.py:
- Add a persistent IPC tensor (`_update_params_ipc_tensor`) and CUDA event (`_update_params_ipc_event`) for efficient parameter updates

utils.py:
- Extend `FlattenedTensorBucket` to support an optional pre-allocated flattened tensor buffer
- Make the `flattened_tensor` field optional in serialization to support zero-copy scenarios

BC-breaking (Optional)
No BC-breaking changes. The modifications are backward compatible.
Use cases (Optional)
This PR optimizes RL training workflows where:
Checklist