Pull request overview
This PR centralizes MROPE (multi-dimensional rotary position ids) handling into the scheduler/engine pipeline, adds a meta mechanism for generating correct dummy inputs (SSM/MROPE aware), and refactors cudagraph buffer handling to be model-agnostic.
Changes:
- Add `ModelConfig.use_mrope` and propagate it through `SequenceMeta`/scheduler history to generate and carry MROPE position ids end-to-end.
- Extend `make_dummy_inputs`/`ModelInputsStrategy.make_dummy` with `MakeDummyMeta` so warmup/cudagraph capture includes optional SSM + MROPE inputs.
- Remove per-model cudagraph/mrope meta update overrides in several Qwen/GLM model implementations in favor of shared infrastructure.
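The dummy-input meta described above could be sketched roughly as follows. This is a minimal illustration only: the field names and the `from_model_config` factory are assumptions for this sketch, not the exact lmdeploy API.

```python
from dataclasses import dataclass


@dataclass
class MakeDummyMeta:
    """Hypothetical sketch: flags telling dummy-input builders which
    optional tensors (SSM states, MROPE position ids) to allocate."""
    use_mrope: bool = False
    is_ssm: bool = False

    @classmethod
    def from_model_config(cls, model_config):
        # Derive the flags from the model config so that warmup and
        # cudagraph capture see the same optional inputs as a real step.
        return cls(
            use_mrope=getattr(model_config, 'use_mrope', False),
            is_ssm=getattr(model_config, 'is_ssm', False),
        )
```

Centralizing these flags in one meta object is what lets the per-model cudagraph overrides be deleted: the shared warmup path can decide which buffers to create without asking each model class.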
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| lmdeploy/pytorch/strategies/dllm/model_inputs.py | Thread MakeDummyMeta through DLLM dummy input creation. |
| lmdeploy/pytorch/strategies/base/model_inputs.py | Introduce MakeDummyMeta, add optional dummy fields for SSM/MROPE, and a factory method from ModelConfig. |
| lmdeploy/pytorch/strategies/ar_spec/sequence.py | Update sequence token update flow to also update MROPE history. |
| lmdeploy/pytorch/strategies/ar_spec/model_inputs.py | Thread MakeDummyMeta through AR-spec dummy input creation. |
| lmdeploy/pytorch/strategies/ar/sequence.py | Update sequence token update flow to also update MROPE history. |
| lmdeploy/pytorch/strategies/ar/model_inputs.py | Add MROPE propagation in merge/index_select paths and thread MakeDummyMeta into dummy creation. |
| lmdeploy/pytorch/strategies/ar/model_agent.py | Add MROPE propagation when building next-step decoding ModelInputs. |
| lmdeploy/pytorch/spec_decode/spec_agent.py | Cache dummy-meta and pass it into warmup dummy inputs. |
| lmdeploy/pytorch/paging/scheduler.py | Modernize typing for optional seq_meta. |
| lmdeploy/pytorch/multimodal/image_type.py | Remove unused ImageData type. |
| lmdeploy/pytorch/multimodal/data_type.py | Add mrope_pos_ids field to multimodal tensors and modernize typing. |
| lmdeploy/pytorch/multimodal/__init__.py | Update exports after multimodal type cleanup. |
| lmdeploy/pytorch/models/utils/cudagraph.py | Add generic cudagraph buffers/handling for MROPE + SSM and plumb through context updates. |
| lmdeploy/pytorch/models/qwen3_vl.py | Remove per-model cudagraph/mrope meta overrides (rely on shared pipeline). |
| lmdeploy/pytorch/models/qwen3_next.py | Remove per-model cudagraph SSM buffer overrides (rely on shared pipeline). |
| lmdeploy/pytorch/models/qwen3_5.py | Remove per-model cudagraph/mrope+SSM overrides (rely on shared pipeline). |
| lmdeploy/pytorch/models/qwen2_vl.py | Remove per-model cudagraph/mrope meta update logic and add MROPE pos-id generation in input processor. |
| lmdeploy/pytorch/models/qwen2_5_vl.py | Reuse Qwen2-VL input processor and remove duplicated processor / per-model cudagraph logic. |
| lmdeploy/pytorch/models/glm4_1v.py | Reuse Qwen2-VL input processor and remove duplicated processor / per-model cudagraph logic. |
| lmdeploy/pytorch/model_inputs.py | Add MROPE tensors to ModelInputs/StepContext and plumb into context creation. |
| lmdeploy/pytorch/messages.py | Add per-sequence MROPE history storage and automatic updates on token append. |
| lmdeploy/pytorch/engine/model_agent/inputs_maker.py | Pass dummy-meta into dummy forward inputs used by the model agent input maker. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Cache dummy-meta at agent construction and pass it into warmup dummy inputs. |
| lmdeploy/pytorch/engine/inputs_maker.py | Add use_mrope to engine inputs config and attach MROPE ids to ModelInputs. |
| lmdeploy/pytorch/engine/engine.py | Propagate ModelConfig.use_mrope into SequenceMeta construction. |
| lmdeploy/pytorch/configurations/qwen3_vl.py | Enable use_mrope for VL model configs. |
| lmdeploy/pytorch/configurations/qwen3_5.py | Enable use_mrope for Qwen3.5 configs. |
| lmdeploy/pytorch/configurations/glm4.py | Enable use_mrope for GLM4 moe-lite configs. |
| lmdeploy/pytorch/config.py | Add ModelConfig.use_mrope flag. |
| lmdeploy/pytorch/backends/cuda/graph_runner.py | Plumb use_mrope/is_ssm into cudagraph meta for buffer creation/filling. |
```python
if mrope_pos_ids is not None:
    mrope_pos_ids = mrope_pos_ids[:, indices]
```
ARModelInputsStrategy.index_select() applies indices (which are batch indices from ModelInputsDelta.indice_cpu) directly to mrope_pos_ids via mrope_pos_ids[:, indices]. But mrope_pos_ids is stored in packed-token layout (shape (3, sum_q_seqlens); see InputsMaker.create_model_inputs() concatenation + transpose), so selecting by batch indices will pick the wrong token columns whenever max_q_seqlen > 1 (e.g., speculative decoding / block decoding). This needs to select the token segments for the kept sequences (or reshape/unpack before indexing) so that mrope positions stay aligned with input_ids/seq_length.
Suggested change:

```diff
-if mrope_pos_ids is not None:
-    mrope_pos_ids = mrope_pos_ids[:, indices]
+if mrope_pos_ids is not None and indices is not None:
+    # mrope_pos_ids is stored in packed-token layout of shape
+    # (3, sum_q_seqlens), where tokens of all sequences are
+    # concatenated along dim=1. The `indices` tensor contains
+    # sequence (batch) indices, so we need to select the
+    # corresponding token segments and re-pack them, instead of
+    # indexing dim=1 directly by batch indices.
+    orig_seq_length = inputs.seq_length
+    if orig_seq_length is not None:
+        # Compute start offsets for each sequence in the packed layout.
+        # For lengths L[0..B-1], start_offsets[i] = sum_{j < i} L[j].
+        seq_lens = orig_seq_length.to(dtype=torch.long)
+        start_offsets = torch.zeros_like(seq_lens)
+        if seq_lens.numel() > 1:
+            start_offsets[1:] = torch.cumsum(seq_lens[:-1], dim=0)
+        # Gather segments for the kept sequences and concatenate.
+        selected_starts = start_offsets[indices].tolist()
+        selected_lens = seq_lens[indices].tolist()
+        segments = []
+        for start, length in zip(selected_starts, selected_lens):
+            end = start + length
+            segments.append(mrope_pos_ids[:, start:end])
+        if segments:
+            mrope_pos_ids = torch.cat(segments, dim=1)
+        else:
+            # No sequences kept; preserve correct shape on dim=0.
+            mrope_pos_ids = mrope_pos_ids[:, :0]
```
`update_model_metas`: `make_buffer` and `fill_buffer` callbacks for mrope and ssm.
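The make/fill callback pattern for cudagraph buffers might look like the sketch below. The function names and signatures are illustrative assumptions, not the actual lmdeploy cudagraph API; the point is that capture-time allocation and replay-time in-place filling are separated.

```python
import torch


def make_mrope_buffer(max_tokens, device='cpu'):
    """Pre-allocate a fixed-size MROPE buffer once, before graph capture."""
    return torch.zeros(3, max_tokens, dtype=torch.long, device=device)


def fill_mrope_buffer(buffer, mrope_pos_ids):
    """Copy the current step's position ids into the captured buffer.

    Cudagraph replay reuses the same memory addresses, so inputs must
    be copied in-place rather than re-allocated each step.
    """
    num_tokens = mrope_pos_ids.size(1)
    buffer[:, :num_tokens].copy_(mrope_pos_ids)
    return buffer[:, :num_tokens]
```

Registering such callbacks generically is what lets the shared cudagraph code handle MROPE and SSM buffers without per-model overrides.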