Builtin mrope #4393

Open

grimoire wants to merge 12 commits into InternLM:main from grimoire:builtin-mrope
Conversation

@grimoire (Collaborator) commented Mar 4, 2026

  • MROPE no longer requires update_model_metas.
  • Remove the make_buffer and fill_buffer callbacks for MROPE and SSM.
  • Fix Qwen2/2.5 failing on transformers>=5 (broken by the lm_head update).

Copilot AI review requested due to automatic review settings March 4, 2026 02:43
Copilot AI left a comment

Pull request overview

This PR centralizes MROPE (multimodal rotary position embedding) position-id handling into the scheduler/engine pipeline, adds a meta mechanism for generating correct dummy inputs (SSM/MROPE aware), and refactors cudagraph buffer handling to be model-agnostic.

Changes:

  • Add ModelConfig.use_mrope and propagate it through SequenceMeta/scheduler history to generate and carry MROPE position ids end-to-end.
  • Extend make_dummy_inputs / ModelInputsStrategy.make_dummy with MakeDummyMeta so warmup/cudagraph capture includes optional SSM + MROPE inputs.
  • Remove per-model cudagraph/mrope meta update overrides in several Qwen/GLM model implementations in favor of shared infrastructure.
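The MakeDummyMeta mechanism described above could look roughly like the following sketch. The field names and the from-config factory are assumptions inferred from the change summary (only ModelConfig.use_mrope is confirmed by this PR), not the actual lmdeploy implementation:

```python
from dataclasses import dataclass


@dataclass
class MakeDummyMeta:
    """Hypothetical meta passed into dummy-input creation.

    Lets warmup/cudagraph capture know whether the model also needs
    MROPE position ids or SSM state tensors in its dummy inputs.
    """
    use_mrope: bool = False
    is_ssm: bool = False

    @classmethod
    def from_model_config(cls, model_config) -> 'MakeDummyMeta':
        # Read the flag added by this PR (ModelConfig.use_mrope) plus an
        # assumed SSM indicator; defaults keep non-MROPE configs working.
        return cls(
            use_mrope=getattr(model_config, 'use_mrope', False),
            is_ssm=getattr(model_config, 'is_ssm', False),
        )
```

With a meta like this, make_dummy_inputs can stay model-agnostic: it only branches on the flags rather than on per-model overrides.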

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.

File Description
lmdeploy/pytorch/strategies/dllm/model_inputs.py Thread MakeDummyMeta through DLLM dummy input creation.
lmdeploy/pytorch/strategies/base/model_inputs.py Introduce MakeDummyMeta, add optional dummy fields for SSM/MROPE, and a factory method from ModelConfig.
lmdeploy/pytorch/strategies/ar_spec/sequence.py Update sequence token update flow to also update MROPE history.
lmdeploy/pytorch/strategies/ar_spec/model_inputs.py Thread MakeDummyMeta through AR-spec dummy input creation.
lmdeploy/pytorch/strategies/ar/sequence.py Update sequence token update flow to also update MROPE history.
lmdeploy/pytorch/strategies/ar/model_inputs.py Add MROPE propagation in merge/index_select paths and thread MakeDummyMeta into dummy creation.
lmdeploy/pytorch/strategies/ar/model_agent.py Add MROPE propagation when building next-step decoding ModelInputs.
lmdeploy/pytorch/spec_decode/spec_agent.py Cache dummy-meta and pass it into warmup dummy inputs.
lmdeploy/pytorch/paging/scheduler.py Modernize typing for optional seq_meta.
lmdeploy/pytorch/multimodal/image_type.py Remove unused ImageData type.
lmdeploy/pytorch/multimodal/data_type.py Add mrope_pos_ids field to multimodal tensors and modernize typing.
lmdeploy/pytorch/multimodal/__init__.py Update exports after multimodal type cleanup.
lmdeploy/pytorch/models/utils/cudagraph.py Add generic cudagraph buffers/handling for MROPE + SSM and plumb through context updates.
lmdeploy/pytorch/models/qwen3_vl.py Remove per-model cudagraph/mrope meta overrides (rely on shared pipeline).
lmdeploy/pytorch/models/qwen3_next.py Remove per-model cudagraph SSM buffer overrides (rely on shared pipeline).
lmdeploy/pytorch/models/qwen3_5.py Remove per-model cudagraph/mrope+SSM overrides (rely on shared pipeline).
lmdeploy/pytorch/models/qwen2_vl.py Remove per-model cudagraph/mrope meta update logic and add MROPE pos-id generation in input processor.
lmdeploy/pytorch/models/qwen2_5_vl.py Reuse Qwen2-VL input processor and remove duplicated processor / per-model cudagraph logic.
lmdeploy/pytorch/models/glm4_1v.py Reuse Qwen2-VL input processor and remove duplicated processor / per-model cudagraph logic.
lmdeploy/pytorch/model_inputs.py Add MROPE tensors to ModelInputs/StepContext and plumb into context creation.
lmdeploy/pytorch/messages.py Add per-sequence MROPE history storage and automatic updates on token append.
lmdeploy/pytorch/engine/model_agent/inputs_maker.py Pass dummy-meta into dummy forward inputs used by the model agent input maker.
lmdeploy/pytorch/engine/model_agent/agent.py Cache dummy-meta at agent construction and pass it into warmup dummy inputs.
lmdeploy/pytorch/engine/inputs_maker.py Add use_mrope to engine inputs config and attach MROPE ids to ModelInputs.
lmdeploy/pytorch/engine/engine.py Propagate ModelConfig.use_mrope into SequenceMeta construction.
lmdeploy/pytorch/configurations/qwen3_vl.py Enable use_mrope for VL model configs.
lmdeploy/pytorch/configurations/qwen3_5.py Enable use_mrope for Qwen3.5 configs.
lmdeploy/pytorch/configurations/glm4.py Enable use_mrope for GLM4 moe-lite configs.
lmdeploy/pytorch/config.py Add ModelConfig.use_mrope flag.
lmdeploy/pytorch/backends/cuda/graph_runner.py Plumb use_mrope/is_ssm into cudagraph meta for buffer creation/filling.


Comment on lines +154 to +156:

    if mrope_pos_ids is not None:
        mrope_pos_ids = mrope_pos_ids[:, indices]

Copilot AI commented Mar 4, 2026

ARModelInputsStrategy.index_select() applies indices (which are batch indices from ModelInputsDelta.indice_cpu) directly to mrope_pos_ids via mrope_pos_ids[:, indices]. But mrope_pos_ids is stored in packed-token layout (shape (3, sum_q_seqlens); see InputsMaker.create_model_inputs() concatenation + transpose), so selecting by batch indices will pick the wrong token columns whenever max_q_seqlen > 1 (e.g., speculative decoding / block decoding). This needs to select the token segments for the kept sequences (or reshape/unpack before indexing) so that mrope positions stay aligned with input_ids/seq_length.

Suggested change:

    # current
    if mrope_pos_ids is not None:
        mrope_pos_ids = mrope_pos_ids[:, indices]

    # suggested
    if mrope_pos_ids is not None and indices is not None:
        # mrope_pos_ids is stored in packed-token layout of shape
        # (3, sum_q_seqlens), where tokens of all sequences are
        # concatenated along dim=1. The `indices` tensor contains
        # sequence (batch) indices, so we need to select the
        # corresponding token segments and re-pack them, instead of
        # indexing dim=1 directly by batch indices.
        orig_seq_length = inputs.seq_length
        if orig_seq_length is not None:
            # Compute start offsets for each sequence in the packed layout.
            # For lengths L[0..B-1], start_offsets[i] = sum_{j < i} L[j].
            seq_lens = orig_seq_length.to(dtype=torch.long)
            start_offsets = torch.zeros_like(seq_lens)
            if seq_lens.numel() > 1:
                start_offsets[1:] = torch.cumsum(seq_lens[:-1], dim=0)
            # Gather segments for the kept sequences and concatenate.
            selected_starts = start_offsets[indices].tolist()
            selected_lens = seq_lens[indices].tolist()
            segments = []
            for start, length in zip(selected_starts, selected_lens):
                end = start + length
                segments.append(mrope_pos_ids[:, start:end])
            if segments:
                mrope_pos_ids = torch.cat(segments, dim=1)
            else:
                # No sequences kept; preserve correct shape on dim=0.
                mrope_pos_ids = mrope_pos_ids[:, :0]
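To see why indexing the packed layout by batch indices goes wrong, here is a minimal standalone illustration using plain Python lists (not the lmdeploy types): two sequences with q_seqlens = [3, 2], keeping only the second sequence.

```python
# Packed-token layout: position ids of 2 sequences concatenated along
# the token axis (one of the 3 MROPE rows shown, for simplicity).
pos_ids = [0, 1, 2, 100, 101]  # seq0 -> [0, 1, 2], seq1 -> [100, 101]
seq_lens = [3, 2]
indices = [1]  # keep only the second sequence (a batch index)

# Wrong: treating the batch index as a token column picks token #1,
# which belongs to seq0, not to the tokens of seq1.
wrong = [pos_ids[i] for i in indices]

# Right: compute per-sequence start offsets in the packed layout,
# then gather whole token segments for the kept sequences.
start_offsets = [0]
for length in seq_lens[:-1]:
    start_offsets.append(start_offsets[-1] + length)

right = []
for i in indices:
    start = start_offsets[i]
    right.extend(pos_ids[start:start + seq_lens[i]])

print(wrong)  # [1]        -- a token of seq0, misaligned
print(right)  # [100, 101] -- the actual tokens of seq1
```

When every sequence has length 1 (plain decoding), the two approaches coincide, which is why the bug only surfaces when max_q_seqlen > 1.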


3 participants