[Feat]: Support Qwen3.5 with MTP #4437
Conversation
Pull request overview
Adds PyTorch speculative decoding support for Qwen3.5 via an MTP-based draft model/proposer, including routing-expert recording and long-context chunk handling.
Changes:
- Introduces a `qwen3_5_mtp` proposer + `Qwen3_5MTPModel` and wires it through config/model maps/CLI & benchmarks.
- Refactors the speculative decoding flow to run sampling + rejection sampling inside `SpecModelAgent`, with expanded/sliced `SamplingInputs` and logprobs plumbing.
- Extends long-context chunking + MROPE/state-cache handling to support spec decoding and multimodal chunk cases; adds new unit tests for the spec agent + rejection sampler.
Reviewed changes
Copilot reviewed 45 out of 45 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/pytorch/spec_decode/test_spec_agent.py | New tests for spec-agent sampling/logprobs + SamplingInputs expand/slice helpers |
| tests/pytorch/spec_decode/test_reject_sample.py | New tests for rejection sampling + Triton kernels |
| lmdeploy/utils.py | Allows is_bf16_supported('auto') to take CUDA path |
| lmdeploy/pytorch/strategies/dllm/model_agent.py | Updates to functional ModelInputs.step() return value |
| lmdeploy/pytorch/strategies/ar/model_agent.py | Updates to functional ModelInputs.step() return value |
| lmdeploy/pytorch/strategies/ar_spec/sequence.py | Records routed experts alongside token updates; handles per-token expert splits |
| lmdeploy/pytorch/strategies/ar_spec/sampling.py | Adds num_spec_tokens to ARSpec sampling strategy |
| lmdeploy/pytorch/strategies/ar_spec/model_inputs.py | Spec-decoding dummy inputs tweaks; MROPE pos-id reshaping during input updates |
| lmdeploy/pytorch/strategies/ar_spec/model_agent.py | Extends ARSpec extra inputs (embeds/logprobs), cloning/merge/update logic; prefill/decoding adjustments |
| lmdeploy/pytorch/strategies/ar_spec/engine.py | Adds get_num_required_tokens() for scheduling in spec decode |
| lmdeploy/pytorch/strategies/ar_spec/init.py | Passes num_spec_tokens into ARSpec sampling strategy |
| lmdeploy/pytorch/spec_decode/spec_agent.py | Major refactor: sampling + rejection sampling inside spec agent; chunk carry-over; input-embed support |
| lmdeploy/pytorch/spec_decode/reject_sampler.py | Adds Triton greedy/random rejection sampling kernels; supports mixed greedy/random batches |
| lmdeploy/pytorch/spec_decode/proposers/qwen3_5_mtp.py | New proposer registering qwen3_5_mtp (shares target embeddings) |
| lmdeploy/pytorch/spec_decode/proposers/base.py | Makes decoding input update functional; adds embed_input_ids helper |
| lmdeploy/pytorch/spec_decode/proposers/init.py | Exports Qwen3.5 MTP proposer |
| lmdeploy/pytorch/spec_decode/base.py | Base spec agent now stores SpecDecodeConfig + num_spec_tokens |
| lmdeploy/pytorch/spec_decode/init.py | Passes misc_config into spec-agent builder; initializes base agent with config |
| lmdeploy/pytorch/paging/scheduler.py | Renames scheduling arg to num_required_tokens |
| lmdeploy/pytorch/nn/gated_delta.py | Adds spec-decoding state/conv offset handling + cache seqlens plumbing |
| lmdeploy/pytorch/models/utils/cudagraph.py | Adds block_size to graph meta; updates FA3 metadata building and MROPE requirement |
| lmdeploy/pytorch/models/qwen3_5.py | Adds optional input-embed return for spec/multimodal chunking; attention head gating + TP toggles |
| lmdeploy/pytorch/models/qwen3_5_mtp.py | New Qwen3.5 MTP draft model implementation + weight loader |
| lmdeploy/pytorch/models/qwen3_5_moe.py | Adds is_tp parameter plumbing; tracks spec-decoding build context |
| lmdeploy/pytorch/models/module_map.py | Registers Qwen3_5MTPModel in module map |
| lmdeploy/pytorch/models/deepseek_mtp.py | Removes position-0 embedding masking |
| lmdeploy/pytorch/model_inputs.py | Adds target_inputs_embeds, chunk flags, clone(), and makes step() functional (non-mutating) |
| lmdeploy/pytorch/kernels/cuda/pagedattention.py | Casts block offsets to int64 in kernels |
| lmdeploy/pytorch/kernels/cuda/flatten_kv_cache.py | Minor cleanup (removes stray whitespace) |
| lmdeploy/pytorch/engine/model_agent/agent.py | Integrates spec-agent into sampling path; passes misc_config; chunk-output lifecycle tweaks; shields async postprocess |
| lmdeploy/pytorch/engine/inputs_maker.py | Tracks multimodal presence for chunking; schedules with num_required_tokens; sets chunk flags |
| lmdeploy/pytorch/engine/executor/base.py | Changes default num_state_caches sizing |
| lmdeploy/pytorch/engine/executor/init.py | Passes spec_method/num_spec_tokens/block_size into model config building |
| lmdeploy/pytorch/engine/engine_loop.py | Adjusts logprobs aggregation to support multi-token steps (spec decode) |
| lmdeploy/pytorch/configurations/qwen3_5.py | Adds spec/draft model config handling; adjusts state shapes for spec decoding |
| lmdeploy/pytorch/config.py | Plumbs num_spec_tokens + block_size into model config construction |
| lmdeploy/pytorch/backends/gated_delta_rule.py | Extends gated-delta interface to accept spec_state_offsets |
| lmdeploy/pytorch/backends/cuda/op_backend.py | Uses model_config.block_size for flash-attn metadata |
| lmdeploy/pytorch/backends/cuda/graph_runner.py | Stores block_size in CUDA graph meta |
| lmdeploy/pytorch/backends/cuda/gated_delta_rule.py | Adds Triton select/scatter for spec-state offsets; plumbs cache seqlens to recurrent path |
| lmdeploy/pytorch/backends/cuda/causal_conv1d.py | Extends conv update to accept cache_seqlens |
| lmdeploy/pytorch/backends/causal_conv1d.py | Extends conv update interface to accept cache_seqlens |
| lmdeploy/cli/utils.py | Adds qwen3_5_mtp to --speculative-algorithm choices |
| benchmark/profile_throughput.py | Adds speculative decode CLI parsing and passes config to engine |
| benchmark/profile_pipeline_api.py | Adds speculative decode CLI parsing and passes config into pipeline |
```python
target_probs = target_logits.softmax(dim=-1, dtype=torch.float32)

# 3. Uniform random [batch, num_spec] (float64 to avoid exact 0.0)
uniform_probs = torch.rand(
    (batch_size, num_spec_tokens),
    dtype=torch.float64,
    device=device,
)

# 4. Recovered tokens via Gumbel-max trick
q = torch.empty(
    (batch_size, vocab_size),
    dtype=torch.float32,
    device=device,
)
q.exponential_()
inv_q = q.reciprocal()
```
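The exponential-noise form of the Gumbel-max trick used above can be checked in isolation: dividing a probability vector by i.i.d. Exp(1) noise and taking the argmax draws index `i` with probability `p_i`. A minimal NumPy sketch (not the PR's Triton kernel, and ignoring the residual-distribution adjustment used for recovered tokens):

```python
import numpy as np

# argmax(p / E), with E ~ Exp(1) i.i.d., samples index i with probability p_i.
rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.7])

q = rng.exponential(size=(100_000, probs.size))  # analogue of q.exponential_()
samples = np.argmax(probs / q, axis=-1)          # analogue of argmax(probs * inv_q)

# Empirical frequencies should be close to probs.
freq = np.bincount(samples, minlength=probs.size) / len(samples)
```

With 100k draws the empirical frequencies match `probs` to within a couple of percentage points, which is why the kernel can sample categorically with only elementwise ops plus an argmax.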
The random-rejection path uses torch.rand(...) and q.exponential_() without using sampling_inputs.random_seeds/random_offsets, so results will depend on the global RNG state and won’t be reproducible per-request (unlike the normal sampling path which uses seeded multinomial_sampling). Please wire in sampling_inputs RNG (seeds/offsets) for both uniform_probs and the Gumbel/exponential noise so speculative decoding stays deterministic under the same sampling inputs.
May improve in another PR.
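For reference, one way to make the noise reproducible per request is to drive it from an explicit `torch.Generator` seeded per row, rather than the global RNG (names below are illustrative; the actual `sampling_inputs.random_seeds`/`random_offsets` wiring may differ):

```python
import torch


def seeded_uniform(batch_seeds, num_cols, device='cpu'):
    """Per-row uniform noise driven by explicit seeds instead of global RNG state."""
    rows = []
    for seed in batch_seeds:
        gen = torch.Generator(device=device)
        gen.manual_seed(int(seed))
        rows.append(torch.rand(num_cols, dtype=torch.float64, device=device, generator=gen))
    return torch.stack(rows)


# Same seeds -> identical noise, regardless of what else has used the global RNG.
a = seeded_uniform([1234, 5678], 4)
b = seeded_uniform([1234, 5678], 4)
```

The same pattern applies to the exponential Gumbel noise (`torch.empty(...).exponential_(generator=gen)`), at the cost of a per-row generator loop unless it is fused into the kernel.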
```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'


def _run_async(coro):
    """Helper to run async function in sync test."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
```
This test file falls back to device='cpu' when CUDA is unavailable, but the exercised code path (FusedLogitsProcessor via async_sampling_logits) unconditionally uses CUDA stream APIs (torch.cuda.current_stream()), so it will error on non-CUDA runners. Add a pytest.mark.skipif(not torch.cuda.is_available(), ...) (or otherwise guard) so CPU-only CI doesn’t fail.
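A guard along the suggested lines could look like this (sketch; the test name is illustrative):

```python
import pytest
import torch

# Skip CUDA-stream-dependent tests on CPU-only runners instead of
# silently falling back to device='cpu' and then failing at runtime.
requires_cuda = pytest.mark.skipif(not torch.cuda.is_available(), reason='requires CUDA')


@requires_cuda
def test_spec_agent_sampling():  # illustrative test name
    ...
```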
```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'


def _make_peaked_logits(token_ids_2d, vocab_size):
    """Build logits where argmax(dim=-1) == token_ids_2d.

    token_ids_2d: list[list[int]] or Tensor [batch, num_spec]
    """
```
These tests fall back to CPU when CUDA isn’t available, but rejection_sample and the direct kernel invocations rely on Triton CUDA kernels. Without a skip guard, CPU-only CI will fail. Add pytest.mark.skipif(not torch.cuda.is_available(), ...) (and/or a Triton availability check) around these tests/classes.
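For the Triton-backed tests, a module-level gate covering both CUDA and Triton availability could look like this (sketch; the helper and reason string are illustrative):

```python
import importlib.util

import pytest


def _cuda_and_triton_available():
    """True only when both the triton package and a CUDA device are usable."""
    if importlib.util.find_spec('triton') is None:
        return False
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False


# Applies the skip to every test collected from this module.
pytestmark = pytest.mark.skipif(not _cuda_and_triton_available(),
                                reason='rejection-sampling kernels need CUDA + Triton')
```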
```diff
 # add more caches for eviction
 # TODO: Share memory between state cache and pageable cache
-num_state_caches = int(cache_config.max_batches + 8)
+num_state_caches = int(cache_config.max_batches + 1)
```
Could you comment on the "+1" here?
We just allocate one more state to be used for padding.
Motivation
Support Qwen3.5 MTP.
api_server
```shell
lmdeploy serve api_server \
    Qwen/Qwen3.5-35B-A3B \
    --backend pytorch \
    --tp 2 \
    --speculative-algorithm 'qwen3_5_mtp' \
    --speculative-num-draft-tokens 3 \
    --max-batch-size 128 \
    --session-len 65536
```
pipeline
Modification
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist