
[Feature] Add TurboMind support for Qwen3.5 models (dense + MoE)#4389

Merged
lvhan028 merged 16 commits into InternLM:main from lapy:qwen3.5 on Mar 10, 2026

Conversation

lapy (Contributor) commented Mar 2, 2026

[Feature] Add TurboMind support for Qwen3.5 models (dense + MoE)

Add TurboMind backend support for Qwen3.5 dense and MoE architectures, featuring Gated DeltaNet linear attention with mixed full/linear layers.

Key changes:

  • C++ Gated DeltaNet layer implementation with fused CUDA kernels for short convolution, decay gate, and delta rule recurrence
  • Head dimension 256 support for attention and decoding kernels
  • Mixed attention layer types (linear/full) via per-layer configuration
  • Linear attention weight export (in_proj_a, in_proj_b, conv1d, etc.)
  • AWQ mixed quantization support: dequantize attention O-proj to fp16 when QKV are already fp16 (modules_to_not_convert handling)
  • Separate weight_type / ffn_weight_type / expert_weight_type tracking in converter and C++ weight loading for mixed-quantization models
  • Model registration for Qwen3_5ForConditionalGeneration and Qwen3_5MoeForConditionalGeneration architectures

Tested with Qwen3.5-35B-A3B-AWQ (MoE) and Qwen3.5-27B-AWQ (dense).

Motivation

Qwen3.5 introduces a new hybrid architecture that alternates between Gated DeltaNet linear attention and standard full attention layers. This is a fundamentally different attention mechanism from what TurboMind currently supports — it replaces softmax attention with a linear recurrence (delta rule) combined with a short convolution and gating, enabling sub-quadratic sequence processing while retaining strong performance.
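The delta rule recurrence at the heart of Gated DeltaNet can be sketched for a single head as follows. This is a minimal NumPy illustration of the general mechanism under simplified assumptions (scalar gates, one head, no convolution), not the TurboMind kernel:

```python
import numpy as np

def delta_rule_step(S, q, k, v, alpha, beta):
    """One decode step of a simplified gated delta rule.

    S:     (d_k, d_v) recurrent state matrix for one head
    q, k:  (d_k,) query/key vectors
    v:     (d_v,) value vector
    alpha: scalar decay gate in (0, 1)
    beta:  scalar write strength in (0, 1)
    """
    S = alpha * S                         # exponential decay of the old state
    pred = k @ S                          # what the state currently predicts for k
    S = S + beta * np.outer(k, v - pred)  # delta rule: correct the prediction error
    o = q @ S                             # read out with the query
    return S, o

# Sequential decode: per-token cost is O(d_k * d_v), with no growing KV cache.
d_k, d_v = 4, 4
S = np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(8):
    q, k, v = rng.normal(size=(3, d_k))
    S, o = delta_rule_step(S, q, k, v, alpha=0.9, beta=0.5)
assert S.shape == (d_k, d_v) and o.shape == (d_v,)
```

The constant-size state matrix `S` is what replaces the KV cache for the linear-attention layers, which is what makes the sequence processing sub-quadratic.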

Qwen3.5 ships in two variants:

  • Qwen3.5-27B — dense (non-MoE), 64 layers, mixed linear/full attention (every 4th layer is full attention)
  • Qwen3.5-35B-A3B — MoE with shared experts, same hybrid attention pattern
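The hybrid pattern above reduces to indexing a per-layer type array at each decoder step. A schematic sketch (layer and class names mirror the PR's description, but the dispatch function itself is illustrative, not TurboMind's actual API):

```python
# Illustrative per-layer dispatch for a hybrid model where every 4th layer
# is full attention and the rest are linear attention.
FULL_ATTN, LINEAR_ATTN = 0, 1

def make_layer_types(num_layers):
    """Per-layer type array: 0 = full attention, 1 = linear attention."""
    return [FULL_ATTN if (i + 1) % 4 == 0 else LINEAR_ATTN
            for i in range(num_layers)]

def dispatch(layer_types):
    """Decide which layer implementation runs at each decoder step."""
    plan = []
    for i, lt in enumerate(layer_types):
        if lt == FULL_ATTN:
            plan.append((i, "UnifiedAttentionLayer"))  # softmax attention + KV cache
        else:
            plan.append((i, "GatedDeltaNetLayer"))     # linear attention, O(1) state
    return plan

plan = dispatch(make_layer_types(8))
assert plan[3] == (3, "UnifiedAttentionLayer")
assert plan[0] == (0, "GatedDeltaNetLayer")
assert sum(name == "UnifiedAttentionLayer" for _, name in plan) == 2
```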

Both variants use head dimension 256 (vs the typical 128) and feature AWQ quantization with mixed precision: QKV projections are kept in fp16 (modules_to_not_convert) while the O projection and FFN/MoE layers are quantized to int4. This PR adds full TurboMind backend support for both variants.
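In checkpoint terms, this mixed-precision split is driven by the quantization config's modules_to_not_convert list. A hypothetical sketch of how an exporter might consult it (the dict below is illustrative, not copied from the actual checkpoint):

```python
# Hypothetical quantize_config, illustrating the modules_to_not_convert pattern;
# the real checkpoint's config may differ in field names and layout.
quantize_config = {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": ["q_proj", "k_proj", "v_proj"],
}

def weight_dtype(module_name: str, cfg: dict) -> str:
    """Decide the storage dtype of a module under mixed AWQ quantization."""
    if any(m in module_name for m in cfg.get("modules_to_not_convert") or []):
        return "fp16"           # kept unquantized at export time
    return f"int{cfg['bits']}"  # AWQ-quantized

assert weight_dtype("model.layers.3.self_attn.q_proj", quantize_config) == "fp16"
assert weight_dtype("model.layers.3.self_attn.o_proj", quantize_config) == "int4"
```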

Modification

C++ / CUDA (Engine)

  • Gated DeltaNet layer (GatedDeltaNetLayer.cc/h): New layer type implementing the linear attention forward pass — short 1D convolution → SiLU gate → delta rule recurrence with exponential decay, producing output via a gated projection. Manages its own CUDA workspace allocations for recurrence state and intermediates.
  • Fused CUDA kernels (gated_delta_net_kernels.cu/h): GPU kernels for:
    • Short convolution with SiLU activation (fused conv1d + silu)
    • Decay gate computation (log(1 - sigmoid(x)) numerically stable path)
    • Delta rule recurrence (per-head state matrix update with decay and delta)
    • Gated output projection
  • Head dimension 256 attention/decoding codegen: New instantiation files for sm70/sm75/sm80 attention and decoding kernels at HeadDim=256, for all KV quantization variants (f16, u4, u8, bf16).
  • Mixed layer type support (unified_decoder.cc/h, llama_params.h): Per-layer layer_type array (0=full attention, 1=linear attention) read from model config, dispatching to either UnifiedAttentionLayer or GatedDeltaNetLayer at each decoder step.
  • Weight loading (LlamaDecoderLayerWeight.cc/h, LlamaDenseWeight.cc/h): Linear attention weights (in_proj_a, in_proj_b, conv1d, out_proj, gate) loaded alongside traditional attention weights. Added ffn_weight_type / expert_weight_type separation so that mixed-quantization models correctly use int4 for FFN layers even when attention weight_type is fp16.
  • RMS norm head_dim=256: Extended kernel instantiation to support the new head dimension.
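The numerically stable decay-gate path mentioned above rests on the identity log(1 - sigmoid(x)) = -softplus(x); evaluating it naively underflows to log(0) = -inf for large x. A NumPy sketch of the idea (the math the kernel uses, not the CUDA code itself):

```python
import numpy as np

def log_one_minus_sigmoid_naive(x):
    # 1 - sigmoid(x) rounds to 0 in floating point for large x -> log(0) = -inf
    with np.errstate(divide="ignore"):
        return np.log(1.0 - 1.0 / (1.0 + np.exp(-x)))

def log_one_minus_sigmoid_stable(x):
    # log(1 - sigmoid(x)) == -softplus(x), with softplus computed without
    # overflow: softplus(x) = max(x, 0) + log1p(exp(-|x|))
    return -(np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x))))

x = np.array([-5.0, 0.0, 5.0, 40.0])
naive = log_one_minus_sigmoid_naive(x)
stable = log_one_minus_sigmoid_stable(x)
assert np.allclose(naive[:3], stable[:3], atol=1e-6)  # agree where naive is OK
assert np.isinf(naive[3]) and np.isfinite(stable[3])  # naive underflows at x=40
```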

Python (Converter / Deployment)

  • Model registration (supported_models.py, archs.py): Registered Qwen3_5ForConditionalGeneration and Qwen3_5MoeForConditionalGeneration architectures.
  • Config (config.py): Added layer_types, d_conv (convolution width), expand_k/expand_v (KV expansion ratios), attn_output_gate fields to ModelConfig.
  • Reader (qwen.py): Qwen3_5ReaderMixin with:
    • _attn() override that detects mixed-quantization attention (QKV fp16 + O int4 AWQ) and dequantizes O-proj weights to fp16 at conversion time via _awq_dequant().
    • linear_attn() method exporting GDN-specific weights (in_proj_a, in_proj_b, conv1d, out_proj, gate), also with on-the-fly AWQ dequantization for linear attention modules that appear in modules_to_not_convert.
    • model_info() providing layer type arrays, convolution config, and correct inter_size handling (dense vs MoE).
  • Module (module.py): New LinearAttn export class for linear attention weight packing; permute_v2_partial() for partial rotary embedding with head_dim 256; merge_qkvg_v2() for Q/K/V/Gate merging; Attn._split_q_gate() for splitting interleaved Q+Gate projections.
  • Converter (converter.py): mixed_awq detection from quantize_config.modules_to_not_convert; separate weight_type (attention, fp16 for mixed), ffn_weight_type (FFN, int4), and expert_weight_type (MoE experts, int4); per-layer unquantized_expert_layers for layers excluded from quantization (e.g., model.layers.0).
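The separate weight-type tracking can be pictured as a small routing function. This is a hypothetical sketch of the idea only; the actual fields and detection logic live in converter.py:

```python
def resolve_weight_types(mixed_awq: bool, base: str = "int4") -> dict:
    """Per-component weight types for a (possibly) mixed-quantization model.

    With mixed AWQ (QKV kept fp16 via modules_to_not_convert), attention
    weights stay fp16 while dense FFN and MoE expert weights remain quantized.
    """
    if mixed_awq:
        return {"weight_type": "fp16",        # attention
                "ffn_weight_type": base,      # dense FFN
                "expert_weight_type": base}   # MoE experts
    return {"weight_type": base,
            "ffn_weight_type": base,
            "expert_weight_type": base}

types = resolve_weight_types(mixed_awq=True)
assert types["weight_type"] == "fp16" and types["ffn_weight_type"] == "int4"
```

Without this separation, a single global weight_type would force the FFN/MoE weights to fp16 whenever attention is unquantized, defeating the quantization.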

BC-breaking (Optional)

No breaking changes. All modifications are additive — new model architectures, new layer types, and new config fields with defaults that preserve existing behavior. Existing TurboMind models are unaffected.

Use cases (Optional)

import lmdeploy

# Dense 27B AWQ model
pipe = lmdeploy.pipeline(
    "QuantTrio/Qwen3.5-27B-AWQ",
    backend_config=lmdeploy.TurbomindEngineConfig(model_format="awq"))
print(pipe(["Hi, tell me about yourself"]))

# MoE 35B-A3B AWQ model
pipe = lmdeploy.pipeline(
    "QuantTrio/Qwen3.5-35B-A3B-AWQ",
    backend_config=lmdeploy.TurbomindEngineConfig(model_format="awq"))
print(pipe(["Explain quantum computing in simple terms"]))

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
    • flake8, isort, yapf, codespell all pass on changed Python files
    • clang-format applied to all changed C++/CUDA files
    • docformatter skipped (requires Python 3.10, which is not available in the dev env; CI will validate)
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
    • Existing converter unit tests pass (5/5). End-to-end tested manually with both model variants. Dedicated unit tests for GDN kernels and linear attention weight export are not yet added.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
    • No new dependencies. Uses existing dequantize_gemm from lmdeploy.pytorch.backends.default.awq_modules.
  4. The documentation has been modified accordingly, like docstring or example tutorials.
    • Documentation for Qwen3.5 support should be added to docs/ in a follow-up.

Test Results

Model Type Quantization Result
QuantTrio/Qwen3.5-35B-A3B-AWQ MoE (8 experts, top-2, shared expert) AWQ int4 (attention fp16) Coherent output, thinking tags working
QuantTrio/Qwen3.5-27B-AWQ Dense, 64 layers AWQ int4 mixed (QKV fp16, O int4 → dequantized) Coherent output, thinking tags working

Copilot AI review requested due to automatic review settings March 2, 2026 20:54

Copilot AI left a comment


Pull request overview

Adds TurboMind backend support for Qwen3.5 dense + MoE models by introducing Gated DeltaNet linear attention, head_dim=256 kernel support, and converter/export updates for mixed attention + mixed quantization.

Changes:

  • Implemented Gated DeltaNet linear-attention layer + CUDA kernels and integrated it into the UnifiedDecoder execution path.
  • Extended attention/decoding and RMSNorm kernels to support head_dim=256 and added optional attention output gating.
  • Updated Python converter/reader/export modules to register Qwen3.5 architectures, export linear-attn weights, and handle mixed AWQ quantization and per-layer unquantized overrides.

Reviewed changes

Copilot reviewed 53 out of 53 changed files in this pull request and generated 2 comments.

File Description
src/turbomind/turbomind.cc Parses new Qwen3.5 layer_types and linear-attention config fields into ModelParam
src/turbomind/models/llama/unified_decoder.h Adds GatedDeltaNetLayer member to decoder
src/turbomind/models/llama/unified_decoder.cc Dispatches per-layer attention vs linear-attention and forwards correct bias to residual path
src/turbomind/models/llama/unified_attention_layer.cc Adds optional output gating and adjusts QKV stride accounting
src/turbomind/models/llama/moe_ffn_layer.cc Routes sigmoid scoring func to compatible MoE gate kernel path
src/turbomind/models/llama/llama_params.h Adds ModelParam fields for layer_types, linear-attn params, and unquantized_expert_layers
src/turbomind/models/llama/llama_kernels.h Declares sigmoid gate multiply kernel entry point
src/turbomind/models/llama/llama_kernels.cu Implements sigmoid gate multiply CUDA kernel
src/turbomind/models/llama/gated_delta_net_kernels.h Declares CUDA kernels for Gated DeltaNet conv/recurrence/norm utilities
src/turbomind/models/llama/gated_delta_net_kernels.cu Implements CUDA kernels for Gated DeltaNet operations
src/turbomind/models/llama/LlamaDenseWeight.h Extends attention weight ctor to accept attn_output_gate flag
src/turbomind/models/llama/LlamaDenseWeight.cc Adjusts QKV output dim when attn_output_gate is enabled
src/turbomind/models/llama/LlamaDecoderLayerWeight.h Adds GatedDeltaNetWeight pointer to layer weights
src/turbomind/models/llama/LlamaDecoderLayerWeight.cc Instantiates linear-attn weights conditionally and applies mixed-quantization type routing
src/turbomind/models/llama/GatedDeltaNetWeight.h Introduces weight container for Gated DeltaNet tensors
src/turbomind/models/llama/GatedDeltaNetWeight.cc Implements parameter registration and prepare() for linear-attn weights
src/turbomind/models/llama/GatedDeltaNetLayer.h Declares GatedDeltaNet execution layer and request state handling
src/turbomind/models/llama/GatedDeltaNetLayer.cc Implements linear-attn forward pass with per-request persistent state
src/turbomind/models/llama/CMakeLists.txt Adds new GatedDeltaNet sources to Llama static lib
src/turbomind/models/CMakeLists.txt Adds new GatedDeltaNet sources to models static lib
src/turbomind/kernels/norm/rms_norm.cu Extends QK RMSNorm launcher to max_dim=256
src/turbomind/kernels/core/thread_map.h Clamps WarpThreadC default to WARP_SIZE for larger DimC
src/turbomind/kernels/attention/reduce.cu Adds ReduceV3 instantiations for head_dim=256
src/turbomind/kernels/attention/kv_cache_utils_v2.cu Adds KV cache processing/flattening dispatch for head_dim=256
src/turbomind/kernels/attention/decoding.cu Adds decoding dispatch for size_per_head=256
src/turbomind/kernels/attention/codegen/decoding_sm80_256_f16_u8.cu Adds SM80 decoding codegen instantiations (f16/u8, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm80_256_f16_u4.cu Adds SM80 decoding codegen instantiations (f16/u4, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm80_256_f16_f16.cu Adds SM80 decoding codegen instantiations (f16/f16, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm80_256_bf16_u8.cu Adds SM80 decoding codegen instantiations (bf16/u8, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm80_256_bf16_u4.cu Adds SM80 decoding codegen instantiations (bf16/u4, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm80_256_bf16_bf16.cu Adds SM80 decoding codegen instantiations (bf16/bf16, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm75_256_f16_u8.cu Adds SM75 decoding codegen instantiations (f16/u8, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm75_256_f16_u4.cu Adds SM75 decoding codegen instantiations (f16/u4, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm75_256_f16_f16.cu Adds SM75 decoding codegen instantiations (f16/f16, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm70_256_f16_u8.cu Adds SM70 decoding codegen instantiations (f16/u8, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm70_256_f16_u4.cu Adds SM70 decoding codegen instantiations (f16/u4, head_dim=256)
src/turbomind/kernels/attention/codegen/decoding_sm70_256_f16_f16.cu Adds SM70 decoding codegen instantiations (f16/f16, head_dim=256)
src/turbomind/kernels/attention/codegen/attention_sm80_256_f16.cu Adds SM80 attention codegen instantiations (f16, head_dim=256)
src/turbomind/kernels/attention/codegen/attention_sm80_256_bf16.cu Adds SM80 attention codegen instantiations (bf16, head_dim=256)
src/turbomind/kernels/attention/codegen/attention_sm75_256_f16.cu Adds SM75 attention codegen instantiations (f16, head_dim=256)
src/turbomind/kernels/attention/codegen/attention_sm70_256_f16.cu Adds SM70 attention codegen instantiations (f16, head_dim=256)
src/turbomind/kernels/attention/attention_config.h Adds head_dim=256 attention config specializations for SM70/SM75
src/turbomind/kernels/attention/attention.cu Adds attention dispatch for size_per_head=256
src/turbomind/kernels/attention/CMakeLists.txt Adds new 256 codegen compilation units
src/turbomind/engine/request.h Adds persistent per-request Gated DeltaNet state tensors
src/turbomind/core/module.h Adds include guards and minor formatting adjustments
lmdeploy/turbomind/supported_models.py Registers Qwen3.5 architectures for TurboMind
lmdeploy/turbomind/deploy/source_model/qwen.py Adds Qwen3.5 reader/model info including linear-attn export and mixed-AWQ handling
lmdeploy/turbomind/deploy/source_model/llama.py Makes intermediate_size optional in config parsing
lmdeploy/turbomind/deploy/module.py Adds partial RoPE permute, Q+gate splitting, QKVG merge, and LinearAttn export module
lmdeploy/turbomind/deploy/converter.py Adds per-layer unquantized overrides and avoids overwriting TP sizes when engine config fields are None
lmdeploy/turbomind/deploy/config.py Adds Qwen3.5 config fields (layer_types, linear-attn params, gating, unquantized_expert_layers)
lmdeploy/archs.py Treats Qwen3.5 architectures as non-VL LLMs in VL detection
Comments suppressed due to low confidence (7)

src/turbomind/models/llama/gated_delta_net_kernels.cu

  • The causal_conv1d_* kernels are launched with a 2D grid (grid.x=blocks, grid.y=batch_size) but the kernel indexing ignores blockIdx.y, causing each y-slice to redundantly process the full batch and race on out/conv_states. Fix by incorporating blockIdx.y into the global index (or use blockIdx.y as the batch index and only iterate channels in x), and set total to batch_size * conv_dim (and * seq_len for prefill) without relying on gridDim.y.
  • batch_size is currently unused in invokeFusedConv1dSiLU, which can trigger unused-parameter warnings (and potentially break builds that treat warnings as errors). Either remove it from the API, explicitly mark it unused (e.g., (void)batch_size), or implement the intended state offsetting logic using batch_size.

src/turbomind/models/llama/GatedDeltaNetLayer.cc

  • GatedDeltaNetLayer::Forward performs multiple GPU tensor allocations inside the per-request loop (conv_out, q/k/v_contig, optional expanded buffers). This will add significant allocator overhead at runtime (especially for decode where this runs every step). Consider switching these to a reusable workspace/scratch allocation (e.g., per-phase buffers sized to max tokens/seq_len) or using an existing workspace allocator to avoid repeated cudaMalloc/free patterns.

src/turbomind/models/llama/GatedDeltaNetWeight.cc

  • These dimensions use integer division by tp_size without validating divisibility. If num_k_heads/num_v_heads aren't divisible by tp_size, this will silently truncate and mis-shape weights/state. Add explicit checks (e.g., TM_CHECK_EQ(num_k_heads % tp_size, 0) and TM_CHECK_EQ(num_v_heads % tp_size, 0)) to fail fast with a clear error.
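For reference on the first comment, the per-batch-element computation those conv kernels implement can be modeled in plain NumPy as a fused short causal conv + SiLU with a rolling state (schematic shapes, not TurboMind's actual memory layout). The per-element `state` argument shows why each batch element must get its own state slice:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def causal_conv1d_silu(x, w, state):
    """Depthwise causal conv of width d_conv fused with SiLU, for ONE batch
    element; `state` holds that element's last d_conv-1 inputs and must not
    be shared across the batch.

    x:     (seq_len, dim)     new token activations
    w:     (d_conv, dim)      depthwise filter taps
    state: (d_conv - 1, dim)  trailing inputs from the previous call
    """
    d_conv = w.shape[0]
    hist = np.concatenate([state, x], axis=0)   # prepend the rolling state
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        win = hist[t:t + d_conv]                # causal window per channel
        out[t] = silu(np.sum(win * w, axis=0))  # depthwise conv + SiLU
    return out, hist[-(d_conv - 1):]            # updated state for next call

x = np.ones((5, 3))
w = np.full((4, 3), 0.25)
st = np.zeros((3, 3))
y, st = causal_conv1d_silu(x, w, st)
assert y.shape == (5, 3) and st.shape == (3, 3)
assert np.allclose(st, 1.0)  # state now holds the last 3 input rows
```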


lapy added a commit to lapy/lmdeploy that referenced this pull request Mar 3, 2026
- Fix tensor parallelism in module.py with proper assertions and validation
  - Add assertions for rotation dimension validation
  - Ensure rotary_dim is even for proper reshaping
  - Add bounds checking for rotary_dim vs size_per_head
  - Add divisibility check for output dimensions

- Fix attribute name typo in qwen.py
  - Correct 'attn_layer_patten' to 'attn_layer_pattern' in Qwen3_5ReaderMixin

- Improve MSVC compiler compatibility in rms_norm.cu
  - Use std::decay_t for proper template type deduction across compilers
tuilakhanh (Contributor) commented Mar 3, 2026

Got garbage output with V100 both dense and MoE model.

(screenshot of the garbled output attached)

lapy (Contributor, author) commented Mar 3, 2026

> Got garbage output with V100 both dense and MoE model.

Hi, thanks for testing it. Did you try the latest commit? There was a bug earlier for TP > 1. I am running TP=2 with no issues on 2x V100 32GB.

windreamer requested a review from lzhangzz, March 3, 2026 10:43
tuilakhanh (Contributor) replied:

> Hi, thanks for testing it. Did you try the latest commit? There was a bug earlier for TP > 1. I am running TP=2 with no issues on 2x V100 32GB.

Latest commit, and both TP=1 and TP>1.
Maybe caused by transformers 5?

check_env:

Details sys.platform: linux Python: 3.10.12 (main, Jan 26 2026, 14:55:28) [GCC 11.4.0] CUDA available: True MUSA available: False numpy_random_seed: 2147483648 GPU 0,1,2,3: Tesla V100-SXM2-32GB CUDA_HOME: /usr/local/cuda NVCC: Cuda compilation tools, release 12.8, V12.8.93 GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0 PyTorch: 2.10.0+cu128 PyTorch compiling details: PyTorch built with: - GCC 13.3 - C++ Version: 201703 - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) - OpenMP 201511 (a.k.a. OpenMP 4.5) - LAPACK is enabled (usually provided by MKL) - NNPACK is enabled - CPU capability usage: AVX512 - CUDA Runtime 12.8 - NVCC architecture flags: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_100,code=sm_100;-gencode;arch=compute_120,code=sm_120 - CuDNN 91.0.2 (built against CUDA 12.9) - Magma 2.6.1 - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=449b1768410104d3ed79d3bcfe4ba1d65c7f22c0, CUDA_VERSION=12.8, CUDNN_VERSION=9.10.2, CXX_COMPILER=/opt/rh/gcc-toolset-13/root/usr/bin/c++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_FBGEMM_GENAI -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format 
-Wno-dangling-reference -Wno-error=dangling-reference -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=OFF, USE_XPU=OFF,

TorchVision: 0.25.0+cu128
LMDeploy: 0.12.1+
transformers: 5.2.0
fastapi: 0.135.1
pydantic: 2.12.5
triton: 3.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV2 NV2 NV1 20-39,60-79 1 N/A
GPU1 NV2 X NV1 NV2 20-39,60-79 1 N/A
GPU2 NV2 NV1 X NV1 20-39,60-79 1 N/A
GPU3 NV1 NV2 NV1 X 20-39,60-79 1 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

lapy (Contributor, author) commented Mar 3, 2026

> Latest commit, and both TP=1 and TP>1. Maybe caused by transformers 5?

I'll take a better look in a few hours after work. Could you please share the command you used to test?

I am testing with e.g.:

lmdeploy chat /workspace/Qwen3.5-27B-AWQ --model-format awq --tp 2 --cache-max-entry-count 0.3 --session-len 16000

tuilakhanh (Contributor) replied:

All default, just add --tp. Model tested:
Qwen/Qwen3.5-2B
Qwen/Qwen3.5-35B-A3B
Qwen/Qwen3.5-27B

lapy (Contributor, author) commented Mar 3, 2026

> All default, just add --tp. Model tested: Qwen/Qwen3.5-2B, Qwen/Qwen3.5-35B-A3B, Qwen/Qwen3.5-27B

I haven’t tested the full Qwen models, only QuantTrio AWQ quants (refer to the end of the PR description for the model names). Please use QuantTrio for now if you want to test this PR.

tuilakhanh (Contributor) replied:

> I haven’t tested the full Qwen models, only QuantTrio AWQ quants (refer to the end of the PR description for the model names). Please use QuantTrio for now if you want to test this PR.

Confirmed working with QuantTrio/Qwen3.5-122B-A10B-AWQ and 4x V100s.

lapy (Contributor, author) commented Mar 3, 2026

> Confirmed working with QuantTrio/Qwen3.5-122B-A10B-AWQ and 4x V100s.

What’s the performance for you? Both prompt processing and generation?

tuilakhanh (Contributor) replied:

> What’s the performance for you? Both prompt processing and generation?

From log:

[2026-03-03 20:02:35 DP0] Avg thr (in/out): 0.0 / 21.9 tokens/s, API server (completed/routed/waiting): 2 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 4.3%, 
[2026-03-03 20:02:45 DP0] Avg thr (in/out): 0.0 / 21.9 tokens/s, API server (completed/routed/waiting): 2 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 4.4%, 
[2026-03-03 20:02:55 DP0] Avg thr (in/out): 0.0 / 20.3 tokens/s, API server (completed/routed/waiting): 3 / 0 / 0, Engine (running/waiting): 1 / 0, KV cache: 4.6%, 
INFO:     10.69.1.96:49276 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[TM][WARNING] [ProcessInferRequests] [4] total sequence length (7839 + 254305) exceeds `session_len` (209280), `max_new_tokens` is truncated to 201441
INFO:     10.69.1.96:33096 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[TM][WARNING] [ProcessInferRequests] [5] total sequence length (389 + 261755) exceeds `session_len` (209280), `max_new_tokens` is truncated to 208891
[2026-03-03 20:03:25 DP0] Avg thr (in/out): 822.7 / 20.2 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 4.1%, 
[2026-03-03 20:03:35 DP0] Avg thr (in/out): 0.0 / 37.0 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 4.3%, 
[2026-03-03 20:03:45 DP0] Avg thr (in/out): 0.0 / 37.0 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 4.4%, 
[2026-03-03 20:03:55 DP0] Avg thr (in/out): 0.0 / 36.8 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 4.6%, 
[2026-03-03 20:04:05 DP0] Avg thr (in/out): 0.0 / 37.0 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 4.8%, 
[2026-03-03 20:04:15 DP0] Avg thr (in/out): 0.0 / 36.8 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 4.9%, 
[2026-03-03 20:04:25 DP0] Avg thr (in/out): 0.0 / 36.8 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 5.1%, 
[2026-03-03 20:04:35 DP0] Avg thr (in/out): 0.0 / 36.8 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 5.3%, 
[2026-03-03 20:04:45 DP0] Avg thr (in/out): 0.0 / 36.8 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 5.5%, 
[2026-03-03 20:04:55 DP0] Avg thr (in/out): 0.0 / 36.8 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 5.7%, 
[2026-03-03 20:05:05 DP0] Avg thr (in/out): 0.0 / 36.6 tokens/s, API server (completed/routed/waiting): 3 / 2 / 0, Engine (running/waiting): 2 / 0, KV cache: 5.8%, 
[2026-03-03 20:05:15 DP0] Avg thr (in/out): 0.0 / 33.2 tokens/s, API server (completed/routed/waiting): 4 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 1.2%, 
[2026-03-03 20:05:25 DP0] Avg thr (in/out): 0.0 / 22.3 tokens/s, API server (completed/routed/waiting): 4 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 1.3%, 
[2026-03-03 20:05:35 DP0] Avg thr (in/out): 0.0 / 22.3 tokens/s, API server (completed/routed/waiting): 4 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 1.4%, 
[2026-03-03 20:05:45 DP0] Avg thr (in/out): 0.0 / 22.2 tokens/s, API server (completed/routed/waiting): 4 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 1.5%, 

lvhan028 added the "enhancement" (New feature or request) label, Mar 3, 2026
lvhan028 (Collaborator) commented Mar 4, 2026

Hi, @lapy
We were absolutely thrilled to see this PR! It's such a fantastic surprise. Thank you for another amazing contribution to the Turbomind Engine!

lvhan028 (Collaborator) commented Mar 4, 2026

Could you just update from main to fix the clang-format issue?

lapy added 9 commits March 4, 2026 17:10
Add TurboMind backend support for Qwen3.5 dense and MoE architectures,
featuring Gated DeltaNet linear attention with mixed full/linear layers.

Key changes:
- C++ Gated DeltaNet layer implementation with fused CUDA kernels for
  short convolution, decay gate, and delta rule recurrence
- Head dimension 256 support for attention and decoding kernels
- Mixed attention layer types (linear/full) via per-layer configuration
- Linear attention weight export (in_proj_a, in_proj_b, conv1d, etc.)
- AWQ mixed quantization support: dequantize attention O-proj to fp16
  when QKV are already fp16 (modules_to_not_convert handling)
- Separate weight_type / ffn_weight_type / expert_weight_type tracking
  in converter and C++ weight loading for mixed-quantization models
- Model registration for Qwen3_5ForConditionalGeneration and
  Qwen3_5MoeForConditionalGeneration architectures

Tested with Qwen3.5-35B-A3B-AWQ (MoE) and Qwen3.5-27B-AWQ (dense).
- Fix tensor parallelism in module.py with proper assertions and validation
  - Add assertions for rotation dimension validation
  - Ensure rotary_dim is even for proper reshaping
  - Add bounds checking for rotary_dim vs size_per_head
  - Add divisibility check for output dimensions

- Fix attribute name typo in qwen.py
  - Correct 'attn_layer_patten' to 'attn_layer_pattern' in Qwen3_5ReaderMixin

- Improve MSVC compiler compatibility in rms_norm.cu
  - Use std::decay_t for proper template type deduction across compilers
Address Windreamer's review: swap dispatch nesting order in
invokeQkRMSNorm and invokeRMSNormQK to resolve dtype first (outer)
and then launch for different head dims (inner).
This commit introduces several algorithmic and micro-architectural optimizations for the Gated Delta Net implementation, significantly reducing kernel launch overhead and improving memory bandwidth utilization, particularly on SM70 (Volta) hardware.

Key Optimizations:
1. Algorithmic Changes:
   - Single-Launch Prefill: Replaced the serial host-side loop with a parallel chunked-scan prefill kernel (invokeGatedDeltaRulePrefill).
   - In-Kernel GQA & L2Norm: Grouped-Query Attention handling and block-level L2 normalization are now fused directly into the delta rule kernels, eliminating redundant allocations and round-trips.
   - Fused Input Projections: Fused 4 independent GEMMs into a single 'in_proj_all' projection matrix, slicing the output columns directly to avoid extra memory reads.

2. SM70-Specific Enhancements:
   - Vectorization: Added half2/nv_bfloat162 vectorized memory access to the delta rule, compute_beta_g, and silu kernels, doubling data throughput in memory-bound operations.
   - Warp-Synchronous Reductions: Optimized block_l2_inv_norm to bypass shared memory and rely purely on __shfl_xor_sync when block size <= 32.
   - ILP Loop Unrolling: Unrolled d_conv loops in causal Conv1D kernels to improve instruction-level parallelism.

Bug Fixes:
- Added dynamic striding (in_stride, gate_stride) to ensure kernels correctly access non-contiguous columns output by the fused GEMM.
lapy (Contributor, author) commented Mar 4, 2026

> Could you just update from main to fix the clang-format issue?

Done! :)

lvhan028 (Collaborator) commented Mar 5, 2026

> Hi, @lapy We were absolutely thrilled to see this PR! It's such a fantastic surprise. Thank you for another amazing contribution to the Turbomind Engine!

Hi @lapy, just a quick follow-up regarding the merge process.
We have some internal PRs (#4391 #4396 ) focused on compilation and inference performance optimizations that need to land first. This may cause conflicts with your wonderful PR.
Please don't worry about resolving them. Our team will take care of the rebase and conflict resolution after merging those changes. We apologize for the slight delay and want to ensure your feature runs on the most optimized base.
Thanks for your patience!

On a separate note, we noticed that there's no email address listed on your GitHub profile, so I wanted to ask here: could you share your email with us? We'd love to learn a bit more about your experience if you don't mind sharing

lapy (Contributor, author) commented Mar 5, 2026

> Hi @lapy, just a quick follow-up regarding the merge process. We have some internal PRs (#4391 #4396) focused on compilation and inference performance optimizations that need to land first. This may cause conflicts with your wonderful PR. Please don't worry about resolving them. Our team will take care of the rebase and conflict resolution after merging those changes. We apologize for the slight delay and want to ensure your feature runs on the most optimized base. Thanks for your patience!
>
> On a separate note, we noticed that there's no email address listed on your GitHub profile, so I wanted to ask here: could you share your email with us? We'd love to learn a bit more about your experience if you don't mind sharing.

Thank you! Just sent you an email :)

Readon mentioned this pull request, Mar 7, 2026
lvhan028 merged commit af88290 into InternLM:main, Mar 10, 2026
9 checks passed
tuilakhanh (Contributor) commented:

Getting this error with V100:

[TM][FATAL] kernels/attention/decoding.cu(34): Check failed: kernel No decoding kernel found: decode_d256_f16_gs4


Labels

enhancement New feature or request


6 participants