Removing enforce-eager enables cudagraph, and with cudagraph enabled, we get the following gsm8k score:
Remaining comments which could not be posted as review comments, to avoid the GitHub rate limit:
ruff
- Do not assign a lambda expression, use a def
- `typing.Any` imported but unused (ATOM/atom/models/qwen3_next.py, line 1 in bcc6170)
- `typing.Dict` imported but unused (ATOM/atom/models/qwen3_next.py, line 1 in bcc6170)
- `typing.Iterable` imported but unused (ATOM/atom/models/qwen3_next.py, line 1 in bcc6170)
- `typing.Set` imported but unused (ATOM/atom/models/qwen3_next.py, line 1 in bcc6170)
- Redefinition of unused `Optional` from line 1 (ATOM/atom/models/qwen3_next.py, line 10 in bcc6170)
- `transformers.Qwen3Config` imported but unused (ATOM/atom/models/qwen3_next.py, line 11 in bcc6170)
- `transformers.PretrainedConfig` imported but unused (ATOM/atom/models/qwen3_next.py, line 12 in bcc6170)
- `atom.model_ops.layernorm.RMSNorm` imported but unused (ATOM/atom/models/qwen3_next.py, line 20 in bcc6170)
- `atom.utils.forward_context.ForwardContext` imported but unused (ATOM/atom/models/qwen3_next.py, line 38 in bcc6170)
- Redefinition of unused `get_tp_group` from line 9 (ATOM/atom/models/qwen3_next.py, line 54 in bcc6170)
- `aiter.dist.parallel_state.get_tp_group` imported but unused (ATOM/atom/models/qwen3_next.py, line 54 in bcc6170)
- Undefined name `MambaStateDtypeCalculator` (ATOM/atom/models/qwen3_next.py, line 420 in bcc6170)
- Undefined name `MambaStateShapeCalculator` (ATOM/atom/models/qwen3_next.py, line 425 in bcc6170)
- Local variable `kv_cache_data` is assigned to but never used (ATOM/atom/models/qwen3_next.py, line 714 in bcc6170)
- Undefined name `self` (ATOM/atom/models/qwen3_next.py, line 1226 in bcc6170)
- Redefinition of unused `extract_layer_index` from line 217 (line 258 in bcc6170)
Pull request overview
This PR adds support for the Qwen3-Next model architecture on the ATOM framework, specifically the Qwen3-Next-80B-A3B-Thinking variant. The implementation focuses on supporting the model's unique gated delta net (GDN) linear attention mechanism alongside traditional full attention layers.
Changes:
- Added Qwen3-Next model configuration and architecture implementation
- Implemented GDN (Gated Delta Net) attention backend with Mamba SSM operations
- Extended model loader and runner to handle mixed attention architectures
- Added supporting utilities for layer index extraction and state management
Reviewed changes
Copilot reviewed 35 out of 38 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| atom/utils/selector.py | Added use_gdn parameter to attention backend selection |
| atom/utils/forward_context.py | Added GDNAttentionMetadata dataclass for GDN attention state |
| atom/models/utils.py | Added extract_layer_index utility with multi-attention module support |
| atom/model_ops/mamba_ops/*.py | Implemented Mamba SSM operations (state passing, chunk scan, BMM, cumsum) |
| atom/model_ops/fla_ops/*.py | Added FLA (Flash Linear Attention) operations including chunk-based delta rule |
| atom/model_ops/layernorm.py | Added RMSNormGated and GemmaRMSNorm implementations with SiLU activation |
| atom/model_ops/base_attention.py | Added LinearAttention module for GDN attention |
| atom/model_ops/attentions/gdn_attn.py | Implemented GDN attention backend and metadata builder |
| atom/model_ops/attention_gdn.py | Core GDN attention implementation with convolution and recurrence |
| atom/model_ops/attention_mha.py | Fixed ASM layout detection for mixed KV cache dimensions |
| atom/model_loader/loader.py | Added mamba v2 sharded weight loader and "mtp" parameter filtering |
| atom/model_engine/model_runner.py | Extended runner with GDN support, KV cache allocation for mixed architectures |
| atom/model_config/qwen3_next.py | Added complete Qwen3Next configuration class |
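As an illustration of what the `extract_layer_index` utility listed above typically does (a sketch under assumptions, not ATOM's actual implementation), it recovers the numeric layer position from a dotted module path such as `model.layers.3.self_attn`, which hybrid models need in order to decide whether a given layer uses full attention or GDN linear attention:

```python
def extract_layer_index(prefix: str) -> int:
    """Return the first integer component of a dotted module path.

    Illustrative sketch only; ATOM's real utility also supports
    modules that host multiple attention sub-layers.
    """
    for part in prefix.split("."):
        if part.isdigit():
            return int(part)
    raise ValueError(f"no layer index found in {prefix!r}")

print(extract_layer_index("model.layers.3.self_attn"))  # → 3
```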
ChuanLi1101 left a comment:
Leaving some comments for your reference.
Can we add a recipe like "https://github.com/ROCm/ATOM/blob/main/recipes/Qwen3-235b.md" in another PR to help others run this workload?
LGTM
📋 Motivation
This PR introduces support for the Qwen3-Next model (available at: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking/tree/main) within the ATOM framework.
The Qwen3-Next architecture introduces a novel Gated Delta Network (GDN) module, which requires specialized integration to enable efficient inference on ATOM.
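As background on the mechanism (a simplified reference recurrence, not the fused chunked kernels this PR integrates), the gated delta rule underlying GDN maintains a fast-weight state S that is decayed by a per-token gate and corrected toward each new key/value pair; all shapes and the exact update form below are illustrative assumptions:

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive per-token gated delta rule recurrence (reference sketch).

    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in [0, 1].
    S is a (d_k, d_v) fast-weight state. Each step decays S by alpha_t
    and applies a beta_t-scaled delta-rule correction that moves S's
    prediction for k_t toward v_t.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        pred = S.T @ k[t]                                   # (d_v,) current prediction
        S = alpha[t] * S + beta[t] * np.outer(k[t], v[t] - pred)
        out[t] = S.T @ q[t]                                 # read out with the query
    return out
```

The production kernels (fla_ops / mamba_ops) compute the same recurrence in parallel chunks rather than token by token, which is what makes it efficient on GPU.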
🔧 Technical Implementation
To implement Qwen3-Next support in ATOM, we followed vLLM's approach and integrated three core components:
1. Model Architecture (models/qwen3_next.py)
Defines the fundamental structure of the Qwen3-Next model
Implements the model layers and GDN blocks specific to this architecture
2. Metadata Management (metadata/gdn_attn.py)
Adds GDN metadata definitions and construction methods in prefill and decode phase
Handles the specialized attention mechanisms required by the Gated Delta Network
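For a sense of what such metadata carries across a mixed prefill/decode batch, here is an illustrative stand-in dataclass; every field name here is an assumption for illustration and does not reflect the actual `GDNAttentionMetadata` added in this PR:

```python
from dataclasses import dataclass, field

@dataclass
class GDNMetadataSketch:
    """Hypothetical GDN attention metadata (illustration only)."""
    num_prefill_tokens: int = 0
    num_decode_tokens: int = 0
    # per-sequence lengths for the prefill portion of the batch
    prefill_seq_lens: list = field(default_factory=list)
    # slot indices into the persistent recurrent-state cache
    state_indices: list = field(default_factory=list)

    @property
    def num_tokens(self) -> int:
        return self.num_prefill_tokens + self.num_decode_tokens

md = GDNMetadataSketch(num_prefill_tokens=128, num_decode_tokens=4,
                       prefill_seq_lens=[64, 64],
                       state_indices=[0, 1, 2, 3, 4, 5])
print(md.num_tokens)  # → 132
```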
3. Kernel Operations (ops/)
fla_ops: Implements Flash Linear Attention (FLA) operations optimized for GDN
mamba_ops: Provides Mamba-style state space model operations for efficient sequence processing
✅ Supported Features
Eager Mode Execution: Full support for standard inference
CUDA Graph Optimization: Enables graph capture for reduced kernel launch overhead and improved performance
🧪 Test Plan & Validation Strategy
Server:
Client:
Test Result
🔮 Future Enhancement Roadmap