[Kernel Contribution] Atlas Inference DGX Spark NVFP4/FP8 CUDA kernels for Blackwell (GB10) SM121 architecture

Hi @sayakpaul, following up on our exchange on X (https://x.com/AtlasInference/status/2055315705545863522).

We're building Atlas (https://github.com/Avarok-Cybersecurity/atlas), a pure Rust + CUDA inference engine currently targeting the NVIDIA DGX Spark (GB10, SM121/Blackwell). Atlas has no dependency chain, builds as a small 2.5GB container, and cold starts in under 2 minutes, compared to the ~10 minutes typical of other major providers on the same hardware. There's no PyTorch, no Python, no 200-dependency install dance. We support Qwen3.5/3.6 (MoE + dense), Qwen3-Coder-Next, Qwen3-VL, MiniMax M2.7, Gemma 4, Nemotron-3 Nano/Super, and Mistral Small 4. We see ~100 tok/s on `Qwen3.6-35B-A3B-NVFP4`. These are unprecedented speeds for the DGX Spark; see our recent SparkRun collaboration, in which Atlas ranks as the top inference service for GB10. The benchmarks are starting to be published on SparkArena (attached below). We're open source, and happy to ship with one of the biggest distributors in the world :)

<img width="2560" height="2368" alt="Image" src="https://github.com/user-attachments/assets/cb6c4270-9438-45f8-b7ed-0305792a705c" />

The breadth of cutting-edge models we support means we have a wide range of hyper-optimized kernels to offer:
- NVFP4 fused MoE dispatch: block-scaled FP4 expert routing + GEMM for hybrid MoE architectures (Qwen3-Coder-Next 80B, MiniMax M2.7 229B EP=2, Nemotron-3 Nano/Super)
- MTP (Multi-Token Prediction) verify kernels with WY-chunkwise GDN kernels for K=2/3 speculative decoding across all supported models, one of the few functional, production-grade MTP implementations for GB10
- SSM/Mamba recurrent kernels with custom GDN (Gated Delta Network) kernels for hybrid SSM+Attention+MoE architectures
- FP8 / BF16 paged attention, FlashInfer-based, validated across all 8 model families including MLA (Mistral) and MRoPE (Qwen3-VL)
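For readers unfamiliar with "block-scaled FP4", here is a minimal CPU-side sketch in Rust of what dequantizing such data involves. It is not taken from the Atlas kernels. It assumes the usual E2M1 value set for 4-bit floats, two codes packed per byte with the low nibble first, and one scale per block of elements. The function names and the nibble order are hypothetical, and the per-block scale is kept as a plain `f32` for readability, whereas NVFP4 stores it as FP8 over 16-element blocks.

```rust
/// E2M1 magnitudes indexed by the low three bits of a 4-bit code.
const E2M1_MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Decode one 4-bit E2M1 code; bit 3 is the sign bit.
fn decode_e2m1(code: u8) -> f32 {
    let magnitude = E2M1_MAGNITUDES[(code & 0x7) as usize];
    if code & 0x8 != 0 { -magnitude } else { magnitude }
}

/// Dequantize packed FP4 data (hypothetical helper).
/// Assumes two codes per byte, low nibble first, and one scale
/// per `block_size` consecutive elements.
fn dequantize_fp4_blocks(packed: &[u8], block_scales: &[f32], block_size: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for (i, byte) in packed.iter().enumerate() {
        for (j, code) in [byte & 0x0F, byte >> 4].into_iter().enumerate() {
            let idx = i * 2 + j; // element index in the unpacked tensor
            out.push(decode_e2m1(code) * block_scales[idx / block_size]);
        }
    }
    out
}
```

A fused GPU kernel would fold this scaling into the GEMM rather than materializing the dequantized values; the sketch only illustrates the data layout idea.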

ROCm roadmap: **AMD** just gifted us a Strix Halo, so ROCm ports are next alongside GB10 development. Contributing through HF Kernels would let the broader ecosystem, beyond Atlas itself, benefit from both the CUDA and ROCm work. That is our ultimate goal: breakneck local speeds for the community.

We're happy to provide anything else on our end to help move this forward. Looking forward to working together!
