[Kernel Contribution] Atlas Inference DGX Spark NVFP4/FP8 CUDA kernels for Blackwell (GB10) SM121 architecture #866

@AzeezIsh

Hi @sayakpaul, following up on our exchange on X (https://x.com/AtlasInference/status/2055315705545863522).

We're building Atlas (https://github.com/Avarok-Cybersecurity/atlas), a pure Rust + CUDA inference engine currently targeting the NVIDIA DGX Spark (GB10, SM121/Blackwell). Atlas has no dependency chain, builds as a 2.5GB container, and cold starts in under 2 minutes, compared with the ~10 minutes typical of other major providers on the same hardware. There's no PyTorch, no Python, no 200-dependency install dance. We support Qwen3.5/3.6 (MoE + dense), Qwen3-Coder-Next, Qwen3-VL, MiniMax M2.7, Gemma 4, Nemotron-3 Nano/Super, and Mistral Small 4. We see ~100 tok/s on Qwen3.6-35B-A3B-NVFP4. These are unprecedented speeds for the DGX Spark; see our recent SparkRun collaboration as the top inference service for GB10, which is starting to publish these benchmarks on SparkArena (attached below). We're open source, and happy to ship with one of the biggest distributors in the world :)

[Image: SparkArena benchmark screenshot]

Because we support such a broad set of cutting-edge models, we have a wide range of hyper-optimized kernels to offer:

  • NVFP4 fused MoE dispatch: block-scaled FP4 expert routing + GEMM for hybrid MoE architectures (Qwen3-Coder-Next 80B, MiniMax M2.7 229B EP=2, Nemotron-3 Nano/Super)
  • MTP (Multi-Token Prediction) verify kernels, with WY-chunkwise GDN (Gated Delta Network) kernels, for K=2/3 speculative decoding across all supported models; one of the few functional, production MTP implementations for GB10
  • SSM/Mamba recurrent kernels, with custom GDN kernels, for hybrid SSM+Attention+MoE architectures
  • FP8/BF16 paged attention: FlashInfer-based, validated across all 8 model families, including MLA (Mistral) and MRoPE (Qwen3-VL)
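
For anyone unfamiliar with the format, here is a minimal CPU-side sketch of what "block-scaled FP4" means in the first bullet. It is an illustration only, not the Atlas kernel: it assumes the usual NVFP4 layout of 4-bit E2M1 values grouped into 16-element blocks with one scale per block plus a per-tensor scale. The names `decode_e2m1` and `dequantize_block` are hypothetical, the low-nibble-first packing order is an assumption, and the block scale is passed as a plain `f32` rather than decoded from FP8 to keep the example short.

```rust
/// Magnitudes selected by the three low bits of an FP4 E2M1 code.
const E2M1_MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Elements that share one scale factor (16 in the NVFP4 layout assumed here).
const BLOCK_SIZE: usize = 16;

/// Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0..=2 index the magnitude.
fn decode_e2m1(code: u8) -> f32 {
    let magnitude = E2M1_MAGNITUDES[(code & 0x7) as usize];
    if code & 0x8 != 0 {
        -magnitude
    } else {
        magnitude
    }
}

/// Dequantize one block of 16 values packed two per byte.
/// Assumption: the low nibble of each byte holds the earlier element.
/// `block_scale` stands in for the per-block FP8 scale (already converted to f32);
/// `tensor_scale` is the per-tensor scale.
fn dequantize_block(
    packed: &[u8; BLOCK_SIZE / 2],
    block_scale: f32,
    tensor_scale: f32,
) -> [f32; BLOCK_SIZE] {
    let scale = block_scale * tensor_scale;
    let mut out = [0.0f32; BLOCK_SIZE];
    for (i, byte) in packed.iter().enumerate() {
        out[2 * i] = decode_e2m1(byte & 0x0F) * scale;
        out[2 * i + 1] = decode_e2m1(byte >> 4) * scale;
    }
    out
}
```

A fused kernel would perform this scaling inside the GEMM rather than materializing the dequantized block; the sketch only shows the arithmetic each element goes through.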

ROCm roadmap: AMD just gifted us a Strix Halo, so ROCm ports are next, alongside GB10 development. Contributing through HF Kernels would let the broader ecosystem, beyond Atlas itself, benefit from both the CUDA and ROCm work. That is our ultimate goal: breakneck local speeds FOR the community.

We're happy to provide anything else on our end to help move this forward. Looking forward to working together!
