feat(turbomind): integrate cublasGemmGroupedBatchedEx for Qwen3.5 MoE inference on Blackwell GPUs with memory copy optimizations #4490
Conversation
Pull request overview
This PR adds an SM100 (Blackwell) MoE GEMM backend for TurboMind by integrating a cuBLAS grouped batched GEMM path and adjusting build/runtime logic so SM100 builds can coexist with SM90 CUTLASS kernels while working around SM100-specific memcpy instability.
Changes:
- Add `cublasGemmGroupedBatchedEx`-based grouped GEMM kernel for BF16/FP16 MoE on SM100, with workspace reuse and reduced per-launch overhead.
- Update build/arch plumbing (SM100 arch, split SM90 kernels into a separate target, conditional registration/compilation defines).
- Adjust MoE weight/layout and copy behavior for SM100 (skip tiled conversion for grouped BF16/FP16, unfuse GatedSiLU, avoid `cuMemcpyBatchAsync` on SM100).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/turbomind/models/llama/LlamaLinear.cu | Extends MoE gather to cover BF16/FP16 for unfused SM100 grouped path. |
| src/turbomind/models/llama/LlamaDenseWeight.cc | Disables fused GatedSiLU under SM100 grouped BF16/FP16 constraints. |
| src/turbomind/kernels/gemm/registry.h | Adds registry hook for SM100 grouped cuBLAS kernel. |
| src/turbomind/kernels/gemm/registry.cu | Conditionally registers SM90 kernels and SM100 grouped cuBLAS kernel. |
| src/turbomind/kernels/gemm/moe_utils_v2.h | Adds #pragma once. |
| src/turbomind/kernels/gemm/cublas.cu | Implements CublasGroupedKernel using cublasGemmGroupedBatchedEx. |
| src/turbomind/kernels/gemm/convert_v3.cu | Skips tiled conversion for SM100 grouped BF16/FP16 to match cuBLAS expectations. |
| src/turbomind/kernels/gemm/CMakeLists.txt | Splits SM90 kernels into gemm2_sm90, enables SM100 grouped cuBLAS via compile defs. |
| src/turbomind/kernels/gemm/arch.h | Adds Sm100 and updates compatibility ranges. |
| src/turbomind/core/copy.cc | Adds SM100 workaround to avoid cuMemcpyBatchAsync. |
| CMakeLists.txt | Adds 100a-real CUDA arch for B200 when CUDA ≥ 12.8. |
```cpp
// MoE gather: FP8 always. BF16/half when need_unfused_moe_gather (no idxs gather; e.g. SM100 grouped cuBLAS).
const bool need_unfused_moe_gather =
    (int)A.shape(0) != m && dense.epilogue != Epilogue::kGatedSilu;
if (indices && (A.dtype() == kFloat8_e4m3 || need_unfused_moe_gather)) {
```
`need_unfused_moe_gather` can make this branch call `invokeMoeDispatch()` for non-FP8 dtypes (e.g. float32) when `indices` is set and `dense.epilogue != kGatedSilu`. `invokeMoeDispatch()` only supports 8-bit and 16-bit element sizes and will `TM_CHECK(0)` on other types, so this can become a runtime crash for float models/tests. Consider tightening the condition to only enable the unfused gather for `kHalf`/`kBfloat16` (or `byte_size(A.dtype()) == 2`) in addition to FP8.
```diff
-if (indices && (A.dtype() == kFloat8_e4m3 || need_unfused_moe_gather)) {
+const bool supports_unfused_moe_gather =
+    A.dtype() == kFloat8_e4m3 || A.dtype() == kHalf || A.dtype() == kBfloat16;
+if (indices && (A.dtype() == kFloat8_e4m3 || (need_unfused_moe_gather && supports_unfused_moe_gather))) {
```
```cpp
// weight descriptor as Adesc; weight has no valid offsets -> Adesc.offsets=(nil) and Launch fails.
if (desc.group_axis != 0) {
    return false;
}
```
CublasGroupedKernel::is_feasible() does not verify desc.order_a/order_b/order_c against the layout assumptions in Launch (row-major A & D, col-major interpreted weight). Without these checks, this kernel could be selected for grouped GEMMs with different operand orders and then compute incorrect results. Add explicit order checks (and/or reuse Kernel::is_feasible logic for orders) while still allowing both kHalf and kBfloat16.
```diff
 }
 }
+// Launch assumes row-major A and D/C, with B interpreted as column-major weight.
+if (desc.order_a != Order::kRowMajor || desc.order_b != Order::kColMajor
+    || desc.order_c != Order::kRowMajor) {
+    return false;
+}
```
```cpp
if (weight_is_strided_ptrs) {
    const uintptr_t kBadB = 0x320936400ULL;
    if (B == nullptr || reinterpret_cast<uintptr_t>(B) == kBadB) {
        fprintf(stderr, "[TM][GEMM] CublasGrouped: B null or bad (B=%p)\n", (void*)B);
        return 1;
    }
    cudaPointerAttributes attr{};
    if (cudaPointerGetAttributes(&attr, B) != cudaSuccess || attr.type != cudaMemoryTypeDevice) {
        fprintf(stderr, "[TM][GEMM] CublasGrouped: B not device ptr (attr.type=%d)\n", (int)attr.type);
```
The hard-coded “bad pointer” sentinel value 0x320936400ULL is unexplained and appears to be debugging residue. It can cause false positives and is difficult to maintain/justify as a correctness check. Prefer relying on cudaPointerGetAttributes (and/or other validated invariants) or gate this sentinel check behind a debug-only macro with a clear explanation of its origin.
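If the sentinel is kept at all, one option is to fence it behind a debug-only macro so release builds rely solely on `cudaPointerGetAttributes`. A minimal host-side sketch (the macro `TM_GEMM_DEBUG_BAD_PTR` and the helper `is_suspect_ptr` are illustrative names, not from the PR):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical debug-only gate for the sentinel check; disabled by default so
// release builds do not compare against an unexplained magic address.
#ifndef TM_GEMM_DEBUG_BAD_PTR
#define TM_GEMM_DEBUG_BAD_PTR 0
#endif

static bool is_suspect_ptr(const void* B)
{
    if (B == nullptr) {
        return true;  // always reject null, in any build
    }
#if TM_GEMM_DEBUG_BAD_PTR
    // Sentinel address observed during one debugging session; if this check
    // survives, its origin should be documented here.
    const uintptr_t kBadB = 0x320936400ULL;
    if (reinterpret_cast<uintptr_t>(B) == kBadB) {
        return true;
    }
#endif
    return false;
}
```

In the real code the device-memory check via `cudaPointerGetAttributes` would follow this gate, as it already does.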
```cpp
// Use pre-allocated workspace for device pointer arrays (no cudaMalloc/Free per call)
const size_t one_array = active_count * sizeof(void*);
const size_t total_bytes = 3 * one_array;
TM_CHECK_LE(total_bytes, workspace.tensormaps_size);
```
TM_CHECK_LE(total_bytes, workspace.tensormaps_size) will abort the process if the workspace is too small. Since this is a runtime property (depends on active_count), consider returning a nonzero error code with a clear message instead, so callers can fail gracefully (similar to the other argument validation in this function).
```diff
-TM_CHECK_LE(total_bytes, workspace.tensormaps_size);
+if (total_bytes > workspace.tensormaps_size) {
+    fprintf(stderr,
+            "[TM][GEMM] CublasGrouped: workspace.tensormaps_size too small "
+            "(required=%zu, available=%zu, active_count=%d)\n",
+            total_bytes,
+            workspace.tensormaps_size,
+            active_count);
+    return 1;
+}
```
```cpp
// cuMemcpyBatchAsync is known to crash on sm_100 (Blackwell); use sequential path.
int device = 0;
(void)cudaGetDevice(&device);
int major = 0;
(void)cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
if (major >= 10) {
    for (unsigned i = 0; i < src_.size(); ++i) {
```
This SM100 workaround queries cudaGetDevice + cudaDeviceGetAttribute(cudaDevAttrComputeCapabilityMajor, ...) on every BatchCopy::Run() call. BatchCopy can be used in tight loops, so this adds avoidable overhead. Consider caching the compute capability once (e.g. static local, or a member initialized in the constructor, with a per-device cache if needed).
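The suggested caching can use a function-local `static`, which C++11 guarantees is initialized exactly once and thread-safely. A minimal sketch where `fake_query_major` stands in for the `cudaGetDevice` + `cudaDeviceGetAttribute` pair (names are illustrative, not from the PR):

```cpp
#include <cassert>

// Counts how often the "expensive" query actually runs, to demonstrate that
// the static local caches its result across calls.
static int query_count = 0;

static int fake_query_major()
{
    ++query_count;  // stand-in for cudaGetDevice + cudaDeviceGetAttribute
    return 10;      // pretend we are on SM100 (compute capability major == 10)
}

static bool is_sm100_or_newer()
{
    // Initialized on first call only; every later call reuses the cached value.
    static const int major = fake_query_major();
    return major >= 10;
}
```

A multi-GPU process that switches devices between calls would need a small per-device cache instead of a single static, as the comment notes.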
```cmake
set(_has_sm100 FALSE)
set(_archs_100 "${CMAKE_CUDA_ARCHITECTURES}")
list(FILTER _archs_100 INCLUDE REGEX "^100")
if(_archs_100 AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.5")
  set(_has_sm100 TRUE)
```
_has_sm100 is set but never used. If it’s not needed, remove it; if it is meant to drive later logic, wire it up so the intent is clear (unused variables in CMake can hide configuration bugs).
```diff
-set(_has_sm100 FALSE)
 set(_archs_100 "${CMAKE_CUDA_ARCHITECTURES}")
 list(FILTER _archs_100 INCLUDE REGEX "^100")
 if(_archs_100 AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.5")
-  set(_has_sm100 TRUE)
```
… inference on Blackwell GPUs with memory copy optimizations Co-authored-by: 2016202110015 <2016202110015@whu.edu.cn>
Motivation
TurboMind’s existing MoE path relies on CUTLASS-style fused/grouped kernels that target SM90. On NVIDIA Blackwell (SM100, e.g. B200), that path is not a drop-in replacement: building SM90 kernels for SM100 toolchains is problematic, and MoE inference needs a stable, vendor-supported grouped GEMM.
This PR adds a cuBLAS Grouped Batched GEMM path (`cublasGemmGroupedBatchedEx`, CUDA 12.5+) for BF16/FP16 MoE FFN on SM100, so models such as Qwen3.5 MoE can run on Blackwell. It also reduces per-launch overhead in the grouped cuBLAS launcher (fewer synchronizations, no per-call device malloc for pointer arrays, reuse of pre-allocated workspace) and applies a safe fallback where `cuMemcpyBatchAsync` is known to misbehave on SM100.

Goal: Correct and more efficient MoE inference on Blackwell without breaking existing architectures (H100 and below keep their current kernel selection).
Modification
Build / arch
- Add `100a-real` (B200) when using CUDA ≥ 12.8.
- Split SM90 kernels into a separate target (`gemm2_sm90`) compiled only for `90/90a`, so SM100-only builds remain valid while H100 compatibility is preserved when SM90 objects are still linked.
- Define `ENABLE_CUBLAS_GROUPED` when targeting SM100 and CUDA ≥ 12.5; register `CublasGroupedKernel` in the GEMM registry for arch ≥ 1000.

`cublas.cu`

- `CublasGroupedKernel` wrapping `cublasGemmGroupedBatchedEx` with the documented row-major ↔ col-major mapping for MoE (ragged M per expert).
- Reuse `workspace.tensormaps` for device-side A/B/C pointer tables (no `cudaMallocAsync` per call), stream ordering instead of extra barriers where safe, `cublasSetWorkspace` from `workspace.partials`, single-pass construction of active groups.
- `convert_v3.cu`: On SM100, skip tiled weight conversion for grouped BF16/FP16 so weights stay in the layout expected by grouped cuBLAS.
- `LlamaDenseWeight.cc`: On SM100 grouped path, disable fused GatedSiLU so activation runs outside the plain GEMM epilogue.
- `LlamaLinear.cu`: Extend MoE token gather to BF16/half when the unfused grouped path is required (aligned with FP8 gather + scale dispatch behavior).

Stability
- `copy.cc`: On SM100+, avoid `cuMemcpyBatchAsync` (crash workaround); use sequential `cudaMemcpyAsync` via existing `core::Copy`, with a cached compute-capability check to avoid querying the device every `Run()`.
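The fallback amounts to replacing one batched call with a per-entry loop. A host-side sketch with `std::memcpy` standing in for the asynchronous device copies issued by `core::Copy` (the `CopyJob` type and function name are illustrative):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// One (src, dst, size) triple per pending copy, mirroring the parallel
// src/dst/size arrays a BatchCopy-style helper accumulates.
struct CopyJob {
    const void* src;
    void*       dst;
    size_t      size;
};

// Sequential fallback: issue one copy per job instead of a single batched
// call. In the real code each iteration would be a cudaMemcpyAsync on the
// same stream, so ordering is preserved without extra synchronization.
void run_sequential_fallback(const std::vector<CopyJob>& jobs)
{
    for (const CopyJob& j : jobs) {
        std::memcpy(j.dst, j.src, j.size);
    }
}
```

This trades one API call for N, but stays on the documented, stable path until the batched API is reliable on SM100.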
- `moe_utils_v2.h`: Add `#pragma once`.
- `arch.h`: Add Sm100 and compatibility wiring.

BC-breaking (Optional)
No intentional API or config break for Python users or existing TurboMind deployments.
- Builds include `gemm2_sm90` automatically when the CMake logic enables it for H100 compatibility; artifact size may increase slightly for fat binaries.
- Downstream forks that patch MoE or GEMM registration should rebase carefully; others need no code changes.
Use cases (Optional)
- MoE inference (e.g. Qwen3.5 MoE) on Blackwell via `lmdeploy` when built with CUDA 12.5+ and SM100 in `CMAKE_CUDA_ARCHITECTURES`.

(Optional doc follow-up: mention Blackwell + grouped cuBLAS MoE in TurboMind build notes or supported-hardware table if the project maintains one.)
Checklist
- `pre-commit run --all-files` (or project CI) before merge; fix any reported issues.
- `cublasGemmGroupedBatchedEx` on SM100.