feat(turbomind): integrate cublasGemmGroupedBatchedEx for Qwen3.5 MoE inference on Blackwell GPUs with memory copy optimizations #4490
Conversation
Pull request overview
This PR adds an SM100 (Blackwell) MoE GEMM backend for TurboMind by integrating a cuBLAS grouped batched GEMM path and adjusting build/runtime logic so SM100 builds can coexist with SM90 CUTLASS kernels while working around SM100-specific memcpy instability.
Changes:
- Add `cublasGemmGroupedBatchedEx`-based grouped GEMM kernel for BF16/FP16 MoE on SM100, with workspace reuse and reduced per-launch overhead.
- Update build/arch plumbing (SM100 arch, split SM90 kernels into a separate target, conditional registration/compilation defines).
- Adjust MoE weight/layout and copy behavior for SM100 (skip tiled conversion for grouped BF16/FP16, unfuse GatedSiLU, avoid `cuMemcpyBatchAsync` on SM100).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/turbomind/models/llama/LlamaLinear.cu | Extends MoE gather to cover BF16/FP16 for unfused SM100 grouped path. |
| src/turbomind/models/llama/LlamaDenseWeight.cc | Disables fused GatedSiLU under SM100 grouped BF16/FP16 constraints. |
| src/turbomind/kernels/gemm/registry.h | Adds registry hook for SM100 grouped cuBLAS kernel. |
| src/turbomind/kernels/gemm/registry.cu | Conditionally registers SM90 kernels and SM100 grouped cuBLAS kernel. |
| src/turbomind/kernels/gemm/moe_utils_v2.h | Adds #pragma once. |
| src/turbomind/kernels/gemm/cublas.cu | Implements CublasGroupedKernel using cublasGemmGroupedBatchedEx. |
| src/turbomind/kernels/gemm/convert_v3.cu | Skips tiled conversion for SM100 grouped BF16/FP16 to match cuBLAS expectations. |
| src/turbomind/kernels/gemm/CMakeLists.txt | Splits SM90 kernels into gemm2_sm90, enables SM100 grouped cuBLAS via compile defs. |
| src/turbomind/kernels/gemm/arch.h | Adds Sm100 and updates compatibility ranges. |
| src/turbomind/core/copy.cc | Adds SM100 workaround to avoid cuMemcpyBatchAsync. |
| CMakeLists.txt | Adds 100a-real CUDA arch for B200 when CUDA ≥ 12.8. |
```cpp
// MoE gather: FP8 always. BF16/half when need_unfused_moe_gather (no idxs gather; e.g. SM100 grouped cuBLAS).
const bool need_unfused_moe_gather =
    (int)A.shape(0) != m && dense.epilogue != Epilogue::kGatedSilu;
if (indices && (A.dtype() == kFloat8_e4m3 || need_unfused_moe_gather)) {
```
`need_unfused_moe_gather` can make this branch call `invokeMoeDispatch()` for non-FP8 dtypes (e.g. float32) when `indices` is set and `dense.epilogue != kGatedSilu`. `invokeMoeDispatch()` only supports 8-bit and 16-bit element sizes and will `TM_CHECK(0)` on other types, so this can become a runtime crash for float models/tests. Consider tightening the condition to only enable the unfused gather for `kHalf`/`kBfloat16` (or `byte_size(A.dtype()) == 2`) in addition to FP8.
```diff
-if (indices && (A.dtype() == kFloat8_e4m3 || need_unfused_moe_gather)) {
+const bool supports_unfused_moe_gather =
+    A.dtype() == kFloat8_e4m3 || A.dtype() == kHalf || A.dtype() == kBfloat16;
+if (indices && (A.dtype() == kFloat8_e4m3 || (need_unfused_moe_gather && supports_unfused_moe_gather))) {
```
```cpp
// weight descriptor as Adesc; weight has no valid offsets -> Adesc.offsets=(nil) and Launch fails.
if (desc.group_axis != 0) {
    return false;
}
```
CublasGroupedKernel::is_feasible() does not verify desc.order_a/order_b/order_c against the layout assumptions in Launch (row-major A & D, col-major interpreted weight). Without these checks, this kernel could be selected for grouped GEMMs with different operand orders and then compute incorrect results. Add explicit order checks (and/or reuse Kernel::is_feasible logic for orders) while still allowing both kHalf and kBfloat16.
```diff
 }
 }
+// Launch assumes row-major A and D/C, with B interpreted as column-major weight.
+if (desc.order_a != Order::kRowMajor || desc.order_b != Order::kColMajor
+    || desc.order_c != Order::kRowMajor) {
+    return false;
+}
```
```cpp
if (weight_is_strided_ptrs) {
    const uintptr_t kBadB = 0x320936400ULL;
    if (B == nullptr || reinterpret_cast<uintptr_t>(B) == kBadB) {
        fprintf(stderr, "[TM][GEMM] CublasGrouped: B null or bad (B=%p)\n", (void*)B);
        return 1;
    }
    cudaPointerAttributes attr{};
    if (cudaPointerGetAttributes(&attr, B) != cudaSuccess || attr.type != cudaMemoryTypeDevice) {
        fprintf(stderr, "[TM][GEMM] CublasGrouped: B not device ptr (attr.type=%d)\n", (int)attr.type);
```
The hard-coded “bad pointer” sentinel value 0x320936400ULL is unexplained and appears to be debugging residue. It can cause false positives and is difficult to maintain/justify as a correctness check. Prefer relying on cudaPointerGetAttributes (and/or other validated invariants) or gate this sentinel check behind a debug-only macro with a clear explanation of its origin.
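If the sentinel is kept at all, one option is to fence it behind a debug-only macro so release builds rely solely on `cudaPointerGetAttributes`. A minimal host-side sketch (the macro `TM_GEMM_DEBUG_BAD_PTR` and the helper `is_suspect_ptr` are illustrative names, not from the PR):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical debug-only gate for the sentinel check; disabled by default so
// release builds do not compare against an unexplained magic address.
#ifndef TM_GEMM_DEBUG_BAD_PTR
#define TM_GEMM_DEBUG_BAD_PTR 0
#endif

static bool is_suspect_ptr(const void* B)
{
    if (B == nullptr) {
        return true;  // always reject null, in any build
    }
#if TM_GEMM_DEBUG_BAD_PTR
    // Sentinel address observed during one debugging session; if this check
    // survives, its origin should be documented here.
    const uintptr_t kBadB = 0x320936400ULL;
    if (reinterpret_cast<uintptr_t>(B) == kBadB) {
        return true;
    }
#endif
    return false;
}
```

In the real code the device-memory check via `cudaPointerGetAttributes` would follow this gate, as it already does.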
```cpp
// Use pre-allocated workspace for device pointer arrays (no cudaMalloc/Free per call)
const size_t one_array = active_count * sizeof(void*);
const size_t total_bytes = 3 * one_array;
TM_CHECK_LE(total_bytes, workspace.tensormaps_size);
```
TM_CHECK_LE(total_bytes, workspace.tensormaps_size) will abort the process if the workspace is too small. Since this is a runtime property (depends on active_count), consider returning a nonzero error code with a clear message instead, so callers can fail gracefully (similar to the other argument validation in this function).
```diff
-TM_CHECK_LE(total_bytes, workspace.tensormaps_size);
+if (total_bytes > workspace.tensormaps_size) {
+    fprintf(stderr,
+            "[TM][GEMM] CublasGrouped: workspace.tensormaps_size too small "
+            "(required=%zu, available=%zu, active_count=%d)\n",
+            total_bytes,
+            workspace.tensormaps_size,
+            active_count);
+    return 1;
+}
```
```cpp
// cuMemcpyBatchAsync is known to crash on sm_100 (Blackwell); use sequential path.
int device = 0;
(void)cudaGetDevice(&device);
int major = 0;
(void)cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
if (major >= 10) {
    for (unsigned i = 0; i < src_.size(); ++i) {
```
This SM100 workaround queries cudaGetDevice + cudaDeviceGetAttribute(cudaDevAttrComputeCapabilityMajor, ...) on every BatchCopy::Run() call. BatchCopy can be used in tight loops, so this adds avoidable overhead. Consider caching the compute capability once (e.g. static local, or a member initialized in the constructor, with a per-device cache if needed).
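The suggested caching can use a function-local `static`, which C++11 guarantees is initialized exactly once and thread-safely. A minimal sketch where `fake_query_major` stands in for the `cudaGetDevice` + `cudaDeviceGetAttribute` pair (names are illustrative, not from the PR):

```cpp
#include <cassert>

// Counts how often the "expensive" query actually runs, to demonstrate that
// the static local caches its result across calls.
static int query_count = 0;

static int fake_query_major()
{
    ++query_count;  // stand-in for cudaGetDevice + cudaDeviceGetAttribute
    return 10;      // pretend we are on SM100 (compute capability major == 10)
}

static bool is_sm100_or_newer()
{
    // Initialized on first call only; every later call reuses the cached value.
    static const int major = fake_query_major();
    return major >= 10;
}
```

A multi-GPU process that switches devices between calls would need a small per-device cache instead of a single static, as the comment notes.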
```cmake
set(_has_sm100 FALSE)
set(_archs_100 "${CMAKE_CUDA_ARCHITECTURES}")
list(FILTER _archs_100 INCLUDE REGEX "^100")
if(_archs_100 AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.5")
  set(_has_sm100 TRUE)
```
_has_sm100 is set but never used. If it’s not needed, remove it; if it is meant to drive later logic, wire it up so the intent is clear (unused variables in CMake can hide configuration bugs).
```diff
-set(_has_sm100 FALSE)
 set(_archs_100 "${CMAKE_CUDA_ARCHITECTURES}")
 list(FILTER _archs_100 INCLUDE REGEX "^100")
 if(_archs_100 AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.5")
-  set(_has_sm100 TRUE)
```
… inference on Blackwell GPUs with memory copy optimizations Co-authored-by: 2016202110015 <2016202110015@whu.edu.cn>
Motivation
TurboMind’s existing MoE path relies on CUTLASS-style fused/grouped kernels that target SM90. On NVIDIA Blackwell (SM100, e.g. B200), that path is not a drop-in replacement: building SM90 kernels for SM100 toolchains is problematic, and MoE inference needs a stable, vendor-supported grouped GEMM.
This PR adds a cuBLAS Grouped Batched GEMM path (`cublasGemmGroupedBatchedEx`, CUDA 12.5+) for BF16/FP16 MoE FFN on SM100, so models such as Qwen3.5 MoE can run on Blackwell. It also reduces per-launch overhead in the grouped cuBLAS launcher (fewer synchronizations, no per-call device malloc for pointer arrays, reuse of pre-allocated workspace) and applies a safe fallback where `cuMemcpyBatchAsync` is known to misbehave on SM100.

Goal: Correct and more efficient MoE inference on Blackwell without breaking existing architectures (H100 and below keep their current kernel selection).
Modification
Build / arch
- Add `100a-real` (B200) when using CUDA ≥ 12.8.
- Split SM90 kernels into a separate target (`gemm2_sm90`) compiled only for `90/90a`, so SM100-only builds remain valid while H100 compatibility is preserved when SM90 objects are still linked.
- Define `ENABLE_CUBLAS_GROUPED` when targeting SM100 and CUDA ≥ 12.5; register `CublasGroupedKernel` in the GEMM registry for arch ≥ 1000.

`cublas.cu`

- `CublasGroupedKernel` wrapping `cublasGemmGroupedBatchedEx` with the documented row-major ↔ col-major mapping for MoE (ragged M per expert).
- Reuse `workspace.tensormaps` for device-side A/B/C pointer tables (no `cudaMallocAsync` per call), stream ordering instead of extra barriers where safe, `cublasSetWorkspace` from `workspace.partials`, single-pass construction of active groups.
- `convert_v3.cu`: On SM100, skip tiled weight conversion for grouped BF16/FP16 so weights stay in the layout expected by grouped cuBLAS.
- `LlamaDenseWeight.cc`: On SM100 grouped path, disable fused GatedSiLU so activation runs outside the plain GEMM epilogue.
- `LlamaLinear.cu`: Extend MoE token gather to BF16/half when the unfused grouped path is required (aligned with FP8 gather + scale dispatch behavior).

Stability
- `copy.cc`: On SM100+, avoid `cuMemcpyBatchAsync` (crash workaround); use sequential `cudaMemcpyAsync` via existing `core::Copy`, with a cached compute-capability check to avoid querying the device every `Run()`.
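The fallback amounts to replacing one batched call with a per-entry loop. A host-side sketch with `std::memcpy` standing in for the asynchronous device copies issued by `core::Copy` (the `CopyJob` type and function name are illustrative):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// One (src, dst, size) triple per pending copy, mirroring the parallel
// src/dst/size arrays a BatchCopy-style helper accumulates.
struct CopyJob {
    const void* src;
    void*       dst;
    size_t      size;
};

// Sequential fallback: issue one copy per job instead of a single batched
// call. In the real code each iteration would be a cudaMemcpyAsync on the
// same stream, so ordering is preserved without extra synchronization.
void run_sequential_fallback(const std::vector<CopyJob>& jobs)
{
    for (const CopyJob& j : jobs) {
        std::memcpy(j.dst, j.src, j.size);
    }
}
```

This trades one API call for N, but stays on the documented, stable path until the batched API is reliable on SM100.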
- `moe_utils_v2.h`: Add `#pragma once`.
- `arch.h`: Add Sm100 and compatibility wiring.

BC-breaking (Optional)
No intentional API or config break for Python users or existing TurboMind deployments.
- Builds include `gemm2_sm90` automatically when the CMake logic enables it for H100 compatibility; artifact size may increase slightly for fat binaries.
- Downstream forks that patch MoE or GEMM registration should rebase carefully; others need no code changes.
Use cases (Optional)
- MoE inference (e.g. Qwen3.5 MoE) on Blackwell via `lmdeploy` when built with CUDA 12.5+ and SM100 in `CMAKE_CUDA_ARCHITECTURES`.

(Optional doc follow-up: mention Blackwell + grouped cuBLAS MoE in TurboMind build notes or supported-hardware table if the project maintains one.)
Checklist
- `pre-commit run --all-files` (or project CI) before merge; fix any reported issues.
- `cublasGemmGroupedBatchedEx` on SM100.