Blockscaled Ragged Contiguous Grouped Gemm for MoEs #2790

Shreya-gaur · 2025-11-21T04:19:15Z

This PR introduces a blockscaled ragged contiguous grouped GEMM for MoEs.

In MoEs, the weights have the same dimensions across all experts and are stored contiguously in memory. The activations can have different dimensions for each expert and are not necessarily contiguous in memory.

In this PR, I modified the grouped GEMM kernel in CUTLASS to avoid tensormap updates for the weight matrix (matrix A) and its corresponding scale (SFA). The weights are loaded using TMA. Note that the activation matrix and its corresponding scale (SFB) still require tensormap updates whenever the group the kernel is working on changes.
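To make the layout concrete, here is a minimal host-side reference sketch in plain Python. It is not the CUTLASS kernel and leaves out block scaling, TMA, and tensormaps entirely. The names `Group` and `grouped_gemm_reference` are hypothetical, and the row-major layouts are assumptions chosen for illustration. It only shows why the weights can be addressed with a fixed per-expert offset while each activation group needs its own pointer and extent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Group:
    """Activations for one expert (hypothetical container).

    b is a K x n row-major flat list; n may differ per expert,
    and each group may live in its own allocation.
    """
    b: List[float]
    n: int


def grouped_gemm_reference(weights: List[float], m: int, k: int,
                           groups: List[Group]) -> List[List[float]]:
    """Compute D_e = A_e @ B_e for every expert e.

    weights holds all experts back to back: expert e starts at e * m * k.
    Because m and k are identical for every expert, this offset is a
    fixed stride, which is the property that lets a single descriptor
    cover all weights. The activations have no such regularity.
    """
    outputs = []
    for e, g in enumerate(groups):
        a_base = e * m * k            # fixed stride into the contiguous weights
        d = [0.0] * (m * g.n)         # M x n_e output, row-major
        for i in range(m):
            for j in range(g.n):      # ragged extent: n_e changes per group
                acc = 0.0
                for p in range(k):
                    acc += weights[a_base + i * k + p] * g.b[p * g.n + j]
                d[i * g.n + j] = acc
        outputs.append(d)
    return outputs


# Two experts with 2 x 2 weights: identity and 2 * identity, stored contiguously.
weights = [1.0, 0.0, 0.0, 1.0,
           2.0, 0.0, 0.0, 2.0]
groups = [
    Group(b=[3.0, 4.0], n=1),                       # expert 0 received 1 token
    Group(b=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n=3),   # expert 1 received 3 tokens
]
result = grouped_gemm_reference(weights, m=2, k=2, groups=groups)
print(result[0])  # [3.0, 4.0]
print(result[1])  # [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
```

In this sketch the only per-group state is the activation buffer and its extent `n`, which mirrors the point above: the weight side needs nothing recomputed when the group changes.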

Shreya Gaur added 3 commits November 20, 2025 20:00

Adding blockscaled ragged contiguous grouped gemm for MoEs

abe2ca1

cleaning up the example

7ee6e5d

introduction to example improved

1856752
