[Triton MoE] Add optimized Gluon kernel for AMD CDNA3 with K-dimension unrolling #2277

Open
jwu10003 wants to merge 7 commits into ROCm:main from jwu10003:issue1638-moe

jwu10003 commented Mar 14, 2026

Motivation

Performance optimization request from Ali for the following MoE configuration:

  • topk: 8
  • expert_num: 128
  • gemm0: A = [8192, 4096], B = [128, 384, 4096], C = [65536, 384]
  • gemm1: A = [65536, 192], B = [128, 4096, 192], C = [8192, 8, 4096]
  • w8a8 (8-bit weights, 8-bit activations)
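
The shapes above fit together as follows: each of the 8192 tokens is routed to topk = 8 experts, which gives the 65536 rows in gemm0's output and gemm1's input. A minimal sketch of that bookkeeping, using only the numbers listed above; the variable names are illustrative, and the remark about a gated activation is an assumption, not something this PR states:

```python
# Shape bookkeeping for the configuration above (values copied from the PR description).
tokens, hidden = 8192, 4096      # gemm0 A = [tokens, hidden]
topk, num_experts = 8, 128
gemm0_n, gemm1_k = 384, 192      # gemm0 output width, gemm1 input width

routed_rows = tokens * topk      # each token is sent to topk experts

gemm0 = {
    "A": (tokens, hidden),
    "B": (num_experts, gemm0_n, hidden),
    "C": (routed_rows, gemm0_n),
}
gemm1 = {
    "A": (routed_rows, gemm1_k),
    "B": (num_experts, hidden, gemm1_k),
    "C": (tokens, topk, hidden),
}

# Assumption: the 384 -> 192 halving between the two GEMMs would be consistent
# with a gated activation; the PR description does not say so explicitly.
assert gemm0_n == 2 * gemm1_k

print(routed_rows, gemm0["C"], gemm1["A"])
```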

Technical Details

  • Implements manual LICM since Gluon doesn't support automatic loop-invariant code motion
  • Pre-loads A matrix data outside the N-dimension loop to reduce memory traffic
  • Uses optimized BlockedLayout for efficient memory access patterns
  • Supports FP8 and INT8 quantization with per-tensor or block-wise scaling

Key Features

  • Layout optimization for efficient memory access and MFMA operations
  • K-dimension loop unrolling (up to 3 times) for small K sizes
  • Support for larger BLOCK_N tile sizes (up to 1024)
  • Manual loop-invariant code motion (LICM) for A matrix data loading
  • Efficient use of AMD CDNA3 MFMA instructions
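
One possible reading of "K-dimension loop unrolling (up to 3 times) for small K sizes" is that, when K spans at most three K tiles, the tile loop is replaced by straight-line steps. The sketch below shows that pattern in plain Python under that assumption; `BLOCK_K`, `MAX_UNROLL`, and both function names are hypothetical and are not taken from the kernel in this PR.

```python
# Illustrative only: a scalar dot product standing in for per-tile MFMA work.
BLOCK_K = 64    # hypothetical K tile size
MAX_UNROLL = 3  # "up to 3 times", per the feature list above


def tile_dot(a, b, k0, block_k):
    """Partial dot product over one K tile starting at k0."""
    k_end = min(k0 + block_k, len(a))
    return sum(a[k] * b[k] for k in range(k0, k_end))


def dot_k_unrolled(a, b, block_k=BLOCK_K):
    num_tiles = (len(a) + block_k - 1) // block_k
    acc = 0
    if num_tiles <= MAX_UNROLL:
        # Small K: straight-line tile steps, no loop over K tiles.
        if num_tiles >= 1:
            acc += tile_dot(a, b, 0, block_k)
        if num_tiles >= 2:
            acc += tile_dot(a, b, block_k, block_k)
        if num_tiles >= 3:
            acc += tile_dot(a, b, 2 * block_k, block_k)
        return acc
    # Large K: fall back to the ordinary loop over K tiles.
    for t in range(num_tiles):
        acc += tile_dot(a, b, t * block_k, block_k)
    return acc


small_a = list(range(192))  # K = 192, as in gemm1: 3 tiles of 64
small_b = [1] * 192
large_a = list(range(4096))  # K = 4096, as in gemm0: loop path
large_b = [2] * 4096

small_result = dot_k_unrolled(small_a, small_b)
large_result = dot_k_unrolled(large_a, large_b)
```

With the hypothetical tile size of 64, the gemm1 case (K = 192) takes the unrolled path and the gemm0 case (K = 4096) takes the loop path.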

Test Plan

pytest -v op_tests/triton_tests/moe/test_moe.py::test_fused_moe --junitxml=test-reports/triton.xml -s

Test Result

All tests that ran passed; the run was cut short when the terminal session shut down from inactivity.

Performance Improvement

  • Test case: test_fused_moe[False-False-dtype0-True-False-True-8192-4096-192-8-128]
  • Kernel duration: 2.33 ms -> 1.0 ms (57% reduction, about 2.3x faster)
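
For reference, the percentage above is the relative reduction in kernel duration. A small sketch of the arithmetic, using only the two timings reported here (the variable names are illustrative):

```python
# Timings copied from the PR description, in milliseconds.
before_ms, after_ms = 2.33, 1.0

reduction = (before_ms - after_ms) / before_ms  # fraction of time saved
speedup = before_ms / after_ms                  # how many times faster

print(f"{reduction:.0%} lower duration, {speedup:.2f}x speedup")
```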

jwu10003 requested a review from a team on March 14, 2026 07:25
jwu10003 marked this pull request as draft on March 14, 2026 07:25
jwu10003 changed the title from "Opt MoE on MI308 for small K dimension size" to "[Triton MoE] Add optimized Gluon kernel for AMD CDNA3 with K-dimension unrolling" on Mar 14, 2026
jwu10003 marked this pull request as ready for review on March 16, 2026 02:09