[Feat] Add FlyDSL MoE sorting kernel by amd-weisun · Pull Request #540 · ROCm/FlyDSL

amd-weisun · 2026-05-18T13:04:06Z

Drop-in replacement for the OPUS/CK moe_sorting kernels in aiter's fused_moe (aiter uses OPUS by default).

Kernel paths:

T <= 16: single-block LDS histogram + DPP prefix sum
16 < T <= 2048: p0v2 + p23 — per-expert scatter + parallel prefix sum
T > 2048: 4-kernel — K1 clear + K2 scatter + K3 count + p23
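The three thresholds above amount to a small dispatch on T. The following is a minimal sketch of that selection logic only, assuming T is the number of tokens; the names `SortPath` and `select_sort_path` are hypothetical and are not taken from kernels/moe_sorting_kernel.py.

```python
from enum import Enum


class SortPath(Enum):
    """Hypothetical labels for the three kernel paths listed in the PR description."""
    SINGLE_BLOCK = "single-block LDS histogram + DPP prefix sum"
    P0V2_P23 = "p0v2 + p23"
    FOUR_KERNEL = "K1 clear + K2 scatter + K3 count + p23"


# Thresholds copied from the PR description.
SINGLE_BLOCK_MAX_T = 16
P0V2_P23_MAX_T = 2048


def select_sort_path(num_tokens: int) -> SortPath:
    """Pick a kernel path from the token count T (illustrative only)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if num_tokens <= SINGLE_BLOCK_MAX_T:
        return SortPath.SINGLE_BLOCK
    if num_tokens <= P0V2_P23_MAX_T:
        return SortPath.P0V2_P23
    return SortPath.FOUR_KERNEL


if __name__ == "__main__":
    for t in (1, 16, 17, 2048, 2049, 16384):
        print(t, select_sort_path(t).name)
```

Both boundaries are inclusive on the lower path, matching "T <= 16" and "16 < T <= 2048" as written above.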

Correctness: 32 CI tests + 14 large_shape (46 total), covering all 11 production MoE models (E=8..513, topk=2..9).
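For readers unfamiliar with the operation under test, here is a rough pure-Python sketch of the histogram, prefix sum, and scatter steps named in the kernel paths. It is an assumption-laden illustration, not the reference used in tests/kernels/test_moe_sorting.py: the function name, the flat output layout, and the use of T as the padding sentinel are all hypothetical, and the real aiter output format may differ.

```python
from typing import List, Tuple


def moe_sort_reference(
    topk_ids: List[List[int]], num_experts: int, unit_size: int
) -> Tuple[List[int], List[int], int]:
    """Group token indices by expert, padding each group to a multiple of unit_size."""
    num_tokens = len(topk_ids)
    pad_id = num_tokens  # assumed sentinel: an index no real token can have

    # Step 1: histogram of how many (token, slot) pairs route to each expert.
    counts = [0] * num_experts
    for row in topk_ids:
        for expert in row:
            counts[expert] += 1

    # Step 2: round each expert's count up to a multiple of unit_size.
    padded = [-(-c // unit_size) * unit_size for c in counts]

    # Step 3: exclusive prefix sum gives each expert's start offset.
    offsets = []
    running = 0
    for size in padded:
        offsets.append(running)
        running += size
    total_padded = running

    # Step 4: scatter token indices into their expert's segment.
    sorted_token_ids = [pad_id] * total_padded
    cursor = list(offsets)
    for token, row in enumerate(topk_ids):
        for expert in row:
            sorted_token_ids[cursor[expert]] = token
            cursor[expert] += 1

    # Step 5: record which expert owns each unit_size-wide block.
    block_expert_ids: List[int] = []
    for expert, size in enumerate(padded):
        block_expert_ids.extend([expert] * (size // unit_size))

    return sorted_token_ids, block_expert_ids, total_padded


if __name__ == "__main__":
    ids, blocks, total = moe_sort_reference([[0, 2], [2, 1], [0, 2]], 3, 2)
    print(ids, blocks, total)
```

With three tokens, three experts, and unit_size=2, expert 1 receives one token and is padded to two slots, which is why a sentinel entry appears in its segment.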

Kernel Performance (moe sorting only)

Decode (T≤256): CUDA graph capture + 200 replays.
Prefill (T>256): CUDA events (eager mode), 100 iterations.
MI350X (gfx950), single GPU.

DeepSeek-R1: E=257, topk=9, unit_size=32

     T    Path    CK (us)  OPUS (us)  FlyDSL (us)  vs CK    vs OPUS
     1   graph     13.5      13.5         9.7    +28.5%    +28.2%
     2   graph     13.5      13.5         9.8    +27.3%    +27.3%
     4   graph     13.9      13.9         9.8    +29.4%    +29.3%
     8   graph     14.0      14.0         9.8    +29.7%    +29.6%
    16   graph     17.8      17.8        10.7    +39.8%    +39.9%
    32   graph     13.5      13.6        13.5     -0.1%     +0.2%
    64   graph     13.6      13.5        13.5     +0.2%     -0.1%
   128   graph     13.6      13.6        13.5     +0.9%     +0.9%
   256   graph     13.9      13.8        13.8     +0.5%     -0.1%
   512  events     10.1      23.5        12.0    -18.3%    +49.1%
  1024  events     12.1      23.8        11.8     +2.6%    +50.5%
  2048  events     24.0      47.9        17.7    +26.1%    +63.0%
  4096  events     20.9      30.2        17.2    +17.8%    +43.1%
  8192  events     31.1      31.3        25.0    +19.7%    +20.2%
 16384  events     52.9      53.1        40.9    +22.6%    +22.8%

DeepSeek-V4: E=385, topk=7, unit_size=32

     T    Path    CK (us)  OPUS (us)  FlyDSL (us)  vs CK    vs OPUS
     1   graph     16.0      16.0         9.7    +39.4%    +39.3%
     2   graph     16.3      16.3         9.8    +39.7%    +39.7%
     4   graph     16.6      17.0         9.9    +40.5%    +41.8%
     8   graph     17.1      17.1         9.9    +42.1%    +42.1%
    16   graph     21.1      21.1        11.9    +43.9%    +43.8%
    32   graph     13.7      13.6        13.8     -1.4%     -1.5%
    64   graph     13.8      13.7        13.9     -0.9%     -1.7%
   128   graph     13.9      13.9        13.9     +0.3%     -0.1%
   256   graph     14.1      14.0        14.1     -0.2%     -0.6%
   512  events     13.1      24.4        12.6     +3.6%    +48.2%
  1024  events     12.4      24.6        12.7     -2.2%    +48.5%
  2048  events     17.7      30.1        14.6    +17.4%    +51.3%
  4096  events     21.1      30.2        17.3    +18.1%    +42.8%
  8192  events     30.5      30.6        26.9    +11.7%    +12.2%
 16384  events     49.7      50.3        43.6    +12.3%    +13.4%

Qwen3-MoE: E=128, topk=4, unit_size=128

     T    Path    CK (us)  OPUS (us)  FlyDSL (us)  vs CK    vs OPUS
     1   graph     10.9      10.9         9.7    +11.2%    +11.2%
     2   graph     11.3      11.2         9.7    +14.2%    +14.1%
     4   graph     11.5      11.4         9.7    +15.3%    +15.0%
     8   graph     11.5      11.5         9.7    +15.5%    +15.9%
    16   graph     13.8      13.9        10.4    +25.0%    +25.3%
    32   graph     17.3      17.3        12.9    +25.4%    +25.4%
    64   graph     12.8      12.8        12.9     -0.7%     -1.1%
   128   graph     12.8      12.8        12.9     -0.2%     -0.3%
   256   graph     12.9      13.0        12.9     -0.3%     +0.4%
   512  events     11.7      24.6        13.0    -11.0%    +47.2%
  1024  events     10.3      24.6        13.0    -26.6%    +47.1%
  2048  events     16.2      30.3        12.7    +21.7%    +58.1%
  4096  events     17.7      30.2        18.1     -2.3%    +40.1%
  8192  events     22.3      30.3        20.6     +7.7%    +31.9%
 16384  events     33.5      33.5        30.8     +8.3%     +8.1%
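The "vs CK" and "vs OPUS" columns are not defined in the description. They appear consistent with a relative latency reduction, (baseline - FlyDSL) / baseline, so positive values mean FlyDSL is faster. The sketch below recomputes that quantity under this assumption; because the table shows latencies rounded to 0.1 us, recomputed values can differ from the printed percentages by a few tenths of a point.

```python
def latency_reduction_pct(baseline_us: float, flydsl_us: float) -> float:
    """Relative latency reduction in percent (assumed definition of the 'vs' columns)."""
    if baseline_us <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline_us - flydsl_us) / baseline_us * 100.0


if __name__ == "__main__":
    # DeepSeek-R1, T=2048 row: CK 24.0 us, OPUS 47.9 us, FlyDSL 17.7 us.
    print(f"vs CK:   {latency_reduction_pct(24.0, 17.7):+.1f}%")
    print(f"vs OPUS: {latency_reduction_pct(47.9, 17.7):+.1f}%")
    # DeepSeek-R1, T=512 row: FlyDSL is slower than CK, so the value is negative.
    print(f"vs CK:   {latency_reduction_pct(10.1, 12.0):+.1f}%")
```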

ATOM E2E Benchmark Result

DeepSeek-R1-0528 FP8, TP8, FP8 KV Cache, 8×MI350X
ISL=8192, OSL=1024, CONC=4, 40 prompts, 3 runs each

                          Baseline(OPUS)                            FlyDSL
Metric            Run1     Run2     Run3    Avg      Run1     Run2     Run3    Avg      Delta
─────────────────────────────────────────────────────────────────────────────────────────────
Output tok/s     329.09   329.37   328.99  329.15   336.36   336.75   337.00  336.70   +2.3%
Total tok/s     2958.34  2960.80  2957.46 2958.87  3023.69  3027.20  3029.42 3026.77   +2.3%
Mean TPOT (ms)    11.56    11.55    11.56   11.56    11.31    11.29    11.29   11.30   -2.2%
Median TPOT (ms)  11.62    11.60    11.62   11.61    11.36    11.34    11.33   11.34   -2.3%
P99 TPOT (ms)     12.00    11.99    12.00   12.00    11.76    11.74    11.72   11.74   -2.2%
Mean TTFT (ms)   277.28   279.32   278.18  278.26   276.19   277.24   275.73  276.39   -0.7%
Median TTFT (ms) 250.76   250.52   250.66  250.65   249.14   249.54   248.88  249.19   -0.6%
P99 TTFT (ms)    677.23   687.90   683.20  682.78   684.32   686.03   681.03  683.79   +0.1%
Mean ITL (ms)     11.55    11.54    11.55   11.55    11.30    11.28    11.27   11.28   -2.3%
Mean E2EL (ms) 10903.89 10897.20 10908.16 10903.08 10670.28 10655.88 10649.77 10658.64  -2.2%
P99 E2EL (ms)  12329.03 12330.07 12352.11 12337.07 12095.14 12091.76 12051.80 12079.57  -2.1%
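The Avg and Delta columns can be checked from the per-run values. This is a small sketch that assumes Avg is the arithmetic mean of the three runs and Delta is the relative change of the FlyDSL average against the baseline average; the helper name is hypothetical. Note the sign convention: for throughput rows a positive delta is an improvement, while for latency rows a negative delta is.

```python
from statistics import mean
from typing import Sequence


def relative_delta_pct(baseline_runs: Sequence[float], flydsl_runs: Sequence[float]) -> float:
    """Relative change of the FlyDSL run average versus the baseline run average, in percent."""
    baseline_avg = mean(baseline_runs)
    flydsl_avg = mean(flydsl_runs)
    return (flydsl_avg - baseline_avg) / baseline_avg * 100.0


if __name__ == "__main__":
    # Values copied from the "Output tok/s" and "Mean TPOT (ms)" rows above.
    output_tok_s = relative_delta_pct([329.09, 329.37, 328.99], [336.36, 336.75, 337.00])
    mean_tpot_ms = relative_delta_pct([11.56, 11.55, 11.56], [11.31, 11.29, 11.29])
    print(f"Output tok/s delta: {output_tok_s:+.1f}%")
    print(f"Mean TPOT delta:    {mean_tpot_ms:+.1f}%")
```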

Accuracy was validated using the LLM accuracy validation steps in ATOM.


Copilot

Pull request overview

Adds a new FlyDSL implementation of the MoE token sorting kernel (decode + prefill paths) intended as a drop-in replacement for aiter/OPUS/CK, along with a comprehensive GPU correctness test suite and optional benchmarking helpers.

Changes:

Introduce kernels/moe_sorting_kernel.py implementing decode (single-kernel LDS) and prefill (HBM workspace; p0v2+p23 or 4-kernel fused) sorting paths, including optional expert-mask (EP) support.
Add tests/kernels/test_moe_sorting.py validating outputs vs a Python reference and (optionally) aiter, plus EP-mode coverage and benchmark utilities.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
kernels/moe_sorting_kernel.py	New FlyDSL MoE sorting kernel implementation with decode/prefill dispatch, workspace handling, and EP support.
tests/kernels/test_moe_sorting.py	New GPU test suite comparing against a reference/aiter and covering multiple shapes + EP mode; includes optional benchmarking.


Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

amd-weisun force-pushed the moe_sorting branch 2 times, most recently from 4a16adf to 2e21d3c, on May 18, 2026 13:44

amd-weisun force-pushed the moe_sorting branch from 2e21d3c to ba693c7 on May 18, 2026 13:48

fix CI failure

188d480

amd-weisun marked this pull request as ready for review May 18, 2026 16:00

Copilot AI review requested due to automatic review settings May 18, 2026 16:00

Copilot started reviewing on behalf of amd-weisun May 18, 2026 16:01

Copilot AI reviewed May 18, 2026


Comment thread kernels/moe_sorting_kernel.py Outdated

Comment thread kernels/moe_sorting_kernel.py Outdated

Comment thread kernels/moe_sorting_kernel.py

Comment thread tests/kernels/test_moe_sorting.py

amd-weisun added 2 commits May 18, 2026 17:14

address copilot comments

052c6ce

fix python format

f93f086

amd-weisun requested a review from Copilot May 18, 2026 16:23

Copilot started reviewing on behalf of amd-weisun May 18, 2026 16:24

Copilot AI reviewed May 18, 2026


Comment thread kernels/moe_sorting_kernel.py Outdated

Comment thread kernels/moe_sorting_kernel.py Outdated

Comment thread tests/kernels/test_moe_sorting.py Outdated

Comment thread tests/kernels/test_moe_sorting.py Outdated

Comment thread tests/kernels/test_moe_sorting.py Outdated

address copilot comments

4406da2

amd-weisun requested a review from coderfeli May 18, 2026 18:01

amd-weisun wants to merge 5 commits into ROCm:main from amd-weisun:moe_sorting

2 participants
