
Conversation

k50112113 commented Jan 9, 2026

This PR is co-authored by @k50112113, @omuhamma (#61) and @farlukas (#116)

This PR provides Triton fusion/GEMM optimizations for DS FP4 and FP8. Please use the following AITER branch for testing for now, as some of the required PRs have not yet been merged to AITER main (a quick import check for the new kernels is sketched after the PR list below):
https://github.com/ROCm/aiter/tree/shaoclee/atom_triton_tmp_0106

The required AITER PRs include:

  1. [Triton] Triton A16WFP4 GEMM prequant aiter#1777
  2. [Triton] Triton a16w8 gemm preshuffle aiter#1778
  3. [Triton] Add Fused GEMM A8W8 + Split + Concat Triton Kernel aiter#1553 (review)
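
To confirm the installed AITER branch actually exposes the new kernels before launching the server, a quick import check like the following can help. This is a minimal sketch that only uses the module paths appearing in the review snippets further down; adjust it if the branch layout differs.

# Minimal sanity check: try importing the Triton kernels this PR relies on.
# Module paths are taken from the review snippets in this PR; adjust if needed.
import importlib

kernels = [
    ("aiter.ops.triton.gemm_afp4wfp4", "gemm_afp4wfp4_preshuffle"),
    ("aiter.ops.triton.gemm_a8w8_blockscale", "gemm_a8w8_blockscale_preshuffle"),
    ("aiter.ops.triton.gemm_a16w8_blockscale", "gemm_a16w8_blockscale_preshuffle"),
    ("aiter.ops.triton.fused_gemm_afp4wfp4_split_cat", "fused_gemm_afp4wfp4_preshuffle_split_cat"),
    ("aiter.ops.triton.fused_gemm_a8w8_blockscale_split_cat", "fused_gemm_a8w8_blockscale_preshuffle_split_cat"),
]

for module_name, symbol in kernels:
    try:
        module = importlib.import_module(module_name)
        assert hasattr(module, symbol), f"{symbol} not found in {module_name}"
        print(f"OK       {module_name}.{symbol}")
    except (ImportError, AssertionError) as exc:
        print(f"MISSING  {module_name}.{symbol} ({exc})")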

To activate the optimizations on ATOM, the following env variables are required (a sketch of how these flags might be read follows the block):

# for concurrency > 4, use AR + RMS_Quant + GEMM optimizations:
export ATOM_USE_TRITON_GEMM=1
# note: ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION is turned on automatically when ATOM_USE_TRITON_GEMM is on

# for concurrency = 4, use AR_RMS + Quant_GEMM optimizations:
export ATOM_USE_TRITON_GEMM=1
export ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION=0
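
For context, the sketch below shows roughly how such flags are usually read. use_triton_gemm() appears in the review snippets below, while env_flag() and use_ds_input_rmsnorm_quant_fusion() are hypothetical helper names used here for illustration only, not ATOM's actual implementation.

import os

def env_flag(name: str, default: str) -> bool:
    # Hypothetical helper: treat "0", "false", and empty values as off.
    return os.environ.get(name, default).strip().lower() not in ("0", "false", "")

def use_triton_gemm() -> bool:
    return env_flag("ATOM_USE_TRITON_GEMM", "0")

def use_ds_input_rmsnorm_quant_fusion() -> bool:
    # Per the note above: defaults to on whenever ATOM_USE_TRITON_GEMM is set,
    # but can be disabled explicitly (the concurrency = 4 configuration).
    if not use_triton_gemm():
        return False
    return env_flag("ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION", "1")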

The following commands, along with the above env vars, are used to derive e2e performance results (a quick server readiness check is sketched after the launch commands):

# for DS FP8
python -m atom.entrypoints.openai_server \
    --model /data/deepseek-ai/DeepSeek-R1-0528/ \
    -tp 8 \
    --block-size 1 \
    --server-port 8989 2>&1 | tee server.out

# for DS FP4
export ATOM_USE_TRITON_MXFP4_BMM=1
export AMDGCN_USE_BUFFER_OPS=1
python -m atom.entrypoints.openai_server \
    --model /data/DeepSeek-R1-0528-MXFP4-Preview \
    -tp 8 \
    --block-size 16 \
    --kv_cache_dtype fp8 \
    --server-port 8989 \
    2>&1 | tee server.out
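
Before starting the client sweep, it can be worth confirming the server is serving. Below is a minimal readiness check that assumes the entrypoint exposes the standard OpenAI-compatible /v1/models route on the port above (an assumption, not verified against ATOM's route table).

# Hypothetical readiness check against the OpenAI-compatible /v1/models route.
import json
import urllib.request

PORT = 8989  # matches --server-port above

with urllib.request.urlopen(f"http://localhost:{PORT}/v1/models", timeout=10) as resp:
    print(json.dumps(json.load(resp), indent=2))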

For the client command (a small result-parsing sketch follows the loop):

MODEL=<DS FP4 or FP8 model paths>
ISL=3500
OSL=1500
PORT=8989
for CONC in 4 256 128 64 32 16 8; do
    RESULT_FILENAME=${ISL}_${OSL}_${CONC}
    python /root/ATOM/atom/benchmarks/benchmark_serving.py \
        --model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
        --dataset-name=random \
        --random-input-len=$ISL --random-output-len=$OSL \
        --random-range-ratio 1.0 \
        --num-prompts=$(( $CONC * 8 )) \
        --max-concurrency=$CONC \
        --request-rate=inf --ignore-eos \
        --save-result --percentile-metrics="ttft,tpot,itl,e2el" \
        --result-dir=./ --result-filename=$RESULT_FILENAME.json 2>&1 | tee -a ${RESULT_FILENAME}.log
done
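
Each ${ISL}_${OSL}_${CONC}.json written by the loop holds the saved metrics. Since the exact key names depend on benchmark_serving.py, the sketch below just surfaces any ttft/tpot/itl/e2el/throughput fields generically rather than assuming a schema.

# Hypothetical post-processing helper: print latency/throughput fields from the
# saved benchmark results without assuming benchmark_serving.py's exact schema.
import glob
import json

for path in sorted(glob.glob("3500_1500_*.json")):
    with open(path) as f:
        result = json.load(f)
    picked = {k: v for k, v in result.items()
              if any(tag in k.lower() for tag in ("ttft", "tpot", "itl", "e2el", "throughput"))}
    print(path)
    print(json.dumps(picked, indent=2))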

DS FP8 performance comparisons and uplift:
[performance chart attached in the PR]

DS FP4 performance comparisons and uplift:
[performance chart attached in the PR]

ChuanLi1101 (Collaborator) left a comment:

Overall LGTM, approved for benchmark testing.

# Since Triton FP8 Blockscale GEMM is mostly slower than AITER GEMM, we turn off Triton FP8 GEMM
# from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale_preshuffle as gemm_a8w8_blockscale_bpreshuffle_triton
except:
    gemm_afp4wfp4_preshuffle = None
Collaborator:

Suggestion: Use specific exceptions and add logging:

if use_triton_gemm():
    try:
        from aiter.ops.triton.gemm_afp4wfp4 import gemm_afp4wfp4_preshuffle
    except ImportError as e:
        logger.warning(f"Triton FP4 GEMM not available: {e}")
        gemm_afp4wfp4_preshuffle = None

try:
    from aiter.ops.triton.fused_gemm_afp4wfp4_split_cat import fused_gemm_afp4wfp4_preshuffle_split_cat
    from aiter.ops.triton.fused_gemm_a8w8_blockscale_split_cat import fused_gemm_a8w8_blockscale_preshuffle_split_cat
except:
Collaborator:

if use_triton_gemm():
    try:
        from aiter.ops.triton.fused_gemm_afp4wfp4_split_cat import fused_gemm_afp4wfp4_preshuffle_split_cat
        from aiter.ops.triton.fused_gemm_a8w8_blockscale_split_cat import fused_gemm_a8w8_blockscale_preshuffle_split_cat
    except ImportError as e:
        logger.debug(f"Triton fused GEMM split_cat not available: {e}")
        fused_gemm_afp4wfp4_preshuffle_split_cat = None
        fused_gemm_a8w8_blockscale_preshuffle_split_cat = None

    from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale_preshuffle
    from aiter.ops.triton.gemm_a16w8_blockscale import gemm_a16w8_blockscale_preshuffle
except:
    gemm_afp4wfp4_preshuffle = None
Collaborator:

Add logger?
logger.warning(f"Triton GEMM kernels not available: {e}. Ensure AITER is up-to-date.")

shuffle=(m >= 32),
)

if m >= 32:
Collaborator:

Use module constant?

In both files, import or define:

from atom.models.deepseek_v2 import MXFP4_QUANT_BLOCK_SIZE

Then use:

if m >= MXFP4_QUANT_BLOCK_SIZE:
    x_scale = x_scale.view(torch.uint8).view(x_scale.shape[0] // MXFP4_QUANT_BLOCK_SIZE, -1)

# return x.view(h, b, d // 2), x_scales.view(h, b, d // 32)


def mxfp4_to_f32(x, is_threed):
Collaborator:

This duplicates the one in aiter.
