NVFP4 Random Hadamard Transform (butterfly permutation-based) by matthiasdiener · Pull Request #509 · ROCm/TransformerEngine

matthiasdiener · 2026-03-27T23:29:43Z

Description

Implements RHT via a butterfly permutation-based algorithm for NVFP4.

This has restrictions similar to upstream:

BF16 only
Transpose path only (no identity path)
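
For readers unfamiliar with the transform, here is a rough host-side sketch. It assumes the usual definition of a randomized Hadamard transform: a ±1 sign flip per element, followed by a Walsh-Hadamard butterfly scaled by 1/sqrt(n). It is not the kernel from this PR: it works on `float` rather than BF16, ignores the transpose path, and the names `fwht16`, `rht16`, and `sign_mask` are hypothetical.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDim = 16;  // 16-point transform, as discussed in this PR

// In-place fast Walsh-Hadamard transform: log2(kDim) = 4 butterfly stages,
// each replacing a pair (a, b) at distance h with (a + b, a - b).
void fwht16(std::array<float, kDim>& v) {
  for (std::size_t h = 1; h < kDim; h *= 2) {
    for (std::size_t base = 0; base < kDim; base += 2 * h) {
      for (std::size_t i = base; i < base + h; ++i) {
        const float a = v[i];
        const float b = v[i + h];
        v[i] = a + b;
        v[i + h] = a - b;
      }
    }
  }
}

// Hypothetical helper: bit i of sign_mask set means element i is negated
// before the transform (the "random" part of the RHT).
std::array<float, kDim> rht16(std::array<float, kDim> v, std::uint16_t sign_mask) {
  for (std::size_t i = 0; i < kDim; ++i) {
    if ((sign_mask >> i) & 1u) v[i] = -v[i];
  }
  fwht16(v);
  for (float& x : v) x *= 0.25f;  // 1 / sqrt(16) keeps the transform orthonormal
  return v;
}

// Sum of squares, used to check that the transform preserves the vector norm.
float squared_norm(const std::array<float, kDim>& v) {
  float s = 0.0f;
  for (float x : v) s += x * x;
  return s;
}
```

Because the scaled transform is orthonormal, `squared_norm` is unchanged by `rht16` for any sign mask, and applying `rht16` twice with a zero mask returns the input.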

Fixes https://github.com/ROCm/frameworks-internal/issues/15732



ipanfilo · 2026-04-09T21:55:28Z

 namespace transformer_engine {
 namespace {

 constexpr int kThreadsPerWarp = 32;


It also seems unused on ROCm now, so the whole namespace could be guarded.

matthiasdiener · 2026-04-16

Done in cf2c8f6.

aris134 · 2026-04-10T21:21:08Z

+static constexpr int kHadamardDim     = 16;
+static constexpr int kWarpSize        = 64;
+static constexpr int kThreadsPerWHT   = 4;
+static constexpr int kElemsPerThread  = 4;
+static constexpr int kRowsPerWarp     = kWarpSize / kThreadsPerWHT;   // 16
+static constexpr int kWarpsPerBlock   = 4;
+static constexpr int kRowsPerBlock    = kRowsPerWarp * kWarpsPerBlock; // 64
+static constexpr int kThreadsPerBlock = kWarpSize   * kWarpsPerBlock;  // 256
+static constexpr float kHadamardScale = 0.25f;


These do not seem like arbitrary tuning knobs. A comment describing the layout scheme here would be helpful.

wangye805 · 2026-04-14

I think I can help answer part of your question.
kHadamardDim is the dimension of the Hadamard transform matrix. In this specific case, the Hadamard matrix is 16x16.
kHadamardScale is 1/sqrt(kHadamardDim) = 1/sqrt(16) = 0.25.

For the tiling constants:
kWarpSize is the number of threads per warp (or wavefront on our AMD platform).
kThreadsPerWHT is how many threads are needed for one 16-point Hadamard transform. Here it is set to 4, which means that each thread manages 4 inputs.
kRowsPerWarp is defined as kWarpSize/kThreadsPerWHT, probably because Matthias assigns one warp (64 threads) to process a 16x16 2D data block at a time, so one row of 16 inputs is handled by 4 threads.

As for why these tiling parameters were chosen this way, I'm not sure either.
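
The relationships among the constants described above can be checked at compile time. The sketch below copies the values from the quoted diff. The `lane_slot` mapping from a lane to a row is an assumption made for illustration (consecutive groups of 4 lanes share a row) and may not match how the kernel actually indexes its data.

```cpp
// Values copied from the quoted diff.
constexpr int kHadamardDim = 16;
constexpr int kWarpSize = 64;
constexpr int kThreadsPerWHT = 4;
constexpr int kElemsPerThread = 4;
constexpr int kRowsPerWarp = kWarpSize / kThreadsPerWHT;
constexpr int kWarpsPerBlock = 4;
constexpr int kRowsPerBlock = kRowsPerWarp * kWarpsPerBlock;
constexpr int kThreadsPerBlock = kWarpSize * kWarpsPerBlock;
constexpr float kHadamardScale = 0.25f;

// Four threads with four elements each cover one 16-point transform.
static_assert(kThreadsPerWHT * kElemsPerThread == kHadamardDim, "one WHT per thread group");
static_assert(kRowsPerWarp == 16, "16 rows per 64-wide wavefront");
static_assert(kRowsPerBlock == 64, "64 rows per block");
static_assert(kThreadsPerBlock == 256, "256 threads per block");
// 0.25 == 1 / sqrt(16), so scale^2 * dim == 1.
static_assert(kHadamardScale * kHadamardScale * kHadamardDim == 1.0f, "orthonormal scale");

// Hypothetical description of which row a lane works on and which of the
// kThreadsPerWHT positions it occupies within that row's thread group.
struct LaneSlot {
  int row;   // row index within the block, 0 .. kRowsPerBlock - 1
  int slot;  // position within the 4-thread group, 0 .. kThreadsPerWHT - 1
};

// Assumed mapping (illustrative only): lanes 0-3 share row 0, lanes 4-7 row 1, ...
constexpr LaneSlot lane_slot(int warp_id, int lane_id) {
  return LaneSlot{warp_id * kRowsPerWarp + lane_id / kThreadsPerWHT,
                  lane_id % kThreadsPerWHT};
}
```

Under this assumed mapping, the last lane of the last warp, `lane_slot(3, 63)`, lands on row 63 in slot 3, so the 256 threads cover exactly the 64 rows of a block.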

matthiasdiener · 2026-04-16

Thanks @wangye805 for answering.

I've added a comment describing these constants in cf2c8f6. These values aren't tuning knobs; they're determined by the problem structure. kThreadsPerWHT=4 follows from the Kronecker decomposition H16 = H4 ⊗ H4, where we reshape the 16-element vector into a 4×4 matrix with one column per thread. This means each thread holds 4 values, and the cross-thread butterfly stages use ds_swizzle for the H4.

Given that and a 64-wide wavefront, the rest follows: kRowsPerWarp = 64/4 = 16 rows per wavefront, and kWarpsPerBlock = 4 gives 64 rows per block.

aris134 · 2026-04-11T17:41:17Z

+        block_lam=fmaxf(block_lam,__shfl_xor(block_lam,off));
+
+      if (lane_id == 0)
+        atomicMaxFloat(amax_out, block_lam);


This seems correct, but from a performance perspective, did you consider a hierarchical/two-pass reduction instead of atomically combining block-local amax values into global memory? Since it is only one atomic per block, I can see the simplicity argument, but I was curious about the tradeoff.

matthiasdiener · 2026-04-16

As you said, atomic contention should be relatively small for this kernel. A two-stage reduction would require a workspace allocation plus another kernel launch. We can revisit this if profiling shows it to be a bottleneck.
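
The single-pass pattern under discussion can be modeled on the host as follows. This is a sketch under assumptions, not the kernel code: each "block" is simplified to one 64-lane array, the xor-shuffle is simulated by reading from a copy of the lanes, blocks are processed sequentially rather than concurrently, and `atomic_max_float` is a hypothetical stand-in for the kernel's `atomicMaxFloat`, whose implementation is not shown in this thread.

```cpp
#include <array>
#include <atomic>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t kWarpSize = 64;

// Host model of the xor-shuffle butterfly: after log2(64) = 6 steps,
// every lane (including lane 0) holds the maximum over all lanes.
float warp_max_butterfly(std::array<float, kWarpSize> lanes) {
  for (std::size_t off = kWarpSize / 2; off > 0; off /= 2) {
    const std::array<float, kWarpSize> prev = lanes;  // all lanes read before any lane writes
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
      lanes[lane] = std::fmax(prev[lane], prev[lane ^ off]);
    }
  }
  return lanes[0];
}

// Compare-and-swap loop standing in for a float atomic max.
void atomic_max_float(std::atomic<float>& target, float value) {
  float seen = target.load(std::memory_order_relaxed);
  while (value > seen &&
         !target.compare_exchange_weak(seen, value, std::memory_order_relaxed)) {
    // On failure compare_exchange_weak refreshes `seen`; the loop then
    // re-checks whether `value` is still larger.
  }
}

// Single pass: each block reduces locally, then issues one atomic update.
float amax_single_pass(const std::vector<std::array<float, kWarpSize>>& blocks) {
  std::atomic<float> amax{0.0f};
  for (const auto& block : blocks) {
    std::array<float, kWarpSize> mags{};
    for (std::size_t i = 0; i < kWarpSize; ++i) mags[i] = std::fabs(block[i]);
    atomic_max_float(amax, warp_max_butterfly(mags));
  }
  return amax.load();
}
```

With one atomic update per block, the number of contended operations equals the number of blocks, which is the basis for the simplicity argument above. A two-stage variant would instead write each block's result to a workspace and reduce that workspace in a second launch.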

Micky774 and others added 18 commits March 27, 2026 09:27

Typo fix (#397)

d954c6d

ROCm UserBuffers for Comm Overlap

7b5cf20

Copyrights and cleanup

640f7e8

test guards

82faeec

Cleanup and RS flag race condition fix

b6a3ae4

Debugging midpoint

9e32d3a

Cleanup and workspace fix

84209ad

Guard layer registration in UB

c669bd2

Cleanup of profiling example for rocm

8040909

Readd example script and update custom_map

e375923

fix typo

c6bd974

MI300 test skips due to jittery results

d76aa06

Comment regarding sm_margin performance

ae979d0

Variable renamed, pybind fix, tolerance tightening

b58cbd1

Remove git conflict

e5d7446

Address style and hip/cu specific paths

7734ce5

HIP guards

c169c75

initial impl

80e0aab

matthiasdiener self-assigned this Mar 27, 2026

matthiasdiener added 2 commits March 27, 2026 18:30

Merge remote-tracking branch 'origin/dev' into mdiener/fp4_hadamard

de7863a

test update

bda7b13

matthiasdiener added the "ci-level 1" (CI test level 1) label Mar 30, 2026

alextmagro and others added 7 commits March 30, 2026 14:03

Update extensions.h

7ddb539

Remove TODO regarding userbuffers

amax opt

63c7a48

simplify

a260459

Merge pull request #367 from ROCm/userbuffer_epic

3dd8af9

Userbuffer Enablement for ROCm

Merge remote-tracking branch 'origin/dev' into mdiener/fp4_hadamard

ab217cb

simplify pt 2

26c5fb7

expand test

2087f24

matthiasdiener changed the title from "[WIP] NVFP4 Hadamard" to "NVFP4 Random Hadamard Transform (butterfly permutation-based)" Mar 31, 2026

matthiasdiener added 3 commits April 6, 2026 15:15

merge

b243b4c

Merge branch 'dev' into mdiener/fp4_hadamard

6527004

change to __builtin_bit_cast

ca1aacf

matthiasdiener requested a review from ipanfilo April 7, 2026 17:23

matthiasdiener and others added 2 commits April 8, 2026 10:04

remove copyright header

bc9f0a3

Merge remote-tracking branch 'origin/dev' into mdiener/fp4_hadamard

9f1851d

matthiasdiener requested a review from aris134 April 9, 2026 16:43

ipanfilo reviewed Apr 9, 2026

wangye805 requested changes Apr 10, 2026

aris134 reviewed Apr 10, 2026

aris134 reviewed Apr 11, 2026

Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu

aris134 reviewed Apr 11, 2026

Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu

aris134 reviewed Apr 11, 2026

Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu

matthiasdiener added 5 commits April 13, 2026 12:29

Merge remote-tracking branch 'origin/dev' into mdiener/fp4_hadamard

739a20d

enable tests

f269097

Merge remote-tracking branch 'origin/dev' into mdiener/fp4_hadamard

346beb1

address reviewer comments

cf2c8f6

minor fixes

2772834

matthiasdiener force-pushed the mdiener/fp4_hadamard branch from bafafea to 2772834 Compare April 16, 2026 19:12

matthiasdiener requested review from aris134, ipanfilo and wangye805 April 16, 2026 19:15

PreRhtAmax optimizations

26c5cb1

wangye805 approved these changes Apr 17, 2026

aris134 approved these changes Apr 17, 2026

ipanfilo approved these changes Apr 17, 2026

matthiasdiener merged commit 55c411b into dev Apr 17, 2026
3 checks passed

matthiasdiener deleted the mdiener/fp4_hadamard branch April 17, 2026 19:01
