NVFP4 Random Hadamard Transform (butterfly permutation-based)#509

Merged
matthiasdiener merged 49 commits into dev from mdiener/fp4_hadamard
Apr 17, 2026
Conversation

Contributor
@matthiasdiener matthiasdiener commented Mar 27, 2026

Description

Implements the Random Hadamard Transform (RHT) via a butterfly permutation-based algorithm for NVFP4.

Has the same restrictions as the upstream implementation:

  • BF16 only
  • Transpose path only (no identity path)

Fixes https://github.com/ROCm/frameworks-internal/issues/15732
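As background (an illustrative model, not the kernel added in this PR), the effect of a random Hadamard transform can be sketched in NumPy: multiply by a random ±1 diagonal, then by a normalized Hadamard matrix. The transform is orthonormal, so it preserves norms while spreading outliers across all elements, which is what makes it useful before low-precision (NVFP4) quantization.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction: H_{2n} = [[H_n, H_n], [H_n, -H_n]]
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rht(x, rng):
    # Random Hadamard transform: y = (1/sqrt(n)) * H @ (signs * x)
    n = x.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=n)
    return (hadamard(n) @ (signs * x)) / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
y = rht(x, rng)
# Orthonormal transform: the 2-norm is preserved
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
```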

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Mar 27, 2026
@matthiasdiener matthiasdiener added the ci-level 1 CI test level 1 label Mar 30, 2026
@matthiasdiener matthiasdiener changed the title [WIP] NVFP4 Hadamard NVFP4 Random Hadamard Transform (butterfly permutation-based) Mar 31, 2026
@matthiasdiener matthiasdiener requested a review from ipanfilo April 7, 2026 17:23
@matthiasdiener matthiasdiener requested a review from aris134 April 9, 2026 16:43
namespace transformer_engine {
namespace {

constexpr int kThreadsPerWarp = 32;
Collaborator
It also seems unused on ROCm now so whole namespace could be guarded

Contributor Author
Done in cf2c8f6

Comment thread transformer_engine/pytorch/csrc/pybind.h
Comment thread tests/pytorch/nvfp4/test_nvfp4_rht_quantize_exact.py Outdated
Comment thread tests/pytorch/nvfp4/test_nvfp4_rht_quantize_exact.py Outdated
Comment thread tests/pytorch/nvfp4/test_nvfp4_rht_quantize_exact.py Outdated
Comment thread tests/pytorch/nvfp4/test_nvfp4_rht_quantize_exact.py Outdated
Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu
Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu
Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu
Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu
Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu Outdated
Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu Outdated
Comment on lines +500 to +508
static constexpr int kHadamardDim = 16;
static constexpr int kWarpSize = 64;
static constexpr int kThreadsPerWHT = 4;
static constexpr int kElemsPerThread = 4;
static constexpr int kRowsPerWarp = kWarpSize / kThreadsPerWHT; // 16
static constexpr int kWarpsPerBlock = 4;
static constexpr int kRowsPerBlock = kRowsPerWarp * kWarpsPerBlock; // 64
static constexpr int kThreadsPerBlock = kWarpSize * kWarpsPerBlock; // 256
static constexpr float kHadamardScale = 0.25f;
Contributor
These do not seem like arbitrary tuning knobs. Some comment describing the layout scheme here could be helpful

Collaborator

I think I can help answer part of your questions.
kHadamardDim is the dimension of the Hadamard transform matrix; in this specific case, the matrix is of size 16x16.
And kHadamardScale is 1/sqrt(Hadamard matrix dim), i.e. 1/sqrt(16) = 0.25.

For the tiling constants:
kWarpSize is the number of threads per warp (or wavefront on our AMD platform).
kThreadsPerWHT is how many threads cooperate on one 16-point Hadamard transform. Here it is set to 4, which means each thread manages 4 inputs.
kRowsPerWarp is defined as kWarpSize / kThreadsPerWHT, probably because Matthias assigns one warp (64 threads) to a 2D data block of size 16x16, so one row of 16 inputs can be handled by 4 threads.

But as for why those tiling parameters were chosen this way, I'm not quite sure either.

Contributor Author

Thanks @wangye805 for answering.

I've added a comment explaining these constants in cf2c8f6. These values aren't tuning knobs; they're determined by the problem structure. kThreadsPerWHT = 4 follows from the Kronecker decomposition H16 = H4 ⊗ H4: we reshape the 16-element vector into a 4×4 matrix with one column per thread, so each thread holds 4 values and the cross-thread butterfly stages use ds_swizzle for the H4 factor.

Given that and a 64-wide wavefront, the rest follows: kRowsPerWarp = 64/4 = 16 rows per wavefront, and kWarpsPerBlock = 4 gives 64 rows per block.
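The decomposition described above can be checked with a short host-side NumPy sketch (this models the math only; the actual kernel performs the cross-thread H4 stages with ds_swizzle). Applying H4 along both axes of the 4×4 tile is equivalent to multiplying the flat 16-vector by H16 = H4 ⊗ H4:

```python
import numpy as np

# 4-point Hadamard (Sylvester construction), the building block of the
# H16 = H4 (kron) H4 decomposition used by the kernel.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H4 = np.kron(H2, H2)
H16 = np.kron(H4, H4)

def wht16_butterfly(x):
    # Reshape the 16-vector into a 4x4 tile (one column per "thread" in the
    # kernel's layout) and apply H4 along both axes. For row-major vec,
    # (A kron B) @ vec(X) == vec(A @ X @ B.T), so this equals H16 @ x.
    # The 0.25 factor is kHadamardScale = 1/sqrt(16).
    X = x.reshape(4, 4)
    return 0.25 * (H4 @ X @ H4.T).reshape(16)

x = np.arange(16, dtype=np.float64)
assert np.allclose(wht16_butterfly(x), 0.25 * (H16 @ x))
```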

Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu
Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu
block_lam = fmaxf(block_lam, __shfl_xor(block_lam, off));

if (lane_id == 0)
  atomicMaxFloat(amax_out, block_lam);
Contributor

This seems correct, but from a performance perspective, did you consider a hierarchical/two-pass reduction instead of atomically combining block-local amax values into global memory? Since it is only one atomic per block, I can see the simplicity argument, but I was curious about the tradeoff.

Contributor Author

As you said, for this kernel the atomic contention should be relatively small, since there is only one atomic per block. A two-stage reduction would require a workspace allocation plus another kernel launch. We can revisit this if profiling shows it is a bottleneck.
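For readers unfamiliar with the XOR-shuffle pattern being discussed: the loop models a butterfly reduction in which, at each step, every lane exchanges its value with lane `lane ^ off` and keeps the max, so after log2(width) steps every lane holds the warp-wide amax. A small NumPy simulation of that pattern (a sketch of the idea, not the kernel code):

```python
import numpy as np

def warp_amax_butterfly(vals):
    # Simulate the __shfl_xor max-reduction loop: lane i reads lane (i ^ off),
    # keeps the max, and off halves each step. After log2(width) steps every
    # lane holds the same warp-wide amax; in the kernel, only lane 0 then
    # issues the single atomicMaxFloat into global memory.
    vals = np.abs(np.asarray(vals, dtype=np.float64))
    width = len(vals)
    off = width // 2
    while off > 0:
        partner = vals[np.arange(width) ^ off]  # models __shfl_xor(vals, off)
        vals = np.maximum(vals, partner)
        off //= 2
    return vals

lanes = np.random.default_rng(1).standard_normal(64)  # one 64-wide wavefront
out = warp_amax_butterfly(lanes)
assert np.allclose(out, np.max(np.abs(lanes)))  # all lanes agree on the amax
```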

Comment thread transformer_engine/common/hadamard_transform/hadamard_transform.cu
@matthiasdiener matthiasdiener merged commit 55c411b into dev Apr 17, 2026
3 checks passed
@matthiasdiener matthiasdiener deleted the mdiener/fp4_hadamard branch April 17, 2026 19:01

Labels

ci-level 1 CI test level 1


6 participants