
[hipblaslt] Initial 1250 Support #6165

Draft
bethune-bryant wants to merge 310 commits into develop from
users/sergelu/develop-gfx1250-open-source

Conversation


bethune-bryant (Contributor) commented Apr 3, 2026

We want to keep this history; this PR should not be squashed.

Motivation

This PR adds initial AMD GFX1250 architecture support to hipBLASLt (and hipSPARSELt). GFX1250 introduces new hardware capabilities including Tensor Data Mover (TDM) instructions, scaled WMMA/SWMMAC matrix instructions, 1024 VGPR support, and native support for low-precision data types (FP4, FP6, BF6, FP8, BF8) with Microscaling (MX) formats. This work enables hipBLASLt to leverage these new hardware features for GEMM and sparse GEMM operations on GFX1250.

Technical Details

This is a large integration (310 commits, 646 files changed) spanning TensileLite codegen, rocisa ISA abstractions, hipBLASLt library/client, and hipSPARSELt. Key changes include:

Architecture Enablement

  • Register gfx1250 as a supported architecture in hipBLASLt and hipSPARSELt build systems and CMake configurations.

TensileLite Kernel Codegen (Tensile/)

  • Major updates to KernelWriterAssembly.py and KernelWriter.py for MX global read, local write/read, MFMA/WMMA codegen, tail loop handling, and edge-case support for new data types.
  • New TensorDataMover.py component implementing TDM (TensorLoadToLds) instruction support for efficient data movement.
  • Overhauled LocalRead.py and LraTileAssignment.py for MX scale data handling, FP4 32x16 WMMA instruction support, and sparse metadata local reads.
  • Extended Solution.py and Problem.py with MX-aware solution parameters, BPE calculations, and kernel argument setup.
  • XF32/TF32 emulation support on gfx1250 WMMA V3 (non-MFMA path).

rocisa ISA Layer (rocisa/)

  • Added new instruction definitions: TDM memory instructions, scaled WMMA (v_wmma_f32_16x16x128_f8f6f4, v_swmmac_*), true16 CVT instructions (v_cvt_f16_fp8, v_cvt_pk_f16_fp8, v_cvt_f32_bf16), and extended buffer/flat memory operations.
  • 1024 VGPR support and s_set_vgpr_msb workaround for base layer.

New Data Type Headers (library/include/hipblaslt/)

  • Added host-side type definitions: hipblaslt_float4.h, hipblaslt_float6.h, hipblaslt_bfloat6.h, hipblaslt_e5m3.h, hipblaslt_e8.h.

MX (Microscaling) Format Support

  • Full code path for MX formats (MXF8, MXE8, MXF4, MXE5M3 with B16/B32 block sizes): global read of scale factors, LDS layout, local read, MFMA/WMMA integration, tail loops, and edge handling.
  • hipBLASLt client and benchmark support for MX scale types with new samples and gtest coverage.
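For readers unfamiliar with Microscaling, the core idea (per the OCP MX specification) is that a small block of elements shares one power-of-two scale carried as an E8M0-style exponent. The sketch below is a minimal, hypothetical illustration of that encode/decode round trip, not hipBLASLt's actual layout: the function names are invented, the default `elem_max=448.0` (the FP8 E4M3 maximum normal) is an illustrative assumption, the scale selection is a simplification of the OCP rule, and the per-element rounding to FP4/FP6/FP8 is omitted.

```python
import math

def mx_quantize_block(block, elem_max=448.0):
    # One shared power-of-two scale per block (E8M0-style exponent):
    # the smallest power of two such that the largest magnitude,
    # divided by the scale, fits within elem_max.
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = 2.0 ** math.ceil(math.log2(amax / elem_max))
    # Real MX would additionally round each quotient to the low-precision
    # element format (FP4/FP6/FP8); that rounding step is omitted here.
    return scale, [v / scale for v in block]

def mx_dequantize_block(scale, elems):
    # Reconstruct approximate values: element * shared scale.
    return [e * scale for e in elems]
```

Because the scale is a power of two, dequantization is exact whenever the scaled elements are representable, which is what makes per-block scales cheap for hardware.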

Sparse GEMM (hipSPARSELt)

  • Enabled gfx1250 in hipSPARSELt with SWMMAC-based sparse kernels for FP16, BF16, I8, FP8, and BF8 data types.
  • Metadata layout support, 8-bit SPMM depth-U=256 kernels, and sparse-B tail loop fixes.

Library Logic YAMLs

  • ~500 new pre-tuned library logic YAML files covering dense GEMM (HHS, BBS, F8, B8, F8B8, B8F8, I8, MX variants) and sparse GEMM configurations for gfx1250.

Bug Fixes

  • Numerous correctness fixes: LDS padding, waitcnt calculations, float-to-int BPE casts in assembly, SIA2/SIA3 scheduling, GSU swizzle, DTL FP16/FP8, SGPR global read offsets, and more.

Test Plan

  • hipblaslt-test: Updated matmul_gtest.yaml and smoke_gtest.yaml with gfx1250-specific test cases covering HHS, BBS, F8/B8 (with and without scales), FP4, FP6, BF6, I8, and MX format combinations.
  • XF32 test coverage added to hipblaslt-test and tox for gfx1250.
  • hipSPARSELt: New gtests for sparse FP16, BF16, I8, FP8, BF8 with both sparse-A and sparse-B metadata layouts.
  • TensileLite tox test YAMLs updated to include/exclude gfx1250 where appropriate (e.g., skip KRingShift, skip F4-related tests for gfx1200).
  • rocisa unit tests extended with new instruction encoding validation.

Test Result

  • hipblaslt-test and hipblaslt-bench pass on gfx1250 for supported data type and MX format configurations.
  • hipSPARSELt sparse GEMM tests pass for FP16, BF16, I8, FP8, and BF8 on gfx1250.
  • TensileLite kernel generation and host validation pass for dense and sparse kernels across all newly added logic YAMLs.

Submission Checklist

Serge45 and others added 30 commits March 26, 2026 16:09
Will add back after debugging.

Co-authored-by: menghung_amdeng <menghung@amd.com>
* Modify packing size to 16 instead of 32

* Apply missing clang-format
…S transpose read Metadata. (#112)

* Revert "Temporarily set lds size to 65536 of gfx1250 sparse test cases"

This reverts commit 66b3971.

* Handle the LDS offset of Metadata when the LDS buffer > 64K and enable LDS transpose read of Metadata.
This change enables v_cvt_f32_bf16 and the true16-syntax-related compiler option.
This patch also renames the original option Hascvtf16_fp8 to Hascvtf16_fp8_sf32,
because the old Hascvtf16_fp8 was actually for v_cvt_scalef32_pk_f16_fp8.
This patch includes the true16 modifier implementation and the related
compiler option.
This change modifies the relevant code, except in mixed mode scenarios.
This patch updates the test YAML for the following instructions:
v_cvt_f32_f16, v_cvt_f16_f32, v_cvt_pk_f16_f32, v_cvt_f32_bf16,
v_cvt_f32_fp8, v_cvt_f32_bf8, v_cvt_pk_f32_fp8, v_cvt_pk_f32_bf8,
v_cvt_pk_fp8_f32, v_cvt_pk_bf8_f32
* Fix 8-bit sparse kernels by (1) adding i32-to-fp32 conversion and (2) fixing the tail-loop dense-B cndmask offset. Additionally, update the 16-bit YAMLs.

* Fix 8-bit datatype packing issue.

* Update the sparse gfx1250 testing YAMLs. These are simplified to reduce tox testing time but were verified locally; more YAMLs will be added.

* Code refine, fix yamls and add metadatalayout test cases

* Add and revise test yamls.

* Fix an i8 conversion issue when using GSU > 1. Add a missing YAML.

* Follow-up: refine the i8 conversion fix for GSU > 1. Add a GSU2 parameter to the YAMLs.

* remove whitespace in line 370 in GlobalWriteBatch.py
Co-authored-by: George Tseng <george.tseng@amd.com>
Serge45 and others added 26 commits March 31, 2026 11:40
Change the order of enum HIPBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE8M0_EXT.
This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.
…tadata packing

In the MIInputPerThUnroll==8 packing path, PackTemp's lifetime ends
before the second group of packing operations begins, so PackTemp can
be safely reused instead of requiring PackTemp+1. This aligns the code
generation with the VGPR allocation logic which only reserves 1 VGPR
for PackTemp. Also simplify the gfx1250 sparse PackTemp allocation
condition by removing the redundant MIInputPerThreadMetadata>1 check, since gfx1250 only has MIInputPerThreadMetadata = 4 or 8.
Remove emulator parameter: ROCmAgentEnumeratorPath.

Remove env parameters in tox.ini.

Remove compile and emulator workarounds.

Set default CpuThreads back to -1.
Fix incorrect results produced by XF32 emulation kernels on gfx1250.
The original XF32 codegen was written against gfx950 MFMA assumptions
that do not hold for gfx1250 WMMA V3:
1. LDS offsets used gfx950-specific hardcoded constants (4, 12).
   gfx1250 WMMA V3 needs a *2 unroll-stride formula shared with
   BF16/Half. Branch by ISA via calcGfx1250LdsOffset().
2. Pack logic assumed vgprPerInput ≤ 8 (single 8-VGPR group).
   gfx1250 has vgprPerInput=16 (two groups), producing interleaved
   [HI_g0, LO_g0, HI_g1, LO_g1]. Add v_swap_b32 to rearrange into
   contiguous [HI_all, LO_all] expected by 3-pass WMMA.
3. WMMA src offset hardcoded "+2"/"+4" for vgprPerInput 4/8.
   Replace with dynamic vgprPerInputA // 2 to yield "+8" on gfx1250.
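The XF32 emulation in these commits rests on decomposing each FP32 operand into BF16 high/low parts and accumulating partial products across passes. Below is a minimal scalar sketch of that split-and-3-pass idea, under stated assumptions: the function names are invented, BF16 conversion is modeled as simple bit truncation, and the real kernels of course operate on packed VGPRs with WMMA instructions and their own rounding behavior.

```python
import struct

def to_bf16(x):
    # Model BF16 as an FP32 with the low 16 mantissa bits truncated.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

def split_f32(x):
    # Decompose x into a BF16 "hi" part plus a BF16 residual "lo" part.
    hi = to_bf16(x)
    lo = to_bf16(x - hi)
    return hi, lo

def emulated_mul(a, b):
    # Three partial products (the tiny lo*lo term is dropped), mirroring
    # a 3-pass hi*hi + hi*lo + lo*hi accumulation.
    ah, al = split_f32(a)
    bh, bl = split_f32(b)
    return ah * bh + ah * bl + al * bh
```

The payoff is that three BF16-precision multiplies recover far more of the FP32 product than a single truncated multiply would, which is why the pack ordering the commits fix (hi/lo groups arriving in the right VGPR layout) matters for correctness.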
…on gfx1250

SIA3 scheduler interleaved pack and MAC instructions without respecting
data dependencies in the XF32 multigroup path, causing v_swap_b32 to
corrupt F32 values mid-packing, and WMMAs to consume partially-packed
VGPRs.

- Move v_swap_b32 rearrangement from MAC code into pack code so it
  stays ordered after all TF32_1/TF32_2 packing
- Fix destVgpr aliasing for UseDirect32XEmulation local reads
- Place all XF32 pack items before the first WMMA instead of
  distributing one chunk per MFMA slot
The tail loop K-masking logic was written against gfx950 MFMA geometry
where vgprPerInput ≤ 8 and BF16 inputs are packed (2 elements/VGPR).
gfx1250 WMMA V3 has vgprPerInput=16 and XF32 reads unpacked FP32
(1 element/VGPR), breaking two assumptions:

1. T0 VGPR addressing: gfx950 bk maps 1:1 to T0 slots. gfx1250
   Direct32X allocates T0 at half capacity (8 slots for 16 elements),
   so raw bk overflows into wrong tensor's registers.
   Fix: adjustedBk = (bk // 8) * 4 + (bk % 4).
2. K-to-VGPR mapping: gfx950 packed BF16 gives contiguous {0-7, 16-23}.
   gfx1250 unpacked FP32 + numVecUnroll=2 interleaving gives
   {0-3, 8-11, 16-19, 24-27}, zeroing wrong VGPRs for K=5-11, 21-27.
   Fix: vgprPerSet0Group=1, multiplyBy /= numVecUnroll, absolute K
   offsets per group.
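The adjustedBk remap in fix 1 can be sketched as a standalone helper (hypothetical, for illustration only) to show how 16 unpacked-FP32 element indices fold onto the 8 half-capacity T0 slots:

```python
def adjusted_bk(bk):
    # Fold 16 element indices (bk = 0..15) onto 8 half-capacity T0 slots,
    # per the commit's formula; two elements end up sharing each slot.
    return (bk // 8) * 4 + (bk % 4)
```

With this mapping, bk 0..7 land on slots 0..3 and bk 8..15 on slots 4..7, so no index can overflow into the other tensor's registers the way raw bk did.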
F32X emulation pack code performs destructive in-place VGPR conversion
(FP32 → BF16 high/low), which is incompatible with ForceUnrollSubIter's
sub-tiling that splits local reads and pack code across sub-iterations.
This caused validation failures with ScheduleIterAlg=1, MIWaveTile=[4,4],
and DepthU==MatrixInstK.
Hardcoded MIInputPerThread==8 assertion in LocalRead.py caused
AssertionError on gfx1250 (MIInputPerThread=16). Parameterize
TXInterleaveLayoutIdx, dynamically generate dsReadConvTable and
convArray to support any MIInputPerThread value.
UseMFMAF32XEmulation was unconditionally enabled for all F32X kernels,
causing gfx1250 (WMMA, no MFMA) to emit invalid v_wmma_f32_4x4x4_bf16
instructions. Gate the flag behind HasMFMA so WMMA architectures fall
through to the cvt+sub path instead.
The blanket "(not UseF32XEmulation)" exemption skipped lrvwTile=1 forcing for
all XF32 paths, but only MFMA-based XF32 (gfx950) handles lrvwTile > 1
correctly. On gfx1250 WMMA, lrvwTile=2 produced incorrect local reads.

Refine the exemption: only UseMFMAF32XEmulation and CMS kernels may keep
lrvwTile > 1; non-MFMA XF32 paths are now forced to lrvwTile=1.
Problem: NT/TN format XF32 kernels produce inf/nan errors when
TF32EmuInterleaveTreg is enabled but doFullPackCodePrefetch is False
(PLR=0). The TXInterleaveLayoutIdx() function assumes the full prefetch
pack code layout, which is incompatible with the non-prefetch register layout.

Fix: Add conditional branching based on doFullPackCodePrefetch in the
TF32EmuInterleaveTreg handling. For the non-prefetch path (PLR=0),
use a simpler index calculation that maps the first half of each group
(withinGroup < 4) to T registers with a straightforward index formula:
idx = (idx // 8) * 4 + withinGroup. This matches the register layout
expected by the pack code when doFullPackCodePrefetch is False.
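The PLR=0 index formula above can likewise be sketched as a hypothetical helper. The commit only states how the first half of each 8-wide group (withinGroup < 4) maps to T registers; how the second half is handled is not stated, so this sketch simply returns None there.

```python
def non_prefetch_t_index(idx):
    # PLR=0 path: only the first half of each 8-wide group
    # (withinGroup < 4) maps to a T register, per the commit's formula
    # idx = (idx // 8) * 4 + withinGroup.
    within_group = idx % 8
    if within_group >= 4:
        return None  # second-half handling is not specified in the commit
    return (idx // 8) * 4 + within_group
```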
Problem: TT and NN format kernels with DepthU=32 failed validation for tail loop.

Root cause: In macroAndSetF32XEmuTregSingle(), the T registers (vgprValuA_T0_I0,
vgprValuB_T0_I0) were defined using symbolic references relative to
vgprValu{A/B}_X0_I0_BASE:  .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56
In the main loop, vgprValuB_X0_I0_BASE=34 gives T0=90 (correct).
In the tail loop, vgprValuB_X0_I0_BASE is redefined to 32, giving
T0=88 which overlaps with vgprValuA_T0_I0+6 (82+6=88).
This causes A's TF32 processing to corrupt B's T registers (v88-v89),
leading to incorrect WMMA results.

Fix: Use absolute startVgprCvt values instead of symbolic BASE-relative
offsets in RegSet. This ensures T register addresses remain correct
regardless of BASE redefinition in the tail loop.
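The overlap arithmetic can be checked directly. The constants below are the register numbers quoted in the commit message; the variable names are invented for this sketch.

```python
# Register numbers quoted in the commit message.
B_T_OFFSET = 56       # .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56
MAIN_LOOP_BASE = 34   # vgprValuB_X0_I0_BASE in the main loop
TAIL_LOOP_BASE = 32   # ...redefined for the tail loop
A_T0 = 82             # vgprValuA_T0_I0

main_b_t0 = MAIN_LOOP_BASE + B_T_OFFSET  # 90: clear of A's T registers
tail_b_t0 = TAIL_LOOP_BASE + B_T_OFFSET  # 88: collides with A_T0 + 6
overlap = tail_b_t0 == A_T0 + 6          # the corruption the fix removes
```

Pinning the T registers to absolute startVgprCvt values, as the fix does, makes them immune to this BASE redefinition.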
#1042)

* Fix issues in cvt, enable gradient support for gfx1250.

Cherry-picked from PR #160:
- 35e2ef29dd (Fix issues in cvt and initial support on hhs gradient)
- 80cce534da (Enable bbs gradient and postprocessing)

* Fix wave32 and FP16 gradient issues on gfx1250
- Fix BF16 NaN check in writeBiasToGlobal for wave32
- Add fallback for FP16 sum unroll when dot2 is unavailable
- Enable gfx1250 gtests for dgelu, bgrada, bgradb

---------

Co-authored-by: George Tseng <george.tseng@amd.com>
Co-authored-by: Andy Su <andysu12@amd.com>