
[hipblaslt] Initial 1250 Support #6165

Draft
bethune-bryant wants to merge 310 commits into develop from
users/sergelu/develop-gfx1250-open-source

Conversation


bethune-bryant (Contributor) commented Apr 3, 2026

We want to keep this history; this PR should not be squashed.

Motivation

This PR adds initial AMD GFX1250 architecture support to hipBLASLt (and hipSPARSELt). GFX1250 introduces new hardware capabilities including Tensor Data Mover (TDM) instructions, scaled WMMA/SWMMAC matrix instructions, 1024 VGPR support, and native support for low-precision data types (FP4, FP6, BF6, FP8, BF8) with Microscaling (MX) formats. This work enables hipBLASLt to leverage these new hardware features for GEMM and sparse GEMM operations on GFX1250.

Technical Details

This is a large integration (310 commits, 646 files changed) spanning TensileLite codegen, rocisa ISA abstractions, hipBLASLt library/client, and hipSPARSELt. Key changes include:

Architecture Enablement

  • Register gfx1250 as a supported architecture in hipBLASLt and hipSPARSELt build systems and CMake configurations.

TensileLite Kernel Codegen (Tensile/)

  • Major updates to KernelWriterAssembly.py and KernelWriter.py for MX global read, local write/read, MFMA/WMMA codegen, tail loop handling, and edge-case support for new data types.
  • New TensorDataMover.py component implementing TDM (TensorLoadToLds) instruction support for efficient data movement.
  • Overhauled LocalRead.py and LraTileAssignment.py for MX scale data handling, FP4 32x16 WMMA instruction support, and sparse metadata local reads.
  • Extended Solution.py and Problem.py with MX-aware solution parameters, BPE calculations, and kernel argument setup.
  • XF32/TF32 emulation support on gfx1250 WMMA V3 (non-MFMA path).

rocisa ISA Layer (rocisa/)

  • Added new instruction definitions: TDM memory instructions, scaled WMMA (v_wmma_f32_16x16x128_f8f6f4, v_swmmac_*), true16 CVT instructions (v_cvt_f16_fp8, v_cvt_pk_f16_fp8, v_cvt_f32_bf16), and extended buffer/flat memory operations.
  • 1024 VGPR support and s_set_vgpr_msb workaround for base layer.

New Data Type Headers (library/include/hipblaslt/)

  • Added host-side type definitions: hipblaslt_float4.h, hipblaslt_float6.h, hipblaslt_bfloat6.h, hipblaslt_e5m3.h, hipblaslt_e8.h.

MX (Microscaling) Format Support

  • Full code path for MX formats (MXF8, MXE8, MXF4, MXE5M3 with B16/B32 block sizes): global read of scale factors, LDS layout, local read, MFMA/WMMA integration, tail loops, and edge handling.
  • hipBLASLt client and benchmark support for MX scale types with new samples and gtest coverage.
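For readers unfamiliar with Microscaling, the core idea (per the OCP MX specification) is that a small block of elements shares one power-of-two scale carried as an E8M0-style exponent. The sketch below is a minimal, hypothetical illustration of that encode/decode round trip, not hipBLASLt's actual layout: the function names are invented, the default `elem_max=448.0` (the FP8 E4M3 maximum normal) is an illustrative assumption, the scale selection is a simplification of the OCP rule, and the per-element rounding to FP4/FP6/FP8 is omitted.

```python
import math

def mx_quantize_block(block, elem_max=448.0):
    # One shared power-of-two scale per block (E8M0-style exponent):
    # the smallest power of two such that the largest magnitude,
    # divided by the scale, fits within elem_max.
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = 2.0 ** math.ceil(math.log2(amax / elem_max))
    # Real MX would additionally round each quotient to the low-precision
    # element format (FP4/FP6/FP8); that rounding step is omitted here.
    return scale, [v / scale for v in block]

def mx_dequantize_block(scale, elems):
    # Reconstruct approximate values: element * shared scale.
    return [e * scale for e in elems]
```

Because the scale is a power of two, dequantization is exact whenever the scaled elements are representable, which is what makes per-block scales cheap for hardware.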

Sparse GEMM (hipSPARSELt)

  • Enabled gfx1250 in hipSPARSELt with SWMMAC-based sparse kernels for FP16, BF16, I8, FP8, and BF8 data types.
  • Metadata layout support, 8-bit SPMM depth-U=256 kernels, and sparse-B tail loop fixes.

Library Logic YAMLs

  • ~500 new pre-tuned library logic YAML files covering dense GEMM (HHS, BBS, F8, B8, F8B8, B8F8, I8, MX variants) and sparse GEMM configurations for gfx1250.

Bug Fixes

  • Numerous correctness fixes: LDS padding, waitcnt calculations, float-to-int BPE casts in assembly, SIA2/SIA3 scheduling, GSU swizzle, DTL FP16/FP8, SGPR global read offsets, and more.

Test Plan

  • hipblaslt-test: Updated matmul_gtest.yaml and smoke_gtest.yaml with gfx1250-specific test cases covering HHS, BBS, F8/B8 (with and without scales), FP4, FP6, BF6, I8, and MX format combinations.
  • XF32 test coverage added to hipblaslt-test and tox for gfx1250.
  • hipSPARSELt: New gtests for sparse FP16, BF16, I8, FP8, BF8 with both sparse-A and sparse-B metadata layouts.
  • TensileLite tox test YAMLs updated to include/exclude gfx1250 where appropriate (e.g., skip KRingShift, skip F4-related tests for gfx1200).
  • rocisa unit tests extended with new instruction encoding validation.

Test Result

  • hipblaslt-test and hipblaslt-bench pass on gfx1250 for supported data type and MX format configurations.
  • hipSPARSELt sparse GEMM tests pass for FP16, BF16, I8, FP8, and BF8 on gfx1250.
  • TensileLite kernel generation and host validation pass for dense and sparse kernels across all newly added logic YAMLs.

Submission Checklist

Serge45 and others added 30 commits March 26, 2026 16:09
Will add back after debugging.

Co-authored-by: menghung_amdeng <menghung@amd.com>
* Modify packing size to 16 instead of 32

* Apply missing clang-format
…S transpose read Metadata. (#112)

* Revert "Temporarily set lds size to 65536 of gfx1250 sparse test cases"

This reverts commit 66b3971.

* Handle the LDS offset of Metadata when the LDS buffer > 64K and enable LDS transpose read of Metadata.
This change enables v_cvt_f32_bf16 and the true16-syntax-related compiler option.
This patch also renames the original option Hascvtf16_fp8 to Hascvtf16_fp8_sf32,
because the old Hascvtf16_fp8 was actually for v_cvt_scalef32_pk_f16_fp8.
This patch includes the true16 modifier implementation and the related
compiler option.
This change modifies the relevant code, except in mixed mode scenarios.
This patch updates the test YAML for the following instructions:
v_cvt_f32_f16, v_cvt_f16_f32, v_cvt_pk_f16_f32, v_cvt_f32_bf16,
v_cvt_f32_fp8, v_cvt_f32_bf8, v_cvt_pk_f32_fp8, v_cvt_pk_f32_bf8,
v_cvt_pk_fp8_f32, v_cvt_pk_bf8_f32
* Fix 8-bit sparse kernels by (1) adding i32-to-fp32 conversion and (2) fixing the tail-loop dense-B cndmask offset. Additionally, update the 16-bit YAMLs.

* Fix 8-bit datatype packing issue.

* Update the sparse gfx1250 testing YAMLs. These are simplified to reduce tox testing time but were verified locally; more YAMLs will be added.

* Code refine, fix yamls and add metadatalayout test cases

* Add and revise test yamls.

* Fix an i8 conversion issue when using GSU > 1. Add a missing YAML.

* Follow-up: refine the i8 conversion fix for GSU > 1. Add a GSU2 parameter to the YAMLs.

* remove whitespace in line 370 in GlobalWriteBatch.py
Co-authored-by: George Tseng <george.tseng@amd.com>
Serge45 and others added 26 commits March 31, 2026 11:40
Change the order of enum HIPBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE8M0_EXT.
This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.
…tadata packing

In the MIInputPerThUnroll==8 packing path, PackTemp's lifetime ends
before the second group of packing operations begins, so PackTemp can
be safely reused instead of requiring PackTemp+1. This aligns the code
generation with the VGPR allocation logic which only reserves 1 VGPR
for PackTemp. Also simplify the gfx1250 sparse PackTemp allocation
condition by removing the redundant MIInputPerThreadMetadata>1 check, since gfx1250 only has MIInputPerThreadMetadata = 4 or 8.
Remove emulator parameter: ROCmAgentEnumeratorPath.

Remove env parameters in tox.ini.

Remove compile and emulator workarounds.

Set default CpuThreads back to -1.
Fix incorrect results produced by XF32 emulation kernels on gfx1250.
The original XF32 codegen was written against gfx950 MFMA assumptions
that do not hold for gfx1250 WMMA V3:
1. LDS offsets used gfx950-specific hardcoded constants (4, 12).
   gfx1250 WMMA V3 needs a *2 unroll-stride formula shared with
   BF16/Half. Branch by ISA via calcGfx1250LdsOffset().
2. Pack logic assumed vgprPerInput ≤ 8 (single 8-VGPR group).
   gfx1250 has vgprPerInput=16 (two groups), producing interleaved
   [HI_g0, LO_g0, HI_g1, LO_g1]. Add v_swap_b32 to rearrange into
   contiguous [HI_all, LO_all] expected by 3-pass WMMA.
3. WMMA src offset hardcoded "+2"/"+4" for vgprPerInput 4/8.
   Replace with dynamic vgprPerInputA // 2 to yield "+8" on gfx1250.
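The XF32 emulation in these commits rests on decomposing each FP32 operand into BF16 high/low parts and accumulating partial products across passes. Below is a minimal scalar sketch of that split-and-3-pass idea, under stated assumptions: the function names are invented, BF16 conversion is modeled as simple bit truncation, and the real kernels of course operate on packed VGPRs with WMMA instructions and their own rounding behavior.

```python
import struct

def to_bf16(x):
    # Model BF16 as an FP32 with the low 16 mantissa bits truncated.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

def split_f32(x):
    # Decompose x into a BF16 "hi" part plus a BF16 residual "lo" part.
    hi = to_bf16(x)
    lo = to_bf16(x - hi)
    return hi, lo

def emulated_mul(a, b):
    # Three partial products (the tiny lo*lo term is dropped), mirroring
    # a 3-pass hi*hi + hi*lo + lo*hi accumulation.
    ah, al = split_f32(a)
    bh, bl = split_f32(b)
    return ah * bh + ah * bl + al * bh
```

The payoff is that three BF16-precision multiplies recover far more of the FP32 product than a single truncated multiply would, which is why the pack ordering the commits fix (hi/lo groups arriving in the right VGPR layout) matters for correctness.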
…on gfx1250

SIA3 scheduler interleaved pack and MAC instructions without respecting
data dependencies in the XF32 multigroup path, causing v_swap_b32 to
corrupt F32 values mid-packing, and WMMAs to consume partially-packed
VGPRs.

- Move v_swap_b32 rearrangement from MAC code into pack code so it
  stays ordered after all TF32_1/TF32_2 packing
- Fix destVgpr aliasing for UseDirect32XEmulation local reads
- Place all XF32 pack items before the first WMMA instead of
  distributing one chunk per MFMA slot
The tail loop K-masking logic was written against gfx950 MFMA geometry
where vgprPerInput ≤ 8 and BF16 inputs are packed (2 elements/VGPR).
gfx1250 WMMA V3 has vgprPerInput=16 and XF32 reads unpacked FP32
(1 element/VGPR), breaking two assumptions:

1. T0 VGPR addressing: gfx950 bk maps 1:1 to T0 slots. gfx1250
   Direct32X allocates T0 at half capacity (8 slots for 16 elements),
   so raw bk overflows into wrong tensor's registers.
   Fix: adjustedBk = (bk // 8) * 4 + (bk % 4).
2. K-to-VGPR mapping: gfx950 packed BF16 gives contiguous {0-7, 16-23}.
   gfx1250 unpacked FP32 + numVecUnroll=2 interleaving gives
   {0-3, 8-11, 16-19, 24-27}, zeroing wrong VGPRs for K=5-11, 21-27.
   Fix: vgprPerSet0Group=1, multiplyBy /= numVecUnroll, absolute K
   offsets per group.
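The adjustedBk remap in fix 1 can be sketched as a standalone helper (hypothetical, for illustration only) to show how 16 unpacked-FP32 element indices fold onto the 8 half-capacity T0 slots:

```python
def adjusted_bk(bk):
    # Fold 16 element indices (bk = 0..15) onto 8 half-capacity T0 slots,
    # per the commit's formula; two elements end up sharing each slot.
    return (bk // 8) * 4 + (bk % 4)
```

With this mapping, bk 0..7 land on slots 0..3 and bk 8..15 on slots 4..7, so no index can overflow into the other tensor's registers the way raw bk did.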
F32X emulation pack code performs destructive in-place VGPR conversion
(FP32 → BF16 high/low), which is incompatible with ForceUnrollSubIter's
sub-tiling that splits local reads and pack code across sub-iterations.
This caused validation failures with ScheduleIterAlg=1, MIWaveTile=[4,4],
and DepthU==MatrixInstK.
Hardcoded MIInputPerThread==8 assertion in LocalRead.py caused
AssertionError on gfx1250 (MIInputPerThread=16). Parameterize
TXInterleaveLayoutIdx, dynamically generate dsReadConvTable and
convArray to support any MIInputPerThread value.
UseMFMAF32XEmulation was unconditionally enabled for all F32X kernels,
causing gfx1250 (WMMA, no MFMA) to emit invalid v_wmma_f32_4x4x4_bf16
instructions. Gate the flag behind HasMFMA so WMMA architectures fall
through to the cvt+sub path instead.
The blanket "(not UseF32XEmulation)" exemption skipped lrvwTile=1 forcing for
all XF32 paths, but only MFMA-based XF32 (gfx950) handles lrvwTile > 1
correctly. On gfx1250 WMMA, lrvwTile=2 produced incorrect local reads.

Refine the exemption: only UseMFMAF32XEmulation and CMS kernels may keep
lrvwTile > 1; non-MFMA XF32 paths are now forced to lrvwTile=1.
Problem: NT/TN format XF32 kernels produce inf/nan errors when
TF32EmuInterleaveTreg is enabled but doFullPackCodePrefetch is False
(PLR=0). The TXInterleaveLayoutIdx() function assumes the full prefetch
pack code layout, which is incompatible with the non-prefetch register layout.

Fix: Add conditional branching based on doFullPackCodePrefetch in the
TF32EmuInterleaveTreg handling. For the non-prefetch path (PLR=0),
use a simpler index calculation that maps the first half of each group
(withinGroup < 4) to T registers with a straightforward index formula:
idx = (idx // 8) * 4 + withinGroup. This matches the register layout
expected by the pack code when doFullPackCodePrefetch is False.
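The PLR=0 index formula above can likewise be sketched as a hypothetical helper. The commit only states how the first half of each 8-wide group (withinGroup < 4) maps to T registers; how the second half is handled is not stated, so this sketch simply returns None there.

```python
def non_prefetch_t_index(idx):
    # PLR=0 path: only the first half of each 8-wide group
    # (withinGroup < 4) maps to a T register, per the commit's formula
    # idx = (idx // 8) * 4 + withinGroup.
    within_group = idx % 8
    if within_group >= 4:
        return None  # second-half handling is not specified in the commit
    return (idx // 8) * 4 + within_group
```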
Problem: TT and NN format kernels with DepthU=32 failed validation for tail loop.

Root cause: In macroAndSetF32XEmuTregSingle(), the T registers (vgprValuA_T0_I0,
vgprValuB_T0_I0) were defined using symbolic references relative to
vgprValu{A/B}_X0_I0_BASE:  .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56
In the main loop, vgprValuB_X0_I0_BASE=34 gives T0=90 (correct).
In the tail loop, vgprValuB_X0_I0_BASE is redefined to 32, giving
T0=88 which overlaps with vgprValuA_T0_I0+6 (82+6=88).
This causes A's TF32 processing to corrupt B's T registers (v88-v89),
leading to incorrect WMMA results.

Fix: Use absolute startVgprCvt values instead of symbolic BASE-relative
offsets in RegSet. This ensures T register addresses remain correct
regardless of BASE redefinition in the tail loop.
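The overlap arithmetic can be checked directly. The constants below are the register numbers quoted in the commit message; the variable names are invented for this sketch.

```python
# Register numbers quoted in the commit message.
B_T_OFFSET = 56       # .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56
MAIN_LOOP_BASE = 34   # vgprValuB_X0_I0_BASE in the main loop
TAIL_LOOP_BASE = 32   # ...redefined for the tail loop
A_T0 = 82             # vgprValuA_T0_I0

main_b_t0 = MAIN_LOOP_BASE + B_T_OFFSET  # 90: clear of A's T registers
tail_b_t0 = TAIL_LOOP_BASE + B_T_OFFSET  # 88: collides with A_T0 + 6
overlap = tail_b_t0 == A_T0 + 6          # the corruption the fix removes
```

Pinning the T registers to absolute startVgprCvt values, as the fix does, makes them immune to this BASE redefinition.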
#1042)

* Fix issues in cvt, enable gradient support for gfx1250.

Cherry-picked from PR #160:
- 35e2ef29dd (Fix issues in cvt and initial support on hhs gradient)
- 80cce534da (Enable bbs gradient and postprocessing)

* Fix wave32 and FP16 gradient issues on gfx1250
- Fix BF16 NaN check in writeBiasToGlobal for wave32
- Add fallback for FP16 sum unroll when dot2 is unavailable
- Enable gfx1250 gtests for dgelu, bgrada, bgradb

---------

Co-authored-by: George Tseng <george.tseng@amd.com>
Co-authored-by: Andy Su <andysu12@amd.com>