Draft
Conversation
Will add back after debugging. Co-authored-by: menghung_amdeng <menghung@amd.com>
* Modify packing size to 16 instead of 32 * Missing to apply clang-format
This change enable v_cvt_f32_bf16 and true16 syntax related compiler option.
This patch also rename original option: Hascvtf16_fp8 as Hascvtf16_fp8_sf32. Because the old Hascvtf16_fp8 is actually for v_cvt_scalef32_pk_f16_fp8.
This patch includes true16 modifier implementation and related compiler option.
This change modifies the relevant code, except in mixed mode scenarios.
This patch update test YAML for the following instructions: v_cvt_f32_f16, v_cvt_f16_f32, v_cvt_pk_f16_f32, v_cvt_f32_bf16, v_cvt_f32_fp8, v_cvt_f32_bf8, v_cvt_pk_f32_fp8, v_cvt_pk_f32_bf8, v_cvt_pk_fp8_f32, v_cvt_pk_bf8_f32
* Fix 8-bit sparse kernels by: 1. Adding i32-fp32 conversion and 2. fixing tail-loop dense b cndmask offset. Additionally update 16-bit yamls * Fix 8-bit datatype packing issue. * 1. Update sparse gfx1250 testing yamls. These are simplified to reduce tox testing time, but tested on local end. 2. More yamls will be added. * Code refine, fix yamls and add metadatalayout test cases * Add and revise test yamls. * Fix i8 conversion issue when using GSU > 1. Add a missing yaml. * Follow up - Refine i8 conversion issue when using GSU > 1. Add GSU2 parameter in yamls. * remove whitespace in line 370 in GlobalWriteBatch.py
Co-authored-by: George Tseng <george.tseng@amd.com>
Change the order of enum HIPBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE8M0_EXT.
This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.
…tadata packing In the MIInputPerThUnroll==8 packing path, PackTemp's lifetime ends before the second group of packing operations begins, so PackTemp can be safely reused instead of requiring PackTemp+1. This aligns the code generation with the VGPR allocation logic which only reserves 1 VGPR for PackTemp. Also simplify the gfx1250 sparse PackTemp allocation condition by removing the redundant MIInputPerThreadMetadata>1 check, since gfx1250 only has MIInputPerThreadMetadata = 4 or 8.
Remove emulator parameter: ROCmAgentEnumeratorPath. Remove env parameters in tox.ini. Remove compile and emulator workarounds. Set default CpuThreads back to -1.
Fix incorrect results produced by XF32 emulation kernels on gfx1250. The original XF32 codegen was written against gfx950 MFMA assumptions that do not hold for gfx1250 WMMA V3: 1. LDS offsets used gfx950-specific hardcoded constants (4, 12). gfx1250 WMMA V3 needs a *2 unroll-stride formula shared with BF16/Half. Branch by ISA via calcGfx1250LdsOffset(). 2. Pack logic assumed vgprPerInput ≤ 8 (single 8-VGPR group). gfx1250 has vgprPerInput=16 (two groups), producing interleaved [HI_g0, LO_g0, HI_g1, LO_g1]. Add v_swap_b32 to rearrange into contiguous [HI_all, LO_all] expected by 3-pass WMMA. 3. WMMA src offset hardcoded "+2"/"+4" for vgprPerInput 4/8. Replace with dynamic vgprPerInputA // 2 to yield "+8" on gfx1250.
…on gfx1250 SIA3 scheduler interleaved pack and MAC instructions without respecting data dependencies in the XF32 multigroup path, causing v_swap_b32 to corrupt F32 values mid-packing, and WMMAs to consume partially-packed VGPRs. - Move v_swap_b32 rearrangement from MAC code into pack code so it stays ordered after all TF32_1/TF32_2 packing - Fix destVgpr aliasing for UseDirect32XEmulation local reads - Place all XF32 pack items before the first WMMA instead of distributing one chunk per MFMA slot
The tail loop K-masking logic was written against gfx950 MFMA geometry
where vgprPerInput ≤ 8 and BF16 inputs are packed (2 elements/VGPR).
gfx1250 WMMA V3 has vgprPerInput=16 and XF32 reads unpacked FP32
(1 element/VGPR), breaking two assumptions:
1. T0 VGPR addressing: gfx950 bk maps 1:1 to T0 slots. gfx1250
Direct32X allocates T0 at half capacity (8 slots for 16 elements),
so raw bk overflows into wrong tensor's registers.
Fix: adjustedBk = (bk // 8) * 4 + (bk % 4).
2. K-to-VGPR mapping: gfx950 packed BF16 gives contiguous {0-7, 16-23}.
gfx1250 unpacked FP32 + numVecUnroll=2 interleaving gives
{0-3, 8-11, 16-19, 24-27}, zeroing wrong VGPRs for K=5-11, 21-27.
Fix: vgprPerSet0Group=1, multiplyBy /= numVecUnroll, absolute K
offsets per group.
F32X emulation pack code performs destructive in-place VGPR conversion (FP32 → BF16 high/low), which is incompatible with ForceUnrollSubIter's sub-tiling that splits local reads and pack code across sub-iterations. This caused validation failures with ScheduleIterAlg=1, MIWaveTile=[4,4], and DepthU==MatrixInstK.
Hardcoded MIInputPerThread==8 assertion in LocalRead.py caused AssertionError on gfx1250 (MIInputPerThread=16). Parameterize TXInterleaveLayoutIdx, dynamically generate dsReadConvTable and convArray to support any MIInputPerThread value.
UseMFMAF32XEmulation was unconditionally enabled for all F32X kernels, causing gfx1250 (WMMA, no MFMA) to emit invalid v_wmma_f32_4x4x4_bf16 instructions. Gate the flag behind HasMFMA so WMMA architectures fall through to the cvt+sub path instead.
The blanket "(not UseF32XEmulation)" exemption skipped lrvwTile=1 forcing for all XF32 paths, but only MFMA-based XF32 (gfx950) handles lrvwTile > 1 correctly. On gfx1250 WMMA, lrvwTile=2 produced incorrect local reads. Refine the exemption: only UseMFMAF32XEmulation and CMS kernels may keep lrvwTile > 1; non-MFMA XF32 paths are now forced to lrvwTile=1.
Problem: NT/TN format XF32 kernels produce inf/nan errors when TF32EmuInterleaveTreg is enabled but doFullPackCodePrefetch is False (PLR=0). The TXInterleaveLayoutIdx() function assumes the full prefetch pack code layout, which is incompatible with the non-prefetch register layout. Fix: Add conditional branching based on doFullPackCodePrefetch in the TF32EmuInterleaveTreg handling. For the non-prefetch path (PLR=0), use a simpler index calculation that maps the first half of each group (withinGroup < 4) to T registers with a straightforward index formula: idx = (idx // 8) * 4 + withinGroup. This matches the register layout expected by the pack code when doFullPackCodePrefetch is False.
Problem: TT and NN format kernels with DepthU=32 failed validation for tail loop.
Root cause: In macroAndSetF32XEmuTregSingle(), the T registers (vgprValuA_T0_I0,
vgprValuB_T0_I0) were defined using symbolic references relative to
vgprValu{A/B}_X0_I0_BASE: .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56
In the main loop, vgprValuB_X0_I0_BASE=34 gives T0=90 (correct).
In the tail loop, vgprValuB_X0_I0_BASE is redefined to 32, giving
T0=88 which overlaps with vgprValuA_T0_I0+6 (82+6=88).
This causes A's TF32 processing to corrupt B's T registers (v88-v89),
leading to incorrect WMMA results.
Fix: Use absolute startVgprCvt values instead of symbolic BASE-relative
offsets in RegSet. This ensures T register addresses remain correct
regardless of BASE redefinition in the tail loop.
#1042) * Fix issues in cvt, enable gradient support for gfx1250. Cherry-picked from PR #160: - 35e2ef29dd (Fix issues in cvt and initial support on hhs gradient) - 80cce534da (Enable bbs gradient and postprocessing) * Fix wave32 and FP16 gradient issues on gfx1250 - Fix BF16 NaN check in writeBiasToGlobal for wave32 - Add fallback for FP16 sum unroll when dot2 is unavailable - Enable gfx1250 gtests for dgelu, bgrada, bgradb --------- Co-authored-by: George Tseng <george.tseng@amd.com> Co-authored-by: Andy Su <andysu12@amd.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
This PR adds initial AMD GFX1250 architecture support to hipBLASLt (and hipSPARSELt). GFX1250 introduces new hardware capabilities including Tensor Data Mover (TDM) instructions, scaled WMMA/SWMMAC matrix instructions, 1024 VGPR support, and native support for low-precision data types (FP4, FP6, BF6, FP8, BF8) with Microscaling (MX) formats. This work enables hipBLASLt to leverage these new hardware features for GEMM and sparse GEMM operations on GFX1250.
Technical Details
This is a large integration (310 commits, 646 files changed) spanning TensileLite codegen, rocisa ISA abstractions, hipBLASLt library/client, and hipSPARSELt. Key changes include:
Architecture Enablement
TensileLite Kernel Codegen (Tensile/)
KernelWriterAssembly.pyandKernelWriter.pyfor MX global read, local write/read, MFMA/WMMA codegen, tail loop handling, and edge-case support for new data types.TensorDataMover.pycomponent implementing TDM (TensorLoadToLds) instruction support for efficient data movement.LocalRead.pyandLraTileAssignment.pyfor MX scale data handling, FP4 32x16 WMMA instruction support, and sparse metadata local reads.Solution.pyandProblem.pywith MX-aware solution parameters, BPE calculations, and kernel argument setup.rocisa ISA Layer (rocisa/)
v_wmma_f32_16x16x128_f8f6f4,v_swmmac_*), true16 CVT instructions (v_cvt_f16_fp8,v_cvt_pk_f16_fp8,v_cvt_f32_bf16), and extended buffer/flat memory operations.s_set_vgpr_msbworkaround for base layer.New Data Type Headers (library/include/hipblaslt/)
hipblaslt_float4.h,hipblaslt_float6.h,hipblaslt_bfloat6.h,hipblaslt_e5m3.h,hipblaslt_e8.h.MX (Microscaling) Format Support
Sparse GEMM (hipSPARSELt)
Library Logic YAMLs
Bug Fixes
Test Plan
matmul_gtest.yamlandsmoke_gtest.yamlwith gfx1250-specific test cases covering HHS, BBS, F8/B8 (with and without scales), FP4, FP6, BF6, I8, and MX format combinations.Test Result
Submission Checklist