Add immediate offset support to buffer_load and buffer_store ops#328
Pull request overview
This PR updates FlyDSL’s AMD buffer load/store wrappers to support an instruction immediate offset (ioffset) via inline asm when needed (working around LLVM folding limits), and aligns kernel/test call sites with the updated buffer op parameter names.
Changes:
- Add `ioffset` support to `buffer_load`/`buffer_store`, using inline asm when `ioffset != 0` and the existing ROCDL path otherwise.
- Rename/adjust buffer op parameters (`soffset_bytes` → `soffset`, `offset` → `voffset`) and update affected kernels/tests accordingly.
- Add internal helpers for instruction selection and cache-modifier suffixes for the inline-asm path.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| python/flydsl/expr/buffer_ops.py | Adds ioffset inline-asm lowering for buffer load/store; renames params; adds instruction-selection helpers. |
| kernels/softmax_kernel.py | Updates calls to use soffset= keyword. |
| kernels/rmsnorm_kernel.py | Updates calls to use soffset= keyword. |
| kernels/layernorm_kernel.py | Updates calls to use soffset= keyword. |
| tests/kernels/test_quant.py | Updates calls to use soffset= keyword. |
def _get_type_bytes(mlir_type) -> int:
    """Return total byte width of an MLIR scalar or vector type."""
    if hasattr(mlir_type, 'element_type'):  # VectorType
        return mlir_type.shape[0] * (mlir_type.element_type.width // 8)
    return mlir_type.width // 8

_get_type_bytes() computes vector size as shape[0] * element_width, which is incorrect for multi-dimensional vectors (it should use the product of all dimensions). This can lead to wrong instruction selection (or unexpected ValueErrors) when ioffset!=0 and data.type/result_type is a ranked vector. Consider using math.prod(mlir_type.shape) (or equivalent) instead of shape[0].
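The suggested fix can be sketched without an MLIR installation. In the snippet below, `FakeScalar` and `FakeVector` are hypothetical stand-ins that model only the attributes the helper reads (`width`, `shape`, `element_type`); they are not part of FlyDSL or the MLIR Python bindings.

```python
import math
from dataclasses import dataclass


@dataclass
class FakeScalar:
    """Hypothetical stand-in for an MLIR scalar type; only `width` (bits) is modeled."""
    width: int


@dataclass
class FakeVector:
    """Hypothetical stand-in for an MLIR VectorType; models `shape` and `element_type`."""
    shape: tuple
    element_type: FakeScalar


def get_type_bytes(mlir_type) -> int:
    """Return the total byte width of a scalar or (possibly multi-dimensional) vector type."""
    if hasattr(mlir_type, "element_type"):  # vector-like type
        # Multiply over every dimension, not just shape[0].
        return math.prod(mlir_type.shape) * (mlir_type.element_type.width // 8)
    return mlir_type.width // 8


f32 = FakeScalar(32)
print(get_type_bytes(f32))                      # 4
print(get_type_bytes(FakeVector((4,), f32)))    # 16
print(get_type_bytes(FakeVector((2, 4), f32)))  # 32; using shape[0] alone would give 8
```

For one-dimensional vectors the result is unchanged, so the existing call sites would presumably not be affected.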
# IMPORTANT: Buffer load offset is in BYTES, not elements!
# For vec4xf32, each element is 4 bytes, so multiply offset by 4
element_bytes = dtype.width // 8
bytes_const = _create_i32_constant(element_bytes)
op = std_arith.MulIOp(offset, bytes_const)
offset = _unwrap_value(op.result)
op = std_arith.MulIOp(voffset, bytes_const)

buffer_load() multiplies voffset by element_bytes, so voffset is effectively an element index (in units of dtype) rather than a byte offset. This conflicts with the updated naming/docs that describe voffset as “offset in bytes” and can easily lead to incorrect addressing when callers pass byte offsets. Consider either requiring voffset in bytes (remove scaling) or clarifying the API contract (and/or adding an offset_is_bytes flag like buffer_store()).
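To make the unit question concrete, here is a minimal pure-Python model of the offset arithmetic. `effective_byte_offset` is a hypothetical helper written for this comment, not a FlyDSL function, and it assumes a raw buffer access where the three offsets simply add (stride and swizzling are ignored).

```python
def effective_byte_offset(voffset, element_bytes, soffset=0, ioffset=0,
                          offset_is_bytes=False):
    """Model the byte offset added to the buffer base address.

    Assumption: raw buffer addressing, so voffset, soffset, and ioffset add up.
    """
    if not 0 <= ioffset <= 4095:
        raise ValueError(f"ioffset must be a 12-bit unsigned value (0-4095), got {ioffset}")
    # Default contract: voffset is an element index and is scaled to bytes.
    voffset_bytes = voffset if offset_is_bytes else voffset * element_bytes
    return voffset_bytes + soffset + ioffset


# f32 elements (4 bytes each)
print(effective_byte_offset(8, 4))                        # 32: voffset read as element index
print(effective_byte_offset(8, 4, offset_is_bytes=True))  # 8: voffset read as byte offset
print(effective_byte_offset(8, 4, soffset=256, ioffset=16))  # 304
```

A caller who passes a byte offset under the element-index contract lands `element_bytes` times too far into the buffer, which is the mismatch described above.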
rsrc: Buffer resource descriptor (!llvm.ptr<8>)
offset: Offset in elements (i32 type)
voffset: VGPR offset in bytes or in element if offset_is_bytes is False
mask: Optional mask for predicated store (i1 type)
cache_modifier: Cache control flags (0 for default)
soffset: Optional SGPR offset in bytes added by the buffer instruction.
ioffset: 12-bit unsigned instruction immediate offset in bytes (0-4095).
    When non-zero, emits inline asm to use the hardware immediate offset
    field which is not exposed by the MLIR ROCDL dialect.
offset_is_bytes: If True, skip element-to-byte scaling on voffset.

buffer_store() docs for voffset/offset_is_bytes are ambiguous (“offset in bytes or in element if offset_is_bytes is False”). Since the implementation treats voffset as an element index unless offset_is_bytes=True, please reword this to clearly state the default unit and how offset_is_bytes changes behavior.
if not 0 <= ioffset <= (1 << 12) - 1:
    raise ValueError(f"ioffset must be a 12-bit unsigned value (0-4095), got {ioffset}")
if ioffset != 0:
    total_bytes = _get_type_bytes(result_type)
    load_inst = _BUFFER_LOAD_INST.get(total_bytes)
    if load_inst is None:
        raise ValueError(f"No buffer_load instruction for {total_bytes}-byte type")
    asm = f"{load_inst} $0, $1, $2, $3 offen offset:{ioffset}{_cache_modifier_suffix(cache_modifier)}"
    return llvm.InlineAsmOp(
        res=result_type,
        operands_=[voffset, rsrc, soffset_val],
        asm_string=asm,
        constraints="=v,v,s,s",
        has_side_effects=True,
        is_align_stack=False,
    ).results[0]

The new ioffset!=0 inline-asm lowering in buffer_load() is a distinct codegen path (instruction selection, constraints, error handling) but there’s no test exercising it. Please add at least one kernel/unit test that uses a non-zero ioffset and validates correctness to guard against regressions across ROCm/LLVM versions.
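Part of this path can be covered without a GPU by testing the string construction and range check in isolation. The sketch below mirrors the logic in the diff, but `build_load_asm` and the contents of `BUFFER_LOAD_INST` are assumptions: the PR's actual `_BUFFER_LOAD_INST` table and `_cache_modifier_suffix` helper are not shown in this review, so the table here lists only a few standard AMD mnemonics and the suffix is passed in as a plain string.

```python
# Assumed subset of the size-to-mnemonic table; the PR's real table is not shown here.
BUFFER_LOAD_INST = {
    4: "buffer_load_dword",
    8: "buffer_load_dwordx2",
    16: "buffer_load_dwordx4",
}


def build_load_asm(total_bytes: int, ioffset: int, cache_suffix: str = "") -> str:
    """Build the inline-asm string for a buffer load with a non-zero immediate offset."""
    if not 0 <= ioffset <= (1 << 12) - 1:
        raise ValueError(f"ioffset must be a 12-bit unsigned value (0-4095), got {ioffset}")
    inst = BUFFER_LOAD_INST.get(total_bytes)
    if inst is None:
        raise ValueError(f"No buffer_load instruction for {total_bytes}-byte type")
    return f"{inst} $0, $1, $2, $3 offen offset:{ioffset}{cache_suffix}"


print(build_load_asm(16, 64))
# buffer_load_dwordx4 $0, $1, $2, $3 offen offset:64
```

Checks like these only guard the Python side. A kernel test that compares loaded values against a reference would still be needed to validate the generated code on real hardware.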
# Emit buffer store
if not 0 <= ioffset <= (1 << 12) - 1:
    raise ValueError(f"ioffset must be a 12-bit unsigned value (0-4095), got {ioffset}")
if ioffset != 0:
    total_bytes = _get_type_bytes(data.type)
    store_inst = _BUFFER_STORE_INST.get(total_bytes)
    if store_inst is None:
        raise ValueError(f"No buffer_store instruction for {total_bytes}-byte type")
    asm = f"{store_inst} $0, $1, $2, $3 offen offset:{ioffset}{_cache_modifier_suffix(cache_modifier)}"
    llvm.InlineAsmOp(
        res=None,
        operands_=[data, voffset, rsrc, soffset_val],
        asm_string=asm,
        constraints="v,v,s,s",
        has_side_effects=True,
        is_align_stack=False,
    )

The new ioffset!=0 inline-asm lowering in buffer_store() is also untested. Please add coverage for non-zero ioffset stores (ideally including a masked store and at least one vector width) to ensure operand ordering/constraints remain correct.
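One cheap guard for operand ordering is a consistency check between the asm template, the constraint string, and the operand list. `check_inline_asm` below is a hypothetical helper, not an LLVM or FlyDSL API; it only assumes the usual LLVM inline-asm convention that outputs (constraints starting with `=`) are numbered before inputs.

```python
import re


def check_inline_asm(asm_string: str, constraints: str, num_operands: int) -> bool:
    """Return True if placeholders, constraints, and operands agree in count."""
    parts = constraints.split(",")
    num_outputs = sum(1 for c in parts if c.startswith("="))
    expected = num_outputs + num_operands
    # Collect the $N placeholder indices used in the template.
    placeholders = {int(n) for n in re.findall(r"\$(\d+)", asm_string)}
    return len(parts) == expected and placeholders == set(range(expected))


store_asm = "buffer_store_dword $0, $1, $2, $3 offen offset:16"
load_asm = "buffer_load_dword $0, $1, $2, $3 offen offset:16"

print(check_inline_asm(store_asm, "v,v,s,s", 4))  # True: data, voffset, rsrc, soffset
print(check_inline_asm(load_asm, "=v,v,s,s", 3))  # True: one output plus three inputs
print(check_inline_asm(store_asm, "v,v,s,s", 3))  # False: an operand is missing
```

This catches count mismatches only. Whether each operand sits in the right register class still has to be confirmed by a kernel test with a non-zero `ioffset`.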
Perhaps reporting it to the LLVM backend might be more reasonable.
ATT. Inline assembly is used when `ioffset` isn't 0. Otherwise, the old path is used.
LLVM's non-async `buffer_load` cannot use all 3 offsets (VGPR offset, SGPR offset, and immediate/instruction offset) at the same time. According to my experiments, LLVM only folds the immediate/instruction offset when soffset is 0. Consequently, with the LLVM instruction, only `voffset + soffset` or `voffset + ioffset` is available.