Skip to content

Add immediate offset support to buffer_load and buffer_store ops#328

Open
ruanjm wants to merge 4 commits intomainfrom
jruan/ioffset_bufop
Open

Add immediate offset support to buffer_load and buffer_store ops#328
ruanjm wants to merge 4 commits intomainfrom
jruan/ioffset_bufop

Conversation

@ruanjm
Copy link
Copy Markdown
Contributor

@ruanjm ruanjm commented Apr 1, 2026

ATT. Inline assembly when ioffset isn't 0. Otherwise, use the old path.

LLVM's non-async buffer_load cannot support using 3 offsets (vgpr offset, sgpr offset and immediate/instruction offset) at the same time.

According to my experiment. LLVM only folds immediate/instruction offset when soffset is 0. Consequently, with LLVM inst, only voffset + soffset or voffset + ioffset is availble.

Copilot AI review requested due to automatic review settings April 1, 2026 08:35
@ruanjm ruanjm changed the title Jruan/ioffset bufop Add immediate offset support to buffer_load and buffer_store ops Apr 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates FlyDSL’s AMD buffer load/store wrappers to support an instruction immediate offset (ioffset) via inline asm when needed (working around LLVM folding limits), and aligns kernel/test call sites with the updated buffer op parameter names.

Changes:

  • Add ioffset support to buffer_load / buffer_store, using inline asm when ioffset != 0 and the existing ROCDL path otherwise.
  • Rename/adjust buffer op parameters (soffset_bytessoffset, offsetvoffset) and update affected kernels/tests accordingly.
  • Add internal helpers for instruction selection and cache-modifier suffixes for the inline-asm path.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
python/flydsl/expr/buffer_ops.py Adds ioffset inline-asm lowering for buffer load/store; renames params; adds instruction-selection helpers.
kernels/softmax_kernel.py Updates calls to use soffset= keyword.
kernels/rmsnorm_kernel.py Updates calls to use soffset= keyword.
kernels/layernorm_kernel.py Updates calls to use soffset= keyword.
tests/kernels/test_quant.py Updates calls to use soffset= keyword.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +374 to +378
def _get_type_bytes(mlir_type) -> int:
"""Return total byte width of an MLIR scalar or vector type."""
if hasattr(mlir_type, 'element_type'): # VectorType
return mlir_type.shape[0] * (mlir_type.element_type.width // 8)
return mlir_type.width // 8
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_type_bytes() computes vector size as shape[0] * element_width, which is incorrect for multi-dimensional vectors (it should use the product of all dimensions). This can lead to wrong instruction selection (or unexpected ValueErrors) when ioffset!=0 and data.type/result_type is a ranked vector. Consider using math.prod(mlir_type.shape) (or equivalent) instead of shape[0].

Copilot uses AI. Check for mistakes.
Comment on lines 446 to +450
# IMPORTANT: Buffer load offset is in BYTES, not elements!
# For vec4xf32, each element is 4 bytes, so multiply offset by 4
element_bytes = dtype.width // 8
bytes_const = _create_i32_constant(element_bytes)
op = std_arith.MulIOp(offset, bytes_const)
offset = _unwrap_value(op.result)
op = std_arith.MulIOp(voffset, bytes_const)
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer_load() multiplies voffset by element_bytes, so voffset is effectively an element index (in units of dtype) rather than a byte offset. This conflicts with the updated naming/docs that describe voffset as “offset in bytes” and can easily lead to incorrect addressing when callers pass byte offsets. Consider either requiring voffset in bytes (remove scaling) or clarifying the API contract (and/or adding an offset_is_bytes flag like buffer_store()).

Copilot uses AI. Check for mistakes.
Comment on lines 522 to +530
rsrc: Buffer resource descriptor (!llvm.ptr<8>)
offset: Offset in elements (i32 type)
voffset: VGPR offset in bytes or in element if offset_is_bytes is False
mask: Optional mask for predicated store (i1 type)
cache_modifier: Cache control flags (0 for default)

soffset: Optional SGPR offset in bytes added by the buffer instruction.
ioffset: 12-bit unsigned instruction immediate offset in bytes (0-4095).
When non-zero, emits inline asm to use the hardware immediate offset
field which is not exposed by the MLIR ROCDL dialect.
offset_is_bytes: If True, skip element-to-byte scaling on voffset.
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer_store() docs for voffset/offset_is_bytes are ambiguous (“offset in bytes or in element if offset_is_bytes is False”). Since the implementation treats voffset as an element index unless offset_is_bytes=True, please reword this to clearly state the default unit and how offset_is_bytes changes behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +479 to +494
if not 0 <= ioffset <= (1 << 12) - 1:
raise ValueError(f"ioffset must be a 12-bit unsigned value (0-4095), got {ioffset}")
if ioffset != 0:
total_bytes = _get_type_bytes(result_type)
load_inst = _BUFFER_LOAD_INST.get(total_bytes)
if load_inst is None:
raise ValueError(f"No buffer_load instruction for {total_bytes}-byte type")
asm = f"{load_inst} $0, $1, $2, $3 offen offset:{ioffset}{_cache_modifier_suffix(cache_modifier)}"
return llvm.InlineAsmOp(
res=result_type,
operands_=[voffset, rsrc, soffset_val],
asm_string=asm,
constraints="=v,v,s,s",
has_side_effects=True,
is_align_stack=False,
).results[0]
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new ioffset!=0 inline-asm lowering in buffer_load() is a distinct codegen path (instruction selection, constraints, error handling) but there’s no test exercising it. Please add at least one kernel/unit test that uses a non-zero ioffset and validates correctness to guard against regressions across ROCm/LLVM versions.

Copilot uses AI. Check for mistakes.
Comment on lines +586 to +602
# Emit buffer store
if not 0 <= ioffset <= (1 << 12) - 1:
raise ValueError(f"ioffset must be a 12-bit unsigned value (0-4095), got {ioffset}")
if ioffset != 0:
total_bytes = _get_type_bytes(data.type)
store_inst = _BUFFER_STORE_INST.get(total_bytes)
if store_inst is None:
raise ValueError(f"No buffer_store instruction for {total_bytes}-byte type")
asm = f"{store_inst} $0, $1, $2, $3 offen offset:{ioffset}{_cache_modifier_suffix(cache_modifier)}"
llvm.InlineAsmOp(
res=None,
operands_=[data, voffset, rsrc, soffset_val],
asm_string=asm,
constraints="v,v,s,s",
has_side_effects=True,
is_align_stack=False,
)
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new ioffset!=0 inline-asm lowering in buffer_store() is also untested. Please add coverage for non-zero ioffset stores (ideally including a masked store and at least one vector width) to ensure operand ordering/constraints remain correct.

Copilot uses AI. Check for mistakes.
@ruanjm
Copy link
Copy Markdown
Contributor Author

ruanjm commented Apr 1, 2026

Perhaps reporting it to the LLVM backend might be more reasonable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants