Add immediate offset support to buffer_load and buffer_store ops#328

Open

ruanjm wants to merge 4 commits intomainfrom

jruan/ioffset_bufop

Contributor

ruanjm commented Apr 1, 2026

ATT. Inline assembly when ioffset isn't 0. Otherwise, use the old path.

LLVM's non-async buffer_load cannot support using 3 offsets (vgpr offset, sgpr offset and immediate/instruction offset) at the same time.

According to my experiment. LLVM only folds immediate/instruction offset when soffset is 0. Consequently, with LLVM inst, only voffset + soffset or voffset + ioffset is availble.

ruanjm added 4 commits

April 1, 2026 06:32


          Add immediate offset support to buffer_load and buffer_store ops

05fd7f5


          refine interface

742c1ef


          Update method used to choose instruction

162aa79


          add support to cache modifier

953ce90

Copilot AI review requested due to automatic review settings

April 1, 2026 08:35

ruanjm changed the title ~~Jruan/ioffset bufop~~ Add immediate offset support to buffer_load and buffer_store ops

Copilot AI reviewed

View reviewed changes

Contributor

Copilot AI left a comment

Pull request overview

This PR updates FlyDSL’s AMD buffer load/store wrappers to support an instruction immediate offset (ioffset) via inline asm when needed (working around LLVM folding limits), and aligns kernel/test call sites with the updated buffer op parameter names.

Changes:

Add ioffset support to buffer_load / buffer_store, using inline asm when ioffset != 0 and the existing ROCDL path otherwise.
Rename/adjust buffer op parameters (soffset_bytes → soffset, offset → voffset) and update affected kernels/tests accordingly.
Add internal helpers for instruction selection and cache-modifier suffixes for the inline-asm path.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`python/flydsl/expr/buffer_ops.py`	Adds `ioffset` inline-asm lowering for buffer load/store; renames params; adds instruction-selection helpers.
`kernels/softmax_kernel.py`	Updates calls to use `soffset=` keyword.
`kernels/rmsnorm_kernel.py`	Updates calls to use `soffset=` keyword.
`kernels/layernorm_kernel.py`	Updates calls to use `soffset=` keyword.
`tests/kernels/test_quant.py`	Updates calls to use `soffset=` keyword.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

python/flydsl/expr/buffer_ops.py

Comment on lines +374 to +378

+              def _get_type_bytes(mlir_type) -> int:
+                  """Return total byte width of an MLIR scalar or vector type."""
+                  if hasattr(mlir_type, 'element_type'):  # VectorType
+                      return mlir_type.shape[0] * (mlir_type.element_type.width // 8)
+                  return mlir_type.width // 8

Copilot AI Apr 1, 2026

_get_type_bytes() computes vector size as shape[0] * element_width, which is incorrect for multi-dimensional vectors (it should use the product of all dimensions). This can lead to wrong instruction selection (or unexpected ValueErrors) when ioffset!=0 and data.type/result_type is a ranked vector. Consider using math.prod(mlir_type.shape) (or equivalent) instead of shape[0].

Copilot uses AI. Check for mistakes.

python/flydsl/expr/buffer_ops.py

Comment on lines 446 to +450

    
                  # IMPORTANT: Buffer load offset is in BYTES, not elements!

                  # For vec4xf32, each element is 4 bytes, so multiply offset by 4

                  element_bytes = dtype.width // 8

                  bytes_const = _create_i32_constant(element_bytes)

                  op = std_arith.MulIOp(offset, bytes_const)

                  offset = _unwrap_value(op.result)

                  op = std_arith.MulIOp(voffset, bytes_const)

Copilot AI Apr 1, 2026

buffer_load() multiplies voffset by element_bytes, so voffset is effectively an element index (in units of dtype) rather than a byte offset. This conflicts with the updated naming/docs that describe voffset as “offset in bytes” and can easily lead to incorrect addressing when callers pass byte offsets. Consider either requiring voffset in bytes (remove scaling) or clarifying the API contract (and/or adding an offset_is_bytes flag like buffer_store()).

Copilot uses AI. Check for mistakes.

python/flydsl/expr/buffer_ops.py

Comment on lines 522 to +530

    
                      rsrc: Buffer resource descriptor (!llvm.ptr<8>)

                      offset: Offset in elements (i32 type)

                      voffset: VGPR offset in bytes or in element if offset_is_bytes is False

                      mask: Optional mask for predicated store (i1 type)

                      cache_modifier: Cache control flags (0 for default)

                      soffset: Optional SGPR offset in bytes added by the buffer instruction.

                      ioffset: 12-bit unsigned instruction immediate offset in bytes (0-4095).

                               When non-zero, emits inline asm to use the hardware immediate offset

                               field which is not exposed by the MLIR ROCDL dialect.

                      offset_is_bytes: If True, skip element-to-byte scaling on voffset.

Copilot AI Apr 1, 2026

buffer_store() docs for voffset/offset_is_bytes are ambiguous (“offset in bytes or in element if offset_is_bytes is False”). Since the implementation treats voffset as an element index unless offset_is_bytes=True, please reword this to clearly state the default unit and how offset_is_bytes changes behavior.

Copilot uses AI. Check for mistakes.

python/flydsl/expr/buffer_ops.py

Comment on lines +479 to +494

+                  if not 0 <= ioffset <= (1 << 12) - 1:
+                      raise ValueError(f"ioffset must be a 12-bit unsigned value (0-4095), got {ioffset}")
+                  if ioffset != 0:
+                      total_bytes = _get_type_bytes(result_type)
+                      load_inst = _BUFFER_LOAD_INST.get(total_bytes)
+                      if load_inst is None:
+                          raise ValueError(f"No buffer_load instruction for {total_bytes}-byte type")
+                      asm = f"{load_inst} $0, $1, $2, $3 offen offset:{ioffset}{_cache_modifier_suffix(cache_modifier)}"
+                      return llvm.InlineAsmOp(
+                          res=result_type,
+                          operands_=[voffset, rsrc, soffset_val],
+                          asm_string=asm,
+                          constraints="=v,v,s,s",
+                          has_side_effects=True,
+                          is_align_stack=False,
+                      ).results[0]

Copilot AI Apr 1, 2026

The new ioffset!=0 inline-asm lowering in buffer_load() is a distinct codegen path (instruction selection, constraints, error handling) but there’s no test exercising it. Please add at least one kernel/unit test that uses a non-zero ioffset and validates correctness to guard against regressions across ROCm/LLVM versions.

Copilot uses AI. Check for mistakes.

python/flydsl/expr/buffer_ops.py

Comment on lines +586 to +602

+                  # Emit buffer store
+                  if not 0 <= ioffset <= (1 << 12) - 1:
+                      raise ValueError(f"ioffset must be a 12-bit unsigned value (0-4095), got {ioffset}")
+                  if ioffset != 0:
+                      total_bytes = _get_type_bytes(data.type)
+                      store_inst = _BUFFER_STORE_INST.get(total_bytes)
+                      if store_inst is None:
+                          raise ValueError(f"No buffer_store instruction for {total_bytes}-byte type")
+                      asm = f"{store_inst} $0, $1, $2, $3 offen offset:{ioffset}{_cache_modifier_suffix(cache_modifier)}"
+                      llvm.InlineAsmOp(
+                          res=None,
+                          operands_=[data, voffset, rsrc, soffset_val],
+                          asm_string=asm,
+                          constraints="v,v,s,s",
+                          has_side_effects=True,
+                          is_align_stack=False,
+                      )

Copilot AI Apr 1, 2026

The new ioffset!=0 inline-asm lowering in buffer_store() is also untested. Please add coverage for non-zero ioffset stores (ideally including a masked store and at least one vector width) to ensure operand ordering/constraints remain correct.

Copilot uses AI. Check for mistakes.

Contributor Author

ruanjm commented Apr 1, 2026

Perhaps reporting it to the LLVM backend might be more reasonable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet