
Conversation

@adedespirlet
Contributor

This change sets the default of the minimize_shared_allocs pass to False.
Currently, this pass reduces peak LDS memory usage by combining distinct memref.alloc operations into a single allocation with views. However, this prevents the backend from inferring aliasing information, so it conservatively inserts unnecessary wait-count instructions (vmcnt(0)), which break software pipelining.
While this optimization is beneficial for kernels with disjoint buffer lifetimes (such as extend attention), it causes significant performance regressions in standard pipelined kernels.
Let's disable it by default and enable it explicitly only for the specific cases where it is required.
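
For kernels that still need the LDS savings, the intent is to opt back in explicitly. Below is a minimal, self-contained sketch of that pattern; PipelineOptions is an illustrative stand-in rather than the project's actual options class, and only the field name mirrors the pass name.

# Sketch only: PipelineOptions stands in for whatever options object the
# compiler actually exposes; the field name mirrors the pass name.
from dataclasses import dataclass

@dataclass
class PipelineOptions:
    # Off by default after this change: folding shared allocations into one
    # buffer hides aliasing from the backend, which then inserts conservative
    # vmcnt(0) waits and breaks software pipelining.
    minimize_shared_allocs: bool = False

# Standard pipelined kernels keep the default.
gemm_options = PipelineOptions()

# Kernels with disjoint buffer lifetimes (e.g. extend attention) opt in to
# trade pipelining for a smaller peak LDS footprint.
extend_attention_options = PipelineOptions(minimize_shared_allocs=True)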

@adedespirlet force-pushed the disable_minimize_shared_allocs branch from 7666153 to e3a0a76 on December 15, 2025 at 12:51.
@ftynse
Contributor

ftynse commented Dec 15, 2025

I suspect tests need to be updated, and we also want to keep it on in the options for extend attention and the like.

We can also think about a heuristic approach where we compute the total LDS footprint required by the kernel and dynamically turn this pass on if it exceeds the available amount. But that requires some target information to be available, and it may be better done as an MLIR-based pass where arch information is easier to come by, so let's not block on this.
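
For concreteness, here is a rough sketch of that heuristic, assuming the per-buffer sizes and the target's LDS capacity are known up front. The helper name and the 64 KiB default are illustrative; a real implementation would query the target architecture instead.

def should_minimize_shared_allocs(alloc_sizes_bytes: list[int],
                                  lds_capacity_bytes: int = 64 * 1024) -> bool:
    # Enable the pass only when keeping every shared buffer live at the same
    # time would not fit in LDS; otherwise leave it off so the backend keeps
    # precise aliasing information and software pipelining intact.
    return sum(alloc_sizes_bytes) > lds_capacity_bytes

# Three 16 KiB tiles fit side by side in 64 KiB, so the pass stays off.
assert not should_minimize_shared_allocs([16 * 1024] * 3)

# Five 16 KiB tiles do not fit, so reusing memory via views is worth the
# extra vmcnt(0) waits.
assert should_minimize_shared_allocs([16 * 1024] * 5)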

Signed-off-by: Tim Gymnich <[email protected]>
Comment on lines -124 to +126
# CHECK-COUNT-1: memref.alloc
# CHECK: scf.for %[[ARG3:.*]] = %[[C0]] to %[[C10]] step %[[C1]] {
# CHECK: scf.for %arg4 = %[[C0]] to %[[C4]] step %[[C1]]
# CHECK: memref.alloc

potential regression

Comment on lines +217 to +218
# CHECK: memref.alloc
# CHECK: scf.for %[[ARG5:.*]] = %[[C0]] to %[[C4]] step %[[C1]] iter_args(%[[ARG6:.*]] = %[[CST:.*]]) -> (vector<4xf32>) {

potential regression

Comment on lines -309 to -332
# CHECK-DAG: %[[C1:.*]] = arith.constant 1 : index
# CHECK-DAG: %[[C4:.*]] = arith.constant 4 : index
# CHECK-DAG: %[[C0:.*]] = arith.constant 0 : index
# CHECK-DAG: %[[C10:.*]] = arith.constant 10 : index
# CHECK-DAG: %[[CST_0:.*]] = arith.constant dense<0.000000e+00> : vector<4xf32>
# CHECK-COUNT-1: memref.alloc
# CHECK: %[[CAST_INIT_B:.*]] = arith.index_cast
# CHECK: %[[WHILE:.*]] = scf.while (%[[ACC:.*]] = %[[CST_0]], %[[B:.*]] = %[[CAST_INIT_B]]) : (vector<4xf32>, index) -> (vector<4xf32>, index) {
# CHECK: %[[COND:.*]] = arith.cmpi slt, %[[B]], %[[C10]] : index
# CHECK: scf.condition(%[[COND]]) %[[ACC]], %[[B]] : vector<4xf32>, index
# CHECK: } do {
# CHECK: ^bb0(%[[ACC:.*]]: vector<4xf32>, %[[B:.*]]: index):
# CHECK: %[[FOR:.*]] = scf.for %[[ARG7:.*]] = %[[C0]] to %[[C4]] step %[[C1]] iter_args(%[[ARG8:.*]] = %[[CST_0]]) -> (vector<4xf32>) {
# CHECK-COUNT-1: vector.load
# CHECK: amdgpu.lds_barrier
# CHECK-COUNT-1: vector.store
# CHECK-COUNT-1: vector.load
# CHECK-COUNT-1: vector.store
# CHECK: amdgpu.lds_barrier
# CHECK-COUNT-2: vector.load
# CHECK: amdgpu.mfma
# CHECK: scf.yield
# CHECK-COUNT-4: vector.store


AI removed a bit too much here.

Signed-off-by: Tim Gymnich <[email protected]>
Signed-off-by: Tim Gymnich <[email protected]>
@ftynse
Contributor

ftynse commented Dec 19, 2025

This is a prime example of why we should do more unit-style tests and fewer integration tests. We should still have some integration tests, but avoid tight coupling in them: just check that key things happen.

