Disable minimize_shared_allocs pass by default #574
Conversation
Signed-off-by: Aurore De Spirlet <[email protected]>
Force-pushed from 7666153 to e3a0a76.
I suspect the tests need to be updated, and we also want to keep the pass enabled in the options for extend attention and the like. We could also consider a heuristic approach: compute the total LDS footprint required by the kernel and turn the pass on dynamically when it exceeds the available amount. That requires target information to be available, though, and may be better done as an MLIR-based pass, where arch information is easier to come by; see the sketch below. So don't block on this.
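A minimal sketch of that heuristic, assuming a 64 KiB per-workgroup LDS budget; `SharedAlloc`, `should_minimize_shared_allocs`, and the default budget are illustrative assumptions, not the project's actual API or a statement about any particular target:

```python
# Hypothetical heuristic sketch: enable minimize_shared_allocs only when the
# naive footprint of all distinct shared allocations would overflow LDS.
from dataclasses import dataclass


@dataclass
class SharedAlloc:
    num_elements: int
    element_bytes: int


def should_minimize_shared_allocs(
    allocs: list[SharedAlloc], available_lds_bytes: int = 64 * 1024
) -> bool:
    """Return True if keeping the allocations distinct would exceed LDS."""
    footprint = sum(a.num_elements * a.element_bytes for a in allocs)
    return footprint > available_lds_bytes
```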
Signed-off-by: Tim Gymnich <[email protected]>
# CHECK-COUNT-1: memref.alloc
# CHECK: scf.for %[[ARG3:.*]] = %[[C0]] to %[[C10]] step %[[C1]] {
# CHECK: scf.for %arg4 = %[[C0]] to %[[C4]] step %[[C1]]
# CHECK: memref.alloc
potential regression
# CHECK: memref.alloc
# CHECK: scf.for %[[ARG5:.*]] = %[[C0]] to %[[C4]] step %[[C1]] iter_args(%[[ARG6:.*]] = %[[CST:.*]]) -> (vector<4xf32>) {
potential regression
# CHECK-DAG: %[[C1:.*]] = arith.constant 1 : index
# CHECK-DAG: %[[C4:.*]] = arith.constant 4 : index
# CHECK-DAG: %[[C0:.*]] = arith.constant 0 : index
# CHECK-DAG: %[[C10:.*]] = arith.constant 10 : index
# CHECK-DAG: %[[CST_0:.*]] = arith.constant dense<0.000000e+00> : vector<4xf32>
# CHECK-COUNT-1: memref.alloc
# CHECK: %[[CAST_INIT_B:.*]] = arith.index_cast
# CHECK: %[[WHILE:.*]] = scf.while (%[[ACC:.*]] = %[[CST_0]], %[[B:.*]] = %[[CAST_INIT_B]]) : (vector<4xf32>, index) -> (vector<4xf32>, index) {
# CHECK: %[[COND:.*]] = arith.cmpi slt, %[[B]], %[[C10]] : index
# CHECK: scf.condition(%[[COND]]) %[[ACC]], %[[B]] : vector<4xf32>, index
# CHECK: } do {
# CHECK: ^bb0(%[[ACC:.*]]: vector<4xf32>, %[[B:.*]]: index):
# CHECK: %[[FOR:.*]] = scf.for %[[ARG7:.*]] = %[[C0]] to %[[C4]] step %[[C1]] iter_args(%[[ARG8:.*]] = %[[CST_0]]) -> (vector<4xf32>) {
# CHECK-COUNT-1: vector.load
# CHECK: amdgpu.lds_barrier
# CHECK-COUNT-1: vector.store
# CHECK-COUNT-1: vector.load
# CHECK-COUNT-1: vector.store
# CHECK: amdgpu.lds_barrier
# CHECK-COUNT-2: vector.load
# CHECK: amdgpu.mfma
# CHECK: scf.yield
# CHECK-COUNT-4: vector.store
AI removed a bit too much here.
Signed-off-by: Tim Gymnich <[email protected]>
Signed-off-by: Tim Gymnich <[email protected]>
This is a prime example of why we should write more unit-style tests and fewer integration tests. We should still have some integration tests, but avoid tight coupling in them: just check that the key things happen.
This change disables the minimize_shared_allocs pass by default (the option now defaults to False).
Currently, this pass reduces peak LDS usage by combining distinct memref.alloc operations into a single allocation with views. However, because all the views derive from one base buffer, the backend can no longer infer aliasing information. As a result, it conservatively inserts unnecessary wait-count instructions (vmcnt(0)), which breaks software pipelining.
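For illustration, the rewrite looks roughly like the following, written in the same FileCheck notation the tests above use. This is a hand-written sketch: the shapes, byte offsets, and address-space annotation are made up, not taken from the pass's actual output.

```python
# Before: two independent allocations the backend can prove disjoint.
# CHECK: memref.alloc() : memref<32x64xf16, #gpu.address_space<workgroup>>
# CHECK: memref.alloc() : memref<32x64xf16, #gpu.address_space<workgroup>>

# After minimize_shared_allocs: one base buffer plus views at different
# offsets, which the backend must treat as potential aliases.
# CHECK: %[[BASE:.*]] = memref.alloc() : memref<8192xi8, #gpu.address_space<workgroup>>
# CHECK: memref.view %[[BASE]][%c0][] : {{.*}} to memref<32x64xf16, {{.*}}>
# CHECK: memref.view %[[BASE]][%c4096][] : {{.*}} to memref<32x64xf16, {{.*}}>
```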
While this optimization is beneficial for kernels with disjoint buffer lifetimes (such as extend attention), it causes significant performance regressions in standard pipelined kernels.
Let's disable it by default and enable it explicitly only in the specific cases where it is required; a sketch of the explicit opt-in follows.
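A self-contained sketch of the new default and the per-kernel opt-in, using an illustrative `CompileOptions` dataclass rather than the project's actual options class; only the option name `minimize_shared_allocs` comes from this PR:

```python
# Hypothetical illustration of the new default behavior.
from dataclasses import dataclass


@dataclass
class CompileOptions:
    # Disabled by default after this change; opt in per kernel when the
    # LDS footprint demands it (e.g. extend attention).
    minimize_shared_allocs: bool = False


default_opts = CompileOptions()
extend_attention_opts = CompileOptions(minimize_shared_allocs=True)
assert not default_opts.minimize_shared_allocs
assert extend_attention_opts.minimize_shared_allocs
```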