Disable minimize_shared_allocs pass by default #574
Conversation
Signed-off-by: Aurore De Spirlet <[email protected]>
Force-pushed from 7666153 to e3a0a76.
I suspect the tests need to be updated, and we also want to keep the pass enabled in the options for extend attention and the like. We could also consider a heuristic approach: compute the total LDS footprint required by the kernel and turn the pass on dynamically when it exceeds the available amount. That requires target information to be available, though, and may be better done as an MLIR-based pass, where arch information is easier to come by; see the sketch below. So don't block on this.
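A minimal sketch of that heuristic, assuming a 64 KiB per-workgroup LDS budget; `SharedAlloc`, `should_minimize_shared_allocs`, and the default budget are illustrative assumptions, not the project's actual API or a statement about any particular target:

```python
# Hypothetical heuristic sketch: enable minimize_shared_allocs only when the
# naive footprint of all distinct shared allocations would overflow LDS.
from dataclasses import dataclass


@dataclass
class SharedAlloc:
    num_elements: int
    element_bytes: int


def should_minimize_shared_allocs(
    allocs: list[SharedAlloc], available_lds_bytes: int = 64 * 1024
) -> bool:
    """Return True if keeping the allocations distinct would exceed LDS."""
    footprint = sum(a.num_elements * a.element_bytes for a in allocs)
    return footprint > available_lds_bytes
```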
Signed-off-by: Tim Gymnich <[email protected]>
# CHECK-COUNT-1: memref.alloc
# CHECK: scf.for %[[ARG3:.*]] = %[[C0]] to %[[C10]] step %[[C1]] {
# CHECK: scf.for %arg4 = %[[C0]] to %[[C4]] step %[[C1]]
# CHECK: memref.alloc
potential regression
# CHECK: memref.alloc
# CHECK: scf.for %[[ARG5:.*]] = %[[C0]] to %[[C4]] step %[[C1]] iter_args(%[[ARG6:.*]] = %[[CST:.*]]) -> (vector<4xf32>) {
potential regression
# CHECK-DAG: %[[C1:.*]] = arith.constant 1 : index
# CHECK-DAG: %[[C4:.*]] = arith.constant 4 : index
# CHECK-DAG: %[[C0:.*]] = arith.constant 0 : index
# CHECK-DAG: %[[C10:.*]] = arith.constant 10 : index
# CHECK-DAG: %[[CST_0:.*]] = arith.constant dense<0.000000e+00> : vector<4xf32>
# CHECK-COUNT-1: memref.alloc
# CHECK: %[[CAST_INIT_B:.*]] = arith.index_cast
# CHECK: %[[WHILE:.*]] = scf.while (%[[ACC:.*]] = %[[CST_0]], %[[B:.*]] = %[[CAST_INIT_B]]) : (vector<4xf32>, index) -> (vector<4xf32>, index) {
# CHECK: %[[COND:.*]] = arith.cmpi slt, %[[B]], %[[C10]] : index
# CHECK: scf.condition(%[[COND]]) %[[ACC]], %[[B]] : vector<4xf32>, index
# CHECK: } do {
# CHECK: ^bb0(%[[ACC:.*]]: vector<4xf32>, %[[B:.*]]: index):
# CHECK: %[[FOR:.*]] = scf.for %[[ARG7:.*]] = %[[C0]] to %[[C4]] step %[[C1]] iter_args(%[[ARG8:.*]] = %[[CST_0]]) -> (vector<4xf32>) {
# CHECK-COUNT-1: vector.load
# CHECK: amdgpu.lds_barrier
# CHECK-COUNT-1: vector.store
# CHECK-COUNT-1: vector.load
# CHECK-COUNT-1: vector.store
# CHECK: amdgpu.lds_barrier
# CHECK-COUNT-2: vector.load
# CHECK: amdgpu.mfma
# CHECK: scf.yield
# CHECK-COUNT-4: vector.store
AI removed a bit too much here.
Signed-off-by: Tim Gymnich <[email protected]>
Signed-off-by: Tim Gymnich <[email protected]>
This is a prime example of why we should write more unit-style tests and fewer integration tests. We should still have some integration tests, but avoid tight coupling in them: just check that the key things happen.
This change disables the minimize_shared_allocs pass by default (the option now defaults to False).
Currently, this pass reduces peak LDS usage by combining distinct memref.alloc operations into a single allocation with views. However, because all the views derive from one base buffer, the backend can no longer infer aliasing information. As a result, it conservatively inserts unnecessary wait-count instructions (vmcnt(0)), which breaks software pipelining.
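For illustration, the rewrite looks roughly like the following, written in the same FileCheck notation the tests above use. This is a hand-written sketch: the shapes, byte offsets, and address-space annotation are made up, not taken from the pass's actual output.

```python
# Before: two independent allocations the backend can prove disjoint.
# CHECK: memref.alloc() : memref<32x64xf16, #gpu.address_space<workgroup>>
# CHECK: memref.alloc() : memref<32x64xf16, #gpu.address_space<workgroup>>

# After minimize_shared_allocs: one base buffer plus views at different
# offsets, which the backend must treat as potential aliases.
# CHECK: %[[BASE:.*]] = memref.alloc() : memref<8192xi8, #gpu.address_space<workgroup>>
# CHECK: memref.view %[[BASE]][%c0][] : {{.*}} to memref<32x64xf16, {{.*}}>
# CHECK: memref.view %[[BASE]][%c4096][] : {{.*}} to memref<32x64xf16, {{.*}}>
```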
While this optimization is beneficial for kernels with disjoint buffer lifetimes (such as extend attention), it causes significant performance regressions in standard pipelined kernels.
Let's disable it by default and enable it explicitly only in the specific cases where it is required; a sketch of the explicit opt-in follows.
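A self-contained sketch of the new default and the per-kernel opt-in, using an illustrative `CompileOptions` dataclass rather than the project's actual options class; only the option name `minimize_shared_allocs` comes from this PR:

```python
# Hypothetical illustration of the new default behavior.
from dataclasses import dataclass


@dataclass
class CompileOptions:
    # Disabled by default after this change; opt in per kernel when the
    # LDS footprint demands it (e.g. extend attention).
    minimize_shared_allocs: bool = False


default_opts = CompileOptions()
extend_attention_opts = CompileOptions(minimize_shared_allocs=True)
assert not default_opts.minimize_shared_allocs
assert extend_attention_opts.minimize_shared_allocs
```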