Missed optimization: multiple instances of a small struct don't reuse the stack allocation #141649
Comments
Can you explain how? That issue is about a missing MIR optimization, and this issue is about an LLVM optimization.
I thought it might be related to that issue because there's a "missing" …
I suspect the desired optimization depends on unresolved details of Rust's operational semantics (e.g., rust-lang/unsafe-code-guidelines#188). Does moving out of …
For example, Miri (without any flags, at least) doesn't take any issue with this program that stashes a pointer to its argument and reads through it on the next call:

```rust
use std::cell::Cell;
use std::ptr;

thread_local! {
    static LAST_X: Cell<*const u32> = const { Cell::new(ptr::null()) };
}

fn peek(x: &u32) -> u32 {
    let last_x: *const u32 = LAST_X.get();
    let result: u32 = if last_x.is_null() {
        *x
    } else {
        unsafe { *last_x }
    };
    LAST_X.set(x);
    result
}

fn main() {
    let x1 = 5;
    dbg!(peek(&x1));
    let x2 = 8;
    dbg!(peek(&x2));
}
```
The discussion in #138544 about storage markers is not related. It is very easy to think you've found a pattern in MIR that can be optimized with trivial analysis but then write a pass that is unsound. I've done it too. As to this specific issue, I think the storage markers are what we want in …
In the MIR I'm looking at, the relevant locals in …
Ah! I was looking at the wrong locals. You are right.
@hanna-kruppe that's a very surprising example 😮 I think there's something more going on. If I use blocks like this:

```rust
pub fn offsets(buf: [u8; 16]) {
    {
        let w = WithOffset {
            data: &buf,
            offset: 0,
        };
        peek_w(&w);
        use_w(w);
    }
    {
        let w2 = WithOffset {
            data: &buf,
            offset: 1,
        };
        peek_w(&w2);
        use_w(w2);
    }
}
```

There are still two `alloca` calls. And with your example written the same way:

```rust
fn main() {
    {
        let x1 = 5;
        dbg!(peek(&x1));
    }
    {
        let x2 = 8;
        dbg!(peek(&x2));
    }
}
```

so maybe in this case the semantics are defined?
Yeah, I haven't looked into it any further, but I would expect that adding blocks so that the first local is dead before the second one is introduced would allow overlapping their storage.
In a variant with separate scopes (#141649 (comment)), overlapping storage of …
I don't think it's fair to call this just a mir-opt bug, because even without any optimizations we still generate MIR that is incompatible with the desired optimization.
* Supersedes #1402.

## Summary

There seems to be a slowdown in some of the Arm asm functions due to increased stack usage in `rav1d_cdef_brow` (`cdef_filter4_pri_edged_8bpc_neon` is ~20% slower when called from rav1d). By keeping `WithOffset`s "internal" to the function, LLVM is able to optimize away most of the `alloca`s (first commit).

After doing that, it seemed that `backup2x8` was still slower when calling `.stride()` (vs. a version which doesn't have the Rust fallback args _at all_), which I think is related to how LLVM optimizes the entire `rav1d_cdef_brow` (second commit).

main: (`$ cargo asm -p rav1d-cli --bin dav1d --llvm rav1d_cdef_brow 0`)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @Rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [16 x i8], align 8
  %12 = alloca [16 x i8], align 8
  %13 = alloca [16 x i8], align 8
  %14 = alloca [16 x i8], align 8
  %15 = alloca [16 x i8], align 8
  %16 = alloca [16 x i8], align 8
  %17 = alloca [24 x i8], align 8
  %18 = alloca [24 x i8], align 8
  %19 = alloca [4 x i8], align 4
  %20 = alloca [96 x i8], align 16
  %21 = icmp sgt i32 %5, 0
  %22 = select i1 %21, i32 12, i32 8
  %23 = load ptr, ptr %3, align 8
  ..
```

This branch:

```llvm
define internal fastcc void @Rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [24 x i8], align 8
  %12 = alloca [24 x i8], align 8
  %13 = alloca [4 x i8], align 4
  %14 = alloca [96 x i8], align 16
```

~This branch:~ (This is the most "optimized" version in this sense, but I couldn't get the same result with the current Rust/asm argument arrangement requirements:)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @Rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [4 x i8], align 4
  %12 = alloca [96 x i8], align 16
  %13 = icmp sgt i32 %5, 0
  %14 = select i1 %13, i32 12, i32 8
  %15 = load ptr, ptr %3, align 8
  ..
```

## Full details

When comparing to `dav1d`, the `cdef_filter4_pri_edged_8bpc_neon` asm function seems to be ~20% slower when called from `rav1d`. Looking at the per-instruction sample count, there's one with a big, consistent diff:

dav1d: (per-instruction profile screenshot)

rav1d: (per-instruction profile screenshot)

From this, I tried to look at the callers. The problem _seems_ to be that when the src buffer in `x13` is placed "far enough back" in the stack, the load stalls.

This fix is inspired by what I saw in rust-lang/rust#141649 - since `WithOffset`s seemed to match the `%10,..,%18` `alloca`s in the IR above, I tried to force LLVM to do the right thing and optimize them away (I tried a few other things, but this is the most effective one).
With this fix, the diff in the sample count is gone, with a nice speedup:

```bash
rav1d-pr % hyperfine --warmup 3 --runs 15 --parameter-list profile target/release/dav1d,target/release-baseline/dav1d '{profile} -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1'
Benchmark 1: target/release/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     71.918 s ±  0.066 s    [User: 71.595 s, System: 0.233 s]
  Range (min … max):   71.812 s … 72.032 s    15 runs

Benchmark 2: target/release-baseline/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.294 s ±  0.078 s    [User: 71.945 s, System: 0.235 s]
  Range (min … max):   72.183 s … 72.438 s    15 runs

release-baseline = 1be76ea
```

I'm not sure how much better this will be on x86, but it should still be faster.
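As a purely illustrative sketch of the direction described above (hypothetical names, not the actual rav1d change): instead of building `WithOffset` values in the long-lived caller frame and handing them to helpers, the helpers can receive the raw parts and construct the wrapper internally, where LLVM can usually optimize the temporary away.

```rust
// Hypothetical sketch, not the actual rav1d code: `WithOffset`, `stride`, and
// the helper functions are stand-ins to show the shape of the refactor.
struct WithOffset<'a> {
    data: &'a [u8],
    offset: usize,
}

impl WithOffset<'_> {
    fn stride(&self) -> usize {
        self.data.len()
    }
}

// Before: the caller materializes a `WithOffset` per call and passes it in;
// in the real function these wrappers showed up as one `alloca` per instance
// in the caller's frame.
fn backup_taking_wrapper(w: WithOffset<'_>) -> usize {
    w.stride() + w.offset
}

// After: take the raw parts and keep the wrapper "internal" to the helper;
// the temporary never lives in the caller's frame at all.
fn backup_taking_parts(data: &[u8], offset: usize) -> usize {
    let w = WithOffset { data, offset };
    w.stride() + w.offset
}

fn main() {
    let buf = [0u8; 96];
    assert_eq!(
        backup_taking_wrapper(WithOffset { data: &buf, offset: 1 }),
        backup_taking_parts(&buf, 1),
    );
}
```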
Remove fewer Storage calls in `copy_prop`

Modify the `copy_prop` MIR optimization pass to remove fewer `Storage{Live,Dead}` calls, allowing for better optimizations by LLVM - see #141649.

### Details

This is my attempt to fix the mentioned issue (this is the first part; I also implemented a similar solution for GVN in [this branch](https://github.com/rust-lang/rust/compare/master...ohadravid:rust:better-storage-calls-gvn-v2?expand=1)).

The idea is to use the `MaybeStorageDead` analysis and remove only the storage calls of `head`s that are maybe-storage-dead when the associated `local` is accessed (or, conversely, keep the storage of `head`s that are for-sure alive in _every_ relevant access).

When combined with the GVN change, the final example in the issue (#141649 (comment)) is optimized as expected by LLVM. I also measured the effect on a few functions in `rav1d` (where I originally saw the issue) and observed reduced stack usage in several of them.

This is my first attempt at working with MIR optimizations, so it's possible this isn't the right approach, but all tests pass and the resulting diffs appear correct.

r? tmiasko since he commented on the issue and pointed to these passes.
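A schematic sketch of the rule described in that PR text, assuming hypothetical `Local`/`Location` types and a hypothetical `maybe_storage_dead` query; this is not the actual rustc implementation, just the per-copy decision it describes.

```rust
// Schematic only; all types and helpers here are hypothetical stand-ins.
type Local = usize;
type Location = usize;

/// Decide whether `head`'s Storage{Live,Dead} markers must be removed after
/// copy propagation merged another local into `head`. `accesses` are the
/// locations where the merged value is used after the rewrite, and
/// `maybe_storage_dead(head, loc)` answers the dataflow query "might `head`
/// be storage-dead at `loc`?".
fn must_remove_storage_markers(
    head: Local,
    accesses: &[Location],
    maybe_storage_dead: impl Fn(Local, Location) -> bool,
) -> bool {
    // Old behaviour: always remove the markers. New behaviour: keep them when
    // `head` is definitely storage-live at every relevant access, and remove
    // them only if it might be dead at some access.
    accesses.iter().any(|&loc| maybe_storage_dead(head, loc))
}

fn main() {
    // Toy example: `head` (local 1) might be storage-dead at location 7.
    let maybe_dead = |local: Local, loc: Location| local == 1 && loc == 7;
    assert!(must_remove_storage_markers(1, &[3, 7], &maybe_dead));
    assert!(!must_remove_storage_markers(1, &[3, 4], &maybe_dead));
}
```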
When creating multiple instances of a small struct, each instance is allocated separately on the stack even if their live ranges are known never to overlap.
Example: the following code will generate two `alloca` calls that are not optimized away by LLVM (Godbolt).
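The exact snippet from the report isn't captured here; judging from the block-scoped variant quoted in the comments above, it was presumably along these lines (`WithOffset`, `peek_w`, and `use_w` below are hypothetical stand-ins, not the original definitions):

```rust
// Reconstruction of the kind of code in question; the struct and helper
// functions are stand-ins for the originals.
pub struct WithOffset<'a> {
    pub data: &'a [u8; 16],
    pub offset: usize,
}

#[inline(never)]
pub fn peek_w(w: &WithOffset<'_>) -> u8 {
    w.data[w.offset]
}

#[inline(never)]
pub fn use_w(w: WithOffset<'_>) -> u8 {
    w.data[w.offset]
}

// Two instances of the small struct: their uses never overlap, yet each one
// ends up with its own stack slot (its own `alloca`).
pub fn offsets(buf: [u8; 16]) {
    let w = WithOffset { data: &buf, offset: 0 };
    peek_w(&w);
    use_w(w);

    let w2 = WithOffset { data: &buf, offset: 1 };
    peek_w(&w2);
    use_w(w2);
}
```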
In the generated LLVM IR, it seems like the calls to `@llvm.lifetime.{start,end}.p0` are missing. If we instead use an alternative form of the code, we do get them and the second `alloca` is optimized away (see the Godbolt link).

I encountered this when working on memorysafety/rav1d#1402, where this missed optimization results in over 100 bytes of extra stack allocations in a specific function, which slows down the entire binary by ~0.5%.
This might also be related to #138544