Missed optimization: multiple instances of a small struct don't reuse the stack allocation #141649
Comments
Can you explain how? That issue is about a missing MIR optimization, and this issue is about an LLVM optimization.
I thought it might be related to that issue because there's a "missing" …
I suspect the desired optimization depends on unresolved details of Rust's operational semantics (e.g., rust-lang/unsafe-code-guidelines#188). Does moving out of …
For example, Miri (without any flags, at least) doesn't take any issue with this program that stashes a pointer to its argument and reads through it on the next call:

```rust
use std::cell::Cell;
use std::ptr;

thread_local! {
    static LAST_X: Cell<*const u32> = const { Cell::new(ptr::null()) };
}

fn peek(x: &u32) -> u32 {
    let last_x: *const u32 = LAST_X.get();
    let result: u32 = if last_x.is_null() {
        *x
    } else {
        unsafe { *last_x }
    };
    LAST_X.set(x);
    result
}

fn main() {
    let x1 = 5;
    dbg!(peek(&x1));
    let x2 = 8;
    dbg!(peek(&x2));
}
```
The discussion in #138544 about storage markers is not related. It is very easy to think you've found a pattern in MIR that can be optimized with trivial analysis but then write a pass that is unsound. I've done it too. As to this specific issue, I think the storage markers are what we want in …
In the MIR I'm looking at, the relevant locals in …
Ah! I was looking at the wrong locals. You are right.
@hanna-kruppe that's a very surprising example 😮 I think there's something more going on. If I use blocks like this:

```rust
pub fn offsets(buf: [u8; 16]) {
    {
        let w = WithOffset {
            data: &buf,
            offset: 0,
        };
        peek_w(&w);
        use_w(w);
    }
    {
        let w2 = WithOffset {
            data: &buf,
            offset: 1,
        };
        peek_w(&w2);
        use_w(w2);
    }
}
```

There are still two `alloca` calls. And with your example written the same way:

```rust
fn main() {
    {
        let x1 = 5;
        dbg!(peek(&x1));
    }
    {
        let x2 = 8;
        dbg!(peek(&x2));
    }
}
```

so maybe in this case the semantics are defined?
Yeah, I haven't looked into it any further, but I would expect that adding blocks so that the first local is dead before the second one is introduced would allow overlapping their storage.
In a variant with separate scopes (#141649 (comment)), overlapping storage of …
I don't think it's fair to call this just a mir-opt bug, because even without any optimizations we still generate MIR that is incompatible with the desired optimization.
* Supersedes #1402.

## Summary

There seems to be a slowdown in some of the Arm asm functions due to increased stack usage in `rav1d_cdef_brow` (`cdef_filter4_pri_edged_8bpc_neon` is ~20% slower when called from rav1d). By keeping `WithOffset`s "internal" to the function, LLVM is able to optimize away most of the `alloca`s (first commit).

After doing that, it seemed that `backup2x8` was still slower when calling `.stride()` (vs. a version which doesn't have the Rust fallback args _at all_), which I think is related to how LLVM optimizes the entire `rav1d_cdef_brow` (second commit).

main: (`$ cargo asm -p rav1d-cli --bin dav1d --llvm rav1d_cdef_brow 0`)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @Rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [16 x i8], align 8
  %12 = alloca [16 x i8], align 8
  %13 = alloca [16 x i8], align 8
  %14 = alloca [16 x i8], align 8
  %15 = alloca [16 x i8], align 8
  %16 = alloca [16 x i8], align 8
  %17 = alloca [24 x i8], align 8
  %18 = alloca [24 x i8], align 8
  %19 = alloca [4 x i8], align 4
  %20 = alloca [96 x i8], align 16
  %21 = icmp sgt i32 %5, 0
  %22 = select i1 %21, i32 12, i32 8
  %23 = load ptr, ptr %3, align 8
  ..
```

This branch:

```llvm
define internal fastcc void @Rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [24 x i8], align 8
  %12 = alloca [24 x i8], align 8
  %13 = alloca [4 x i8], align 4
  %14 = alloca [96 x i8], align 16
```

~This branch:~ (This is the most "optimized" version in this sense, but I couldn't get the same result with the current Rust/asm argument arrangement requirements:)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @Rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [4 x i8], align 4
  %12 = alloca [96 x i8], align 16
  %13 = icmp sgt i32 %5, 0
  %14 = select i1 %13, i32 12, i32 8
  %15 = load ptr, ptr %3, align 8
  ..
```

## Full details

When comparing to `dav1d`, the `cdef_filter4_pri_edged_8bpc_neon` asm function seems to be ~20% slower when called from `rav1d`. Looking at the per-instruction sample count, there's one with a big, consistent diff:

dav1d: (per-instruction profile screenshot)

rav1d: (per-instruction profile screenshot)

From this, I tried to look at the callers. The problem _seems_ to be that when the src buffer in `x13` is placed "far enough back" in the stack, the load stalls.

This fix is inspired by what I saw in rust-lang/rust#141649 - since `WithOffset`s seemed to match the `%10,..,%18` `alloca`s in the IR above, I tried to force LLVM to do the right thing and optimize them away (I tried a few other things, but this is the most effective one).
With this fix, the diff in the sample count is gone, with a nice speedup:

```bash
rav1d-pr % hyperfine --warmup 3 --runs 15 --parameter-list profile target/release/dav1d,target/release-baseline/dav1d '{profile} -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1'
Benchmark 1: target/release/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     71.918 s ±  0.066 s    [User: 71.595 s, System: 0.233 s]
  Range (min … max):   71.812 s … 72.032 s    15 runs

Benchmark 2: target/release-baseline/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.294 s ±  0.078 s    [User: 71.945 s, System: 0.235 s]
  Range (min … max):   72.183 s … 72.438 s    15 runs

release-baseline = 1be76ea
```

I'm not sure how much better this will be on x86, but it should still be faster.
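As a purely illustrative sketch of the direction described above (hypothetical names, not the actual rav1d change): instead of building `WithOffset` values in the long-lived caller frame and handing them to helpers, the helpers can receive the raw parts and construct the wrapper internally, where LLVM can usually optimize the temporary away.

```rust
// Hypothetical sketch, not the actual rav1d code: `WithOffset`, `stride`, and
// the helper functions are stand-ins to show the shape of the refactor.
struct WithOffset<'a> {
    data: &'a [u8],
    offset: usize,
}

impl WithOffset<'_> {
    fn stride(&self) -> usize {
        self.data.len()
    }
}

// Before: the caller materializes a `WithOffset` per call and passes it in;
// in the real function these wrappers showed up as one `alloca` per instance
// in the caller's frame.
fn backup_taking_wrapper(w: WithOffset<'_>) -> usize {
    w.stride() + w.offset
}

// After: take the raw parts and keep the wrapper "internal" to the helper;
// the temporary never lives in the caller's frame at all.
fn backup_taking_parts(data: &[u8], offset: usize) -> usize {
    let w = WithOffset { data, offset };
    w.stride() + w.offset
}

fn main() {
    let buf = [0u8; 96];
    assert_eq!(
        backup_taking_wrapper(WithOffset { data: &buf, offset: 1 }),
        backup_taking_parts(&buf, 1),
    );
}
```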
Remove fewer Storage calls in `copy_prop`

Modify the `copy_prop` MIR optimization pass to remove fewer `Storage{Live,Dead}` calls, allowing for better optimizations by LLVM - see #141649.

### Details

This is my attempt to fix the mentioned issue (this is the first part; I also implemented a similar solution for GVN in [this branch](https://github.com/rust-lang/rust/compare/master...ohadravid:rust:better-storage-calls-gvn-v2?expand=1)).

The idea is to use the `MaybeStorageDead` analysis and remove only the storage calls of `head`s that are maybe-storage-dead when the associated `local` is accessed (or, conversely, keep the storage of `head`s that are for-sure alive in _every_ relevant access).

When combined with the GVN change, the final example in the issue (#141649 (comment)) is optimized as expected by LLVM. I also measured the effect on a few functions in `rav1d` (where I originally saw the issue) and observed reduced stack usage in several of them.

This is my first attempt at working with MIR optimizations, so it's possible this isn't the right approach, but all tests pass and the resulting diffs appear correct.

r? tmiasko since he commented on the issue and pointed to these passes.
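A schematic sketch of the rule described in that PR text, assuming hypothetical `Local`/`Location` types and a hypothetical `maybe_storage_dead` query; this is not the actual rustc implementation, just the per-copy decision it describes.

```rust
// Schematic only; all types and helpers here are hypothetical stand-ins.
type Local = usize;
type Location = usize;

/// Decide whether `head`'s Storage{Live,Dead} markers must be removed after
/// copy propagation merged another local into `head`. `accesses` are the
/// locations where the merged value is used after the rewrite, and
/// `maybe_storage_dead(head, loc)` answers the dataflow query "might `head`
/// be storage-dead at `loc`?".
fn must_remove_storage_markers(
    head: Local,
    accesses: &[Location],
    maybe_storage_dead: impl Fn(Local, Location) -> bool,
) -> bool {
    // Old behaviour: always remove the markers. New behaviour: keep them when
    // `head` is definitely storage-live at every relevant access, and remove
    // them only if it might be dead at some access.
    accesses.iter().any(|&loc| maybe_storage_dead(head, loc))
}

fn main() {
    // Toy example: `head` (local 1) might be storage-dead at location 7.
    let maybe_dead = |local: Local, loc: Location| local == 1 && loc == 7;
    assert!(must_remove_storage_markers(1, &[3, 7], &maybe_dead));
    assert!(!must_remove_storage_markers(1, &[3, 4], &maybe_dead));
}
```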
When creating multiple instances of a small struct, each instance is allocated separately on the stack even if their live ranges are known never to overlap.
Example: the following code will generate two `alloca` calls that are not optimized away by LLVM (Godbolt).
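The exact snippet from the report isn't captured here; judging from the block-scoped variant quoted in the comments above, it was presumably along these lines (`WithOffset`, `peek_w`, and `use_w` below are hypothetical stand-ins, not the original definitions):

```rust
// Reconstruction of the kind of code in question; the struct and helper
// functions are stand-ins for the originals.
pub struct WithOffset<'a> {
    pub data: &'a [u8; 16],
    pub offset: usize,
}

#[inline(never)]
pub fn peek_w(w: &WithOffset<'_>) -> u8 {
    w.data[w.offset]
}

#[inline(never)]
pub fn use_w(w: WithOffset<'_>) -> u8 {
    w.data[w.offset]
}

// Two instances of the small struct: their uses never overlap, yet each one
// ends up with its own stack slot (its own `alloca`).
pub fn offsets(buf: [u8; 16]) {
    let w = WithOffset { data: &buf, offset: 0 };
    peek_w(&w);
    use_w(w);

    let w2 = WithOffset { data: &buf, offset: 1 };
    peek_w(&w2);
    use_w(w2);
}
```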
In the generated LLVM IR, it seems like the calls to `@llvm.lifetime.{start,end}.p0` are missing. If we instead use an alternative form of the code, we do get them and the second `alloca` is optimized away (see the Godbolt link).

I encountered this when working on memorysafety/rav1d#1402, where this missed optimization results in over 100 bytes of extra stack allocations in a specific function, which slows down the entire binary by ~0.5%.
This might also be related to #138544