
Missed optimization: multiple instances of a small struct don't reuse the stack allocation #141649


Open
ohadravid opened this issue May 27, 2025 · 11 comments
Labels
A-codegen Area: Code generation A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html A-mir-opt Area: MIR optimizations C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-opsem Relevant to the opsem team

Comments

@ohadravid (Contributor) commented May 27, 2025:

When creating multiple instances of a small struct, each instance is allocated in a separate stack slot, even when their live ranges are known never to overlap.

Example: the following code generates two `alloca` instructions that are not optimized away by LLVM:

(Godbolt)

pub struct WithOffset<T> {
    pub data: T,
    pub offset: usize,
}

#[inline(never)]
pub fn use_w(w: WithOffset<&[u8; 16]>) {
    std::hint::black_box(w);
}

#[inline(never)]
pub fn peek_w(w: &WithOffset<&[u8; 16]>) {
    std::hint::black_box(w);
}

pub fn offsets(buf: [u8; 16]) {
    let w = WithOffset {
        data: &buf,
        offset: 0,
    };

    peek_w(&w);
    use_w(w);

    let w2 = WithOffset {
        data: &buf,
        offset: 1,
    };

    peek_w(&w2);
    use_w(w2);
}

LLVM IR:

; playground::offsets
; Function Attrs: noinline nounwind
define internal fastcc void @playground::offsets(ptr noalias nocapture noundef nonnull readonly align 1 dereferenceable(16) %buf) unnamed_addr #0 {
start:
  %w2 = alloca [16 x i8], align 8
  %w = alloca [16 x i8], align 8
  store ptr %buf, ptr %w, align 8
  %0 = getelementptr inbounds nuw i8, ptr %w, i64 8
  store i64 0, ptr %0, align 8
; call playground::peek_w
  call fastcc void @playground::peek_w(ptr noalias noundef readonly align 8 dereferenceable(16) %w) #88
; call playground::use_w
  call fastcc void @playground::use_w(ptr noalias noundef readonly align 1 dereferenceable(16) %buf, i64 noundef 0) #88
  store ptr %buf, ptr %w2, align 8
  %1 = getelementptr inbounds nuw i8, ptr %w2, i64 8
  store i64 1, ptr %1, align 8
; call playground::peek_w
  call fastcc void @playground::peek_w(ptr noalias noundef readonly align 8 dereferenceable(16) %w2) #88
; call playground::use_w
  call fastcc void @playground::use_w(ptr noalias noundef readonly align 1 dereferenceable(16) %buf, i64 noundef 1) #88
  ret void
}

It seems that calls to @llvm.lifetime.{start,end}.p0 are missing. If we instead use:

pub fn closures(buf: [u8; 16]) {
    (|| {
        let w = WithOffset {
            data: &buf,
            offset: 0,
        };

        peek_w(&w);
        use_w(w);
    })();

    (|| {
        let w2 = WithOffset {
            data: &buf,
            offset: 1,
        };

        peek_w(&w2);
        use_w(w2);
    })();
}

We do get the lifetime intrinsics, and the second alloca is optimized away (see the Godbolt link).

I encountered this while working on memorysafety/rav1d#1402, where this missed optimization results in over 100 bytes of extra stack allocation in a specific function, which slows down the entire binary by ~0.5%.

This might also be related to #138544

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label May 27, 2025
@ohadravid ohadravid changed the title Missed optimization: creating multiple instances of a small struct results don't reuse the stack allocation Missed optimization: multiple instances of a small struct don't reuse the stack allocation May 27, 2025
@saethlin (Member):

This might also be related to #138544

Can you explain how? That issue is about a missing MIR optimization, and this issue is about an LLVM optimization.

@ohadravid (Contributor, Author):

I thought it might be related to that issue because there's a "missing" StorageDead call - but maybe I just got confused 🙂

@lolbinarycat lolbinarycat added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels May 27, 2025
@hanna-kruppe (Contributor):

I suspect the desired optimization depends on unresolved details of Rust's operational semantics (e.g., rust-lang/unsafe-code-guidelines#188). Does moving out of w to call use_w have any special significance? In particular, does it guarantee the storage can be deallocated eagerly, since nothing appears to re-initialize it? What if WithOffset were Copy? It's possible that Stacked Borrows or Tree Borrows can justify it by saying that the reference that escaped into peek_w can't be used anymore after the function returns (but I wouldn't bet on it either). However, it's hard to justify optimizations based on an aliasing model that isn't agreed upon yet.
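[Editor's note] The Copy variant asked about above can be sketched as a self-contained program (this restates the issue's struct with an added derive; the `#[inline(never)]` wrappers are dropped for brevity). With Copy, passing `w` to `use_w` no longer moves out of it, so `w` stays initialized after the call, which plausibly makes eager storage deallocation harder to justify:

```rust
// Hypothetical Copy variant of the issue's WithOffset (the original does
// not derive Copy).
#[derive(Clone, Copy)]
pub struct WithOffset<T> {
    pub data: T,
    pub offset: usize,
}

pub fn use_w(w: WithOffset<&[u8; 16]>) {
    std::hint::black_box(w);
}

fn main() {
    let buf = [0u8; 16];
    let w = WithOffset { data: &buf, offset: 0 };
    use_w(w); // copies `w` instead of moving it
    // `w` is still live and usable here, unlike in the non-Copy version.
    assert_eq!(w.offset, 0);
}
```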

@hanna-kruppe (Contributor):

For example, miri (without any flags, at least) doesn't take any issue with this program that stashes a pointer to x1 in the first call to peek and then reads through that pointer in the second call to peek:

use std::ptr;
use std::cell::Cell;

thread_local! {
    static LAST_X: Cell<*const u32> = const { Cell::new(ptr::null()) };
} 

fn peek(x: &u32) -> u32 {
    let last_x: *const u32 = LAST_X.get();
    let result: u32 = if last_x.is_null() {
        *x
    } else {
        unsafe { *last_x }
    };
    LAST_X.set(x);
    result
}


fn main() {
    let x1 = 5;
    dbg!(peek(&x1));
    let x2 = 8;
    dbg!(peek(&x2));
}

@saethlin (Member):

The discussion in #138544 about storage markers is not related. It is very easy to think you've found a pattern in MIR that can be optimized with a trivial analysis, only to write a pass that is unsound. I've done it too.

On this specific issue: I think the storage markers are what we want at -Zmir-opt-level=0, and yet the assembly seems unchanged. @hanna-kruppe, can you check me on that?

@hanna-kruppe (Contributor) commented May 27, 2025:

In the MIR I'm looking at, the relevant locals in offsets are _2 and _9, both of which are live until right before the return in bb4. (_8 and _15 are short-lived temporaries, only live during the use_w calls.) This is what I would expect, since it matches the surface Rust semantics before any optimizations, but it means they have to go in disjoint stack slots because their lifetimes overlap.

@saethlin (Member):

Ah! I was looking at the wrong locals. You are right.

@ohadravid (Contributor, Author) commented May 27, 2025:

@hanna-kruppe that's a very surprising example 😮
BTW - in the Rav1d case, it is a Copy struct.

I think there's something more going on. If I use blocks like this:

pub fn offsets(buf: [u8; 16]) {
    {
        let w = WithOffset {
            data: &buf,
            offset: 0,
        };

        peek_w(&w);
        use_w(w);
    }

    {
        let w2 = WithOffset {
            data: &buf,
            offset: 1,
        };

        peek_w(&w2);
        use_w(w2);
    }
}

There are still two alloca calls, but Miri will complain about:

fn main() {
    {
        let x1 = 5;
        dbg!(peek(&x1));
    }
    {
        let x2 = 8;
        dbg!(peek(&x2));
    }
}

so maybe in this case the semantics are defined?

@hanna-kruppe (Contributor):

Yeah, I haven't looked into it any further, but I would expect that adding blocks so that the first local is dead before the second one is introduced would allow overlapping their storage.

@tmiasko (Contributor) commented May 27, 2025:

In the variant with separate scopes (#141649 (comment)), overlapping the storage of w and w2 would be valid. At the moment there are two MIR passes that inhibit such an optimization by removing storage markers: CopyProp and GVN. For example, compare builds with -O and -O -Zmir-enable-passes=-CopyProp,-GVN.
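[Editor's note] The comparison above can be reproduced with something like the following (a sketch, not from the thread; it assumes a nightly toolchain, since -Z flags are unstable, and a file example.rs containing the scoped variant of offsets — the file and output names are illustrative):

```shell
# Default pipeline: CopyProp/GVN remove some Storage{Live,Dead} markers
rustc +nightly -O --emit=llvm-ir example.rs -o default.ll

# With CopyProp and GVN disabled, more storage markers survive, letting
# LLVM emit lifetime intrinsics and overlap the stack slots
rustc +nightly -O -Zmir-enable-passes=-CopyProp,-GVN --emit=llvm-ir example.rs -o no-cp-gvn.ll

# Compare how many allocas remain in each build
grep -c 'alloca' default.ll no-cp-gvn.ll
```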

@tmiasko tmiasko added the A-mir-opt Area: MIR optimizations label May 27, 2025
@nikic nikic added A-mir-opt Area: MIR optimizations and removed A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir-opt Area: MIR optimizations labels May 27, 2025
@saethlin saethlin added A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html T-opsem Relevant to the opsem team labels May 27, 2025
@saethlin (Member):

I don't think it's fair to call this just a mir-opt bug, because even without any optimizations we still generate MIR that is incompatible with the desired optimization.

kkysen added a commit to memorysafety/rav1d that referenced this issue Jun 14, 2025
* Supersedes #1402.

## Summary

There seems to be a slowdown in some of the Arm asm functions due to
increased stack usage in `rav1d_cdef_brow`
(`cdef_filter4_pri_edged_8bpc_neon` is ~20% slower when called from
rav1d).

By keeping `WithOffset`s "internal" to the function, LLVM is able to
optimize away most of the `alloca`-s (first commit).

After doing that, it seemed that `backup2x8` was still slower when
calling `.stride()` (vs. a version which doesn't have the Rust fallback
args _at all_), which I think is related to how LLVM optimizes the
entire `rav1d_cdef_brow` (second commit).

main:

(`$ cargo asm -p rav1d-cli --bin dav1d --llvm rav1d_cdef_brow 0`)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @Rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [16 x i8], align 8
  %12 = alloca [16 x i8], align 8
  %13 = alloca [16 x i8], align 8
  %14 = alloca [16 x i8], align 8
  %15 = alloca [16 x i8], align 8
  %16 = alloca [16 x i8], align 8
  %17 = alloca [24 x i8], align 8
  %18 = alloca [24 x i8], align 8
  %19 = alloca [4 x i8], align 4
  %20 = alloca [96 x i8], align 16
  %21 = icmp sgt i32 %5, 0
  %22 = select i1 %21, i32 12, i32 8
  %23 = load ptr, ptr %3, align 8
  ..
```

This branch:
```llvm
define internal fastcc void @Rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [24 x i8], align 8
  %12 = alloca [24 x i8], align 8
  %13 = alloca [4 x i8], align 4
  %14 = alloca [96 x i8], align 16
```

~This branch:~
(This is the most "optimized" version in this sense, but couldn't get
the same with the current Rust/Asm argument arrangement requirements:)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @Rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [4 x i8], align 4
  %12 = alloca [96 x i8], align 16
  %13 = icmp sgt i32 %5, 0
  %14 = select i1 %13, i32 12, i32 8
  %15 = load ptr, ptr %3, align 8
  ..
```


## Full details

When comparing to `dav1d`, the `cdef_filter4_pri_edged_8bpc_neon` asm
function seems to be ~20% slower when called from `rav1d`.

Looking at the per-instruction sample count, there's one with a big,
consistent diff:

dav1d:

![dav1d](https://github.com/user-attachments/assets/2bc606aa-f27e-4e09-9c45-403cf6fc780d)

rav1d:

![rav1d](https://github.com/user-attachments/assets/607c87b7-43c3-44d5-9e7e-2213d3dc4362)

From this, I tried to look at the callers. The problem _seems_ to be
that when the src buffer in `x13` is placed "far enough back" in the
stack, the load stalls.

This fix is inspired by what I saw in
rust-lang/rust#141649 - since `WithOffset`s
seemed to match the `%10,..,%18` `alloca`-s in the IR above, I tried to
force LLVM to do the right thing and optimize them away (I tried a few
other things, but this is the most effective one).


With this fix, the diff in the sample count is gone, with a nice
speedup:

```bash
rav1d-pr % hyperfine --warmup 3 --runs 15 --parameter-list profile target/release/dav1d,target/release-baseline/dav1d '{profile} -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1'

Benchmark 1: target/release/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     71.918 s ±  0.066 s    [User: 71.595 s, System: 0.233 s]
  Range (min … max):   71.812 s … 72.032 s    15 runs

Benchmark 2: target/release-baseline/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.294 s ±  0.078 s    [User: 71.945 s, System: 0.235 s]
  Range (min … max):   72.183 s … 72.438 s    15 runs

release-baseline = 1be76ea
```

I'm not sure how much better this will be on x86, but it should still be faster.
bors added a commit that referenced this issue Jun 15, 2025
…try>

Remove fewer Storage calls in `copy_prop`

Modify the `copy_prop` MIR optimization pass to remove fewer `Storage{Live,Dead}` calls, allowing for better optimizations by LLVM - see #141649.

### Details

This is my attempt to fix the mentioned issue (this is the first part, I also implemented a similar solution for GVN in [this branch](https://github.com/rust-lang/rust/compare/master...ohadravid:rust:better-storage-calls-gvn-v2?expand=1)).

The idea is to use the `MaybeStorageDead` analysis and remove only the storage calls of `head`s that are maybe-storage-dead when the associated `local` is accessed (or, conversely, keep the storage of `head`s that are for-sure alive in _every_ relevant access).
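[Editor's note] The decision rule described above can be sketched as pseudocode (the names here are illustrative, not the actual identifiers in the rustc pass):

```
for each `head` local into which copy_prop merged other locals:
    if MaybeStorageDead(head) holds at any point where head or a
    merged local is accessed:
        remove head's StorageLive/StorageDead markers
        # conservative: storage is treated as live for the whole body
    else:
        keep head's storage markers
        # LLVM can then emit lifetime intrinsics and overlap stack slots
```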

When combined with the GVN change, the final example in the issue (#141649 (comment)) is optimized as expected by LLVM. I also measured the effect on a few functions in `rav1d` (where I originally saw the issue) and observed reduced stack usage in several of them.

This is my first attempt at working with MIR optimizations, so it's possible this isn't the right approach — but all tests pass, and the resulting diffs appear correct.

r? tmiasko

since he commented on the issue and pointed to these passes.
bors added a commit that referenced this issue Jun 15, 2025
…try>

Remove fewer Storage calls in `copy_prop`

Modify the `copy_prop` MIR optimization pass to remove fewer `Storage{Live,Dead}` calls, allowing for better optimizations by LLVM - see #141649.

### Details

This is my attempt to fix the mentioned issue (this is the first part, I also implemented a similar solution for GVN in [this branch](https://github.com/rust-lang/rust/compare/master...ohadravid:rust:better-storage-calls-gvn-v2?expand=1)).

The idea is to use the `MaybeStorageDead` analysis and remove only the storage calls of `head`s that are maybe-storage-dead when the associated `local` is accessed (or, conversely, keep the storage of `head`s that are for-sure alive in _every_ relevant access).

When combined with the GVN change, the final example in the issue (#141649 (comment)) is optimized as expected by LLVM. I also measured the effect on a few functions in `rav1d` (where I originally saw the issue) and observed reduced stack usage in several of them.

This is my first attempt at working with MIR optimizations, so it's possible this isn't the right approach — but all tests pass, and the resulting diffs appear correct.

r? tmiasko

since he commented on the issue and pointed to these passes.