Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missed autovectorization on array copy using AllocArrays.jl #57799

Open
Octogonapus opened this issue Mar 17, 2025 · 9 comments
Open

Missed autovectorization on array copy using AllocArrays.jl #57799

Octogonapus opened this issue Mar 17, 2025 · 9 comments
Labels
compiler:llvm For issues that relate to LLVM performance Must go faster

Comments

@Octogonapus
Copy link
Contributor

Octogonapus commented Mar 17, 2025

On the latest master (as of 3e7cf64), autovectorization fails when using AllocArrays.jl v0.1.1 (latest) for an array copy:

using AllocArrays

function mycopy(dest, src, iter)
    # precondition: dest and src do not alias
    # precondition: the iterators of dest and src are equal
    for i in iter
        @inbounds dest[i] = src[i]
    end
    return dest
end

b = BumperAllocator(2^30);
arr = rand(Float32, 50000000);
arr2 = similar(arr);
a = AllocArray(arr);

# this is vectorized
mycopy(arr2, arr, eachindex(arr2));

# this is not
with_allocator(b) do
    c = similar(a)
    mycopy(c, a, eachindex(c))
end;

We can observe the effect on performance:

julia> @benchmark mycopy(arr2, arr, eachindex(arr2))
BenchmarkTools.Trial: 339 samples with 1 evaluation per sample.
 Range (min  max):  14.181 ms   16.030 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     14.609 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.704 ms ± 349.402 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁  ▁▁ ▂▄▅▄█▄▂▄▁▁  ▁    ▁
  ▄▄█▆▆██▇██████████▇▄█▇▆▄▆█▄▅▆▃▆▃▅▄▄▄▃▃▃▄▁▃▁▁▃▄▁▃▁▁▁▁▁▃▁▁▁▁▁▃ ▄
  14.2 ms         Histogram: frequency by time           16 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark with_allocator(() -> mycopy(c, a, eachindex(c)), b) setup=(reset!(b);c=similar(a);)
BenchmarkTools.Trial: 213 samples with 1 evaluation per sample.
 Range (min  max):  20.352 ms  55.386 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     22.032 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.426 ms ±  2.889 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▆█▂▂▆▅▇▃
  ▄▅█████████▆▄▆▄▃▅▁▁▃▁▃▁▁▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃ ▃
  20.4 ms         Histogram: frequency by time        33.6 ms <

 Memory estimate: 304 bytes, allocs estimate: 12.

Here is the LLVM code for the base case: https://pastebin.com/gCbp1uEX
and for the AllocArrays case: https://pastebin.com/bKYmswnd

@Octogonapus
Copy link
Contributor Author

AllocArrays v0.1.1 also includes some weird dispatch due to the latest compatible version of Bumper.jl depending on StrideArrraysCore.jl, but even without that disruptive package the generated code is the same.

@IanButterworth IanButterworth added performance Must go faster compiler:llvm For issues that relate to LLVM labels Mar 17, 2025
@gbaraldi
Copy link
Member

Ok, this is a known issue we have currently with the gc_loaded intrinsic. I need to refactor that code (which is not super simple) but it should allow for the optimizations to happen

@giordano
Copy link
Contributor

giordano commented Mar 17, 2025

20 ms vs 14 ms doesn't sound to be only due to missed vectorisation (you'd see a factor of some integer multiple of 2 of difference if that was the main bottleneck), maybe the code is more complicated overall?

@ericphanson
Copy link
Contributor

ericphanson commented Mar 17, 2025

fwiw here is a reduction to UnsafeArrays which is used under the hood:

using UnsafeArrays, BenchmarkTools

function mycopy(dest, src, iter)
    # precondition: dest and src do not alias
    # precondition: the iterators of dest and src are equal
    for i in iter
        @inbounds dest[i] = src[i]
    end
    return dest
end

arr = rand(Float32, 50000000);
arr2 = similar(arr);
display(@benchmark mycopy(arr2, arr, eachindex(arr2)))


arr3 = copy(arr)
u_arr = UnsafeArray(pointer(arr3), size(arr3))
display(@benchmark mycopy(arr2, u_arr, eachindex(arr2)))

which gives on my m1 laptop

julia> display(@benchmark mycopy(arr2, arr, eachindex(arr2)))
BenchmarkTools.Trial: 737 samples with 1 evaluation per sample.
 Range (min  max):  6.654 ms    8.678 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     6.702 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.775 ms ± 257.431 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▇▅▄▃▁▁                                                     
  ████████▇██▇▆▆▄▆▆▄▁▄▁▁▁▁▄▁▁▄▄▅▄▄▅▄▄▆▄▁▄▁▁▄▄▄▁▁▁▁▄▁▁▄▅▁▄▄▅▅▄ █
  6.65 ms      Histogram: log(frequency) by time      8.08 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> display(@benchmark mycopy(arr2, u_arr, eachindex(arr2)))
BenchmarkTools.Trial: 307 samples with 1 evaluation per sample.
 Range (min  max):  15.804 ms  26.397 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     15.964 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.306 ms ±  1.103 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▄▃▄▂▁                                                      
  ████████▇▇█▄▆▁▁▁▄▄▁▄▁▁▁▁▄▅▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▆
  15.8 ms      Histogram: log(frequency) by time      23.5 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

@giordano
Copy link
Contributor

julia> code_llvm(mycopy, (UnsafeArray{Float32, 1}, UnsafeArray{Float32, 1}, UnitRange{Int64}))
; Function Signature: mycopy(UnsafeArrays.UnsafeArray{Float32, 1}, UnsafeArrays.UnsafeArray{Float32, 1}, Base.UnitRange{Int64})
;  @ REPL[4]:1 within `mycopy`
define void @julia_mycopy_2803(ptr noalias nocapture noundef nonnull sret({ ptr, [1 x i64] }) align 8 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 8 dereferenceable(16) %"dest::UnsafeArray", ptr nocapture noundef nonnull readonly align 8 dereferenceable(16) %"src::UnsafeArray", ptr nocapture noundef nonnull readonly align 8 dereferenceable(16) %"iter::UnitRange") local_unnamed_addr #0 {
top:
;  @ REPL[4]:4 within `mycopy`
; ┌ @ range.jl:917 within `iterate`
; │┌ @ range.jl:688 within `isempty`
; ││┌ @ range.jl:859 within `last`
; │││┌ @ Base_compiler.jl:55 within `getproperty`
      %"iter::UnitRange.stop_ptr" = getelementptr inbounds i8, ptr %"iter::UnitRange", i64 8
; ││└└
; ││┌ @ operators.jl:425 within `>`
; │││┌ @ int.jl:83 within `<`
      %"iter::UnitRange.stop_ptr.unbox" = load i64, ptr %"iter::UnitRange.stop_ptr", align 8
      %"iter::UnitRange.unbox" = load i64, ptr %"iter::UnitRange", align 8
      %.not = icmp slt i64 %"iter::UnitRange.stop_ptr.unbox", %"iter::UnitRange.unbox"
; │└└└
   br i1 %.not, label %L64, label %L9

L9:                                               ; preds = %top
   %"src::UnsafeArray.unbox" = load ptr, ptr %"src::UnsafeArray", align 8
   %"src::UnsafeArray.unbox27" = ptrtoint ptr %"src::UnsafeArray.unbox" to i64
   %"dest::UnsafeArray.unbox" = load ptr, ptr %"dest::UnsafeArray", align 8
; └
  %"dest::UnsafeArray.unbox26" = ptrtoint ptr %"dest::UnsafeArray.unbox" to i64
  %0 = add i64 %"iter::UnitRange.stop_ptr.unbox", 1
  %1 = sub i64 %0, %"iter::UnitRange.unbox"
  %min.iters.check = icmp ult i64 %1, 16
  %2 = sub i64 %"dest::UnsafeArray.unbox26", %"src::UnsafeArray.unbox27"
  %diff.check = icmp ult i64 %2, 64
  %or.cond = select i1 %min.iters.check, i1 true, i1 %diff.check
  br i1 %or.cond, label %L14, label %vector.ph

vector.ph:                                        ; preds = %L9
  %n.vec = and i64 %1, -16
  %ind.end = add i64 %"iter::UnitRange.unbox", %n.vec
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = add i64 %"iter::UnitRange.unbox", %index
;  @ REPL[4]:5 within `mycopy`
; ┌ @ /Users/mose/.julia/packages/UnsafeArrays/xbma7/src/unsafe_array.jl:55 within `getindex`
; │┌ @ pointer.jl:153 within `unsafe_load`
    %3 = add i64 %offset.idx, -1
    %4 = getelementptr inbounds float, ptr %"src::UnsafeArray.unbox", i64 %3
    %5 = getelementptr inbounds i8, ptr %4, i64 16
    %6 = getelementptr inbounds i8, ptr %4, i64 32
    %7 = getelementptr inbounds i8, ptr %4, i64 48
    %wide.load = load <4 x float>, ptr %4, align 1
    %wide.load28 = load <4 x float>, ptr %5, align 1
    %wide.load29 = load <4 x float>, ptr %6, align 1
    %wide.load30 = load <4 x float>, ptr %7, align 1
; └└
; ┌ @ /Users/mose/.julia/packages/UnsafeArrays/xbma7/src/unsafe_array.jl:60 within `setindex!`
; │┌ @ pointer.jl:180 within `unsafe_store!`
    %8 = getelementptr inbounds float, ptr %"dest::UnsafeArray.unbox", i64 %3
    %9 = getelementptr inbounds i8, ptr %8, i64 16
    %10 = getelementptr inbounds i8, ptr %8, i64 32
    %11 = getelementptr inbounds i8, ptr %8, i64 48
    store <4 x float> %wide.load, ptr %8, align 1
    store <4 x float> %wide.load28, ptr %9, align 1
    store <4 x float> %wide.load29, ptr %10, align 1
    store <4 x float> %wide.load30, ptr %11, align 1
    %index.next = add nuw i64 %index, 16
    %12 = icmp eq i64 %index.next, %n.vec
    br i1 %12, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
; └└
;  @ REPL[4]:6 within `mycopy`
  %cmp.n = icmp eq i64 %1, %n.vec
  br i1 %cmp.n, label %L64, label %L14

L14:                                              ; preds = %L14, %middle.block, %L9
  %value_phi4 = phi i64 [ %15, %L14 ], [ %"iter::UnitRange.unbox", %L9 ], [ %ind.end, %middle.block ]
;  @ REPL[4]:5 within `mycopy`
; ┌ @ /Users/mose/.julia/packages/UnsafeArrays/xbma7/src/unsafe_array.jl:55 within `getindex`
; │┌ @ pointer.jl:153 within `unsafe_load`
    %pointerref_idx = add i64 %value_phi4, -1
    %13 = getelementptr inbounds float, ptr %"src::UnsafeArray.unbox", i64 %pointerref_idx
    %pointerref = load float, ptr %13, align 1
; └└
; ┌ @ /Users/mose/.julia/packages/UnsafeArrays/xbma7/src/unsafe_array.jl:60 within `setindex!`
; │┌ @ pointer.jl:180 within `unsafe_store!`
    %14 = getelementptr inbounds float, ptr %"dest::UnsafeArray.unbox", i64 %pointerref_idx
    store float %pointerref, ptr %14, align 1
; └└
;  @ REPL[4]:6 within `mycopy`
; ┌ @ range.jl:921 within `iterate`
; │┌ @ promotion.jl:632 within `==`
    %.not23 = icmp eq i64 %value_phi4, %"iter::UnitRange.stop_ptr.unbox"
; │└
   %15 = add i64 %value_phi4, 1
; └
  br i1 %.not23, label %L64, label %L14

L64:                                              ; preds = %L14, %middle.block, %top
;  @ REPL[4]:7 within `mycopy`
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(16) %sret_return, ptr noundef nonnull align 8 dereferenceable(16) %"dest::UnsafeArray", i64 16, i1 false)
  ret void
}

That looks vectorised to me.

@giordano
Copy link
Contributor

Ah, wait, the problem is between different types:

julia> code_llvm(mycopy, (Array{Float32, 1}, UnsafeArray{Float32, 1}, UnitRange{Int64}))
; Function Signature: mycopy(Array{Float32, 1}, UnsafeArrays.UnsafeArray{Float32, 1}, Base.UnitRange{Int64})
;  @ REPL[4]:1 within `mycopy`
define nonnull ptr @julia_mycopy_2852(ptr noundef nonnull align 8 dereferenceable(24) %"dest::Array", ptr nocapture noundef nonnull readonly align 8 dereferenceable(16) %"src::UnsafeArray", ptr nocapture noundef nonnull readonly align 8 dereferenceable(16) %"iter::UnitRange") local_unnamed_addr #0 {
top:
;  @ REPL[4]:4 within `mycopy`
; ┌ @ range.jl:917 within `iterate`
; │┌ @ range.jl:688 within `isempty`
; ││┌ @ range.jl:859 within `last`
; │││┌ @ Base_compiler.jl:55 within `getproperty`
      %"iter::UnitRange.stop_ptr" = getelementptr inbounds i8, ptr %"iter::UnitRange", i64 8
; ││└└
; ││┌ @ operators.jl:425 within `>`
; │││┌ @ int.jl:83 within `<`
      %"iter::UnitRange.stop_ptr.unbox" = load i64, ptr %"iter::UnitRange.stop_ptr", align 8
      %"iter::UnitRange.unbox" = load i64, ptr %"iter::UnitRange", align 8
      %.not = icmp slt i64 %"iter::UnitRange.stop_ptr.unbox", %"iter::UnitRange.unbox"
; │└└└
   br i1 %.not, label %L65, label %L9

L9:                                               ; preds = %top
   %"src::UnsafeArray.unbox" = load ptr, ptr %"src::UnsafeArray", align 8
   %memoryref_data = load ptr, ptr %"dest::Array", align 8
; └
  br label %L14

L14:                                              ; preds = %L14, %L9
  %value_phi4 = phi i64 [ %"iter::UnitRange.unbox", %L9 ], [ %1, %L14 ]
;  @ REPL[4]:5 within `mycopy`
; ┌ @ /Users/mose/.julia/packages/UnsafeArrays/xbma7/src/unsafe_array.jl:55 within `getindex`
; │┌ @ pointer.jl:153 within `unsafe_load`
    %pointerref_idx = add i64 %value_phi4, -1
    %0 = getelementptr inbounds float, ptr %"src::UnsafeArray.unbox", i64 %pointerref_idx
    %pointerref = load float, ptr %0, align 1
; └└
; ┌ @ array.jl:986 within `setindex!`
; │┌ @ array.jl:991 within `_setindex!`
    %memoryref_data6 = getelementptr inbounds float, ptr %memoryref_data, i64 %pointerref_idx
    store float %pointerref, ptr %memoryref_data6, align 4
; └└
;  @ REPL[4]:6 within `mycopy`
; ┌ @ range.jl:921 within `iterate`
; │┌ @ promotion.jl:632 within `==`
    %.not24 = icmp eq i64 %value_phi4, %"iter::UnitRange.stop_ptr.unbox"
; │└
   %1 = add i64 %value_phi4, 1
; └
  br i1 %.not24, label %L65, label %L14

L65:                                              ; preds = %L14, %top
;  @ REPL[4]:7 within `mycopy`
  ret ptr %"dest::Array"
}

@ericphanson
Copy link
Contributor

Ah, wait, the problem is between different types:

That feature is also present in @Octogonapus's original example, where a = AllocArray(arr) has a different backing than c = similar(a) and results in a different type parameter. Maybe we can workaround it by making both backed by unsafe arrays?

e.g.

using AllocArrays, BenchmarkTools, UnsafeArrays

function mycopy(dest, src, iter)
    # precondition: dest and src do not alias
    # precondition: the iterators of dest and src are equal
    for i in iter
        @inbounds dest[i] = src[i]
    end
    return dest
end

b = BumperAllocator(2^30);
arr = rand(Float32, 50000000);
arr2 = similar(arr);
a = AllocArray(arr);

println("With mycopy(arr2, arr, eachindex(arr2)):")
display(@benchmark mycopy(arr2, arr, eachindex(arr2)))

println("With a = AllocArray(a)")
display(@benchmark with_allocator(b) do
    c = similar(a)
    mycopy(c, a, eachindex(c))
end setup=(reset!(b)) evals=1)

println("With a_unsafe = AllocArray(UnsafeArray(pointer(a), size(a)))")
a_unsafe = AllocArray(UnsafeArray(pointer(a), size(a)))
display(@benchmark with_allocator(b) do
    c = similar(a_unsafe)
    mycopy(c, a_unsafe, eachindex(c))
end setup=(reset!(b)) evals=1)

gives

mycopy(arr2, arr, eachindex(arr2))
BenchmarkTools.Trial: 645 samples with 1 evaluation per sample.
 Range (min  max):  6.647 ms  39.328 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     6.839 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.743 ms ±  2.048 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▃▁  ▁              ▁▁                                     
  ███████▇███▇▇█▆▇▄▇▆▇███▇▇▆▇▇▆█▇▇▇█▇▆▆▅▁▅▅▄▆▅▄▄▄▅▁▁▁▁▁▁▁▁▄▄ ▇
  6.65 ms      Histogram: log(frequency) by time     13.1 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.
With a = AllocArray(a)
BenchmarkTools.Trial: 310 samples with 1 evaluation per sample.
 Range (min  max):  15.977 ms   17.831 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     16.065 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.156 ms ± 233.459 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▇█▅▅▃                                                        
  ██████▅▅█▄▆▆▅▃▃▅▃▃▃▁▃▃▂▃▂▁▂▁▃▁▃▃▃▁▃▃▃▁▃▁▁▃▁▁▁▂▂▂▁▃▁▂▁▁▁▂▁▁▂▂ ▃
  16 ms           Histogram: frequency by time           17 ms <

 Memory estimate: 368 bytes, allocs estimate: 13.
With a_unsafe = AllocArray(UnsafeArray(pointer(a), size(a)))
BenchmarkTools.Trial: 716 samples with 1 evaluation per sample.
 Range (min  max):  6.656 ms  51.063 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     6.763 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.973 ms ±  1.706 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▇▆▅▄▃▄▃▃▄▂▁▁        ▁                                     
  ███████████████▇▇▆▆▇▅██▇▇▇▇▇█▇▇▆▅▇▇▆▅▇██▆▄▅▄▅▄▄▄▅▄▄▄▄▁▁▄▁▄ █
  6.66 ms      Histogram: log(frequency) by time     8.14 ms <

 Memory estimate: 384 bytes, allocs estimate: 13.

@gbaraldi
Copy link
Member

The issue here is that LLVM can't legally emit a runtime alias check between a addrspace(13) pointer (the memory pointer) and a regular pointer (the UnsafeArrays.jl pointer). And because it can't prove that they don't alias (pointer is a thing sadly) it can't emit the vectorized loop

@ericphanson
Copy link
Contributor

an update here: my workaround to use unsafearrays isn't very good bc it's quite tricky to manually manage the lifetime of the backing array correctly. However, I don't think this bug is a big performance issue in the AllocArrays+ObjectDetector case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:llvm For issues that relate to LLVM performance Must go faster
Projects
None yet
Development

No branches or pull requests

5 participants