-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Missed autovectorization on array copy using AllocArrays.jl #57799
Comments
AllocArrays v0.1.1 also includes some weird dispatch due to the latest compatible version of Bumper.jl depending on StrideArrraysCore.jl, but even without that disruptive package the generated code is the same. |
Ok, this is a known issue we have currently with the gc_loaded intrinsic. I need to refactor that code (which is not super simple) but it should allow for the optimizations to happen |
20 ms vs 14 ms doesn't sound to be only due to missed vectorisation (you'd see a factor of some integer multiple of 2 of difference if that was the main bottleneck), maybe the code is more complicated overall? |
fwiw here is a reduction to UnsafeArrays which is used under the hood: using UnsafeArrays, BenchmarkTools
function mycopy(dest, src, iter)
# precondition: dest and src do not alias
# precondition: the iterators of dest and src are equal
for i in iter
@inbounds dest[i] = src[i]
end
return dest
end
arr = rand(Float32, 50000000);
arr2 = similar(arr);
display(@benchmark mycopy(arr2, arr, eachindex(arr2)))
arr3 = copy(arr)
u_arr = UnsafeArray(pointer(arr3), size(arr3))
display(@benchmark mycopy(arr2, u_arr, eachindex(arr2))) which gives on my m1 laptop julia> display(@benchmark mycopy(arr2, arr, eachindex(arr2)))
BenchmarkTools.Trial: 737 samples with 1 evaluation per sample.
Range (min … max): 6.654 ms … 8.678 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.702 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.775 ms ± 257.431 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
██▇▅▄▃▁▁
████████▇██▇▆▆▄▆▆▄▁▄▁▁▁▁▄▁▁▄▄▅▄▄▅▄▄▆▄▁▄▁▁▄▄▄▁▁▁▁▄▁▁▄▅▁▄▄▅▅▄ █
6.65 ms Histogram: log(frequency) by time 8.08 ms <
Memory estimate: 16 bytes, allocs estimate: 1.
julia> display(@benchmark mycopy(arr2, u_arr, eachindex(arr2)))
BenchmarkTools.Trial: 307 samples with 1 evaluation per sample.
Range (min … max): 15.804 ms … 26.397 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 15.964 ms ┊ GC (median): 0.00%
Time (mean ± σ): 16.306 ms ± 1.103 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅▄▃▄▂▁
████████▇▇█▄▆▁▁▁▄▄▁▄▁▁▁▁▄▅▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▆
15.8 ms Histogram: log(frequency) by time 23.5 ms <
Memory estimate: 16 bytes, allocs estimate: 1. |
julia> code_llvm(mycopy, (UnsafeArray{Float32, 1}, UnsafeArray{Float32, 1}, UnitRange{Int64})) ; Function Signature: mycopy(UnsafeArrays.UnsafeArray{Float32, 1}, UnsafeArrays.UnsafeArray{Float32, 1}, Base.UnitRange{Int64})
; @ REPL[4]:1 within `mycopy`
define void @julia_mycopy_2803(ptr noalias nocapture noundef nonnull sret({ ptr, [1 x i64] }) align 8 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 8 dereferenceable(16) %"dest::UnsafeArray", ptr nocapture noundef nonnull readonly align 8 dereferenceable(16) %"src::UnsafeArray", ptr nocapture noundef nonnull readonly align 8 dereferenceable(16) %"iter::UnitRange") local_unnamed_addr #0 {
top:
; @ REPL[4]:4 within `mycopy`
; ┌ @ range.jl:917 within `iterate`
; │┌ @ range.jl:688 within `isempty`
; ││┌ @ range.jl:859 within `last`
; │││┌ @ Base_compiler.jl:55 within `getproperty`
%"iter::UnitRange.stop_ptr" = getelementptr inbounds i8, ptr %"iter::UnitRange", i64 8
; ││└└
; ││┌ @ operators.jl:425 within `>`
; │││┌ @ int.jl:83 within `<`
%"iter::UnitRange.stop_ptr.unbox" = load i64, ptr %"iter::UnitRange.stop_ptr", align 8
%"iter::UnitRange.unbox" = load i64, ptr %"iter::UnitRange", align 8
%.not = icmp slt i64 %"iter::UnitRange.stop_ptr.unbox", %"iter::UnitRange.unbox"
; │└└└
br i1 %.not, label %L64, label %L9
L9: ; preds = %top
%"src::UnsafeArray.unbox" = load ptr, ptr %"src::UnsafeArray", align 8
%"src::UnsafeArray.unbox27" = ptrtoint ptr %"src::UnsafeArray.unbox" to i64
%"dest::UnsafeArray.unbox" = load ptr, ptr %"dest::UnsafeArray", align 8
; └
%"dest::UnsafeArray.unbox26" = ptrtoint ptr %"dest::UnsafeArray.unbox" to i64
%0 = add i64 %"iter::UnitRange.stop_ptr.unbox", 1
%1 = sub i64 %0, %"iter::UnitRange.unbox"
%min.iters.check = icmp ult i64 %1, 16
%2 = sub i64 %"dest::UnsafeArray.unbox26", %"src::UnsafeArray.unbox27"
%diff.check = icmp ult i64 %2, 64
%or.cond = select i1 %min.iters.check, i1 true, i1 %diff.check
br i1 %or.cond, label %L14, label %vector.ph
vector.ph: ; preds = %L9
%n.vec = and i64 %1, -16
%ind.end = add i64 %"iter::UnitRange.unbox", %n.vec
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%offset.idx = add i64 %"iter::UnitRange.unbox", %index
; @ REPL[4]:5 within `mycopy`
; ┌ @ /Users/mose/.julia/packages/UnsafeArrays/xbma7/src/unsafe_array.jl:55 within `getindex`
; │┌ @ pointer.jl:153 within `unsafe_load`
%3 = add i64 %offset.idx, -1
%4 = getelementptr inbounds float, ptr %"src::UnsafeArray.unbox", i64 %3
%5 = getelementptr inbounds i8, ptr %4, i64 16
%6 = getelementptr inbounds i8, ptr %4, i64 32
%7 = getelementptr inbounds i8, ptr %4, i64 48
%wide.load = load <4 x float>, ptr %4, align 1
%wide.load28 = load <4 x float>, ptr %5, align 1
%wide.load29 = load <4 x float>, ptr %6, align 1
%wide.load30 = load <4 x float>, ptr %7, align 1
; └└
; ┌ @ /Users/mose/.julia/packages/UnsafeArrays/xbma7/src/unsafe_array.jl:60 within `setindex!`
; │┌ @ pointer.jl:180 within `unsafe_store!`
%8 = getelementptr inbounds float, ptr %"dest::UnsafeArray.unbox", i64 %3
%9 = getelementptr inbounds i8, ptr %8, i64 16
%10 = getelementptr inbounds i8, ptr %8, i64 32
%11 = getelementptr inbounds i8, ptr %8, i64 48
store <4 x float> %wide.load, ptr %8, align 1
store <4 x float> %wide.load28, ptr %9, align 1
store <4 x float> %wide.load29, ptr %10, align 1
store <4 x float> %wide.load30, ptr %11, align 1
%index.next = add nuw i64 %index, 16
%12 = icmp eq i64 %index.next, %n.vec
br i1 %12, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body
; └└
; @ REPL[4]:6 within `mycopy`
%cmp.n = icmp eq i64 %1, %n.vec
br i1 %cmp.n, label %L64, label %L14
L14: ; preds = %L14, %middle.block, %L9
%value_phi4 = phi i64 [ %15, %L14 ], [ %"iter::UnitRange.unbox", %L9 ], [ %ind.end, %middle.block ]
; @ REPL[4]:5 within `mycopy`
; ┌ @ /Users/mose/.julia/packages/UnsafeArrays/xbma7/src/unsafe_array.jl:55 within `getindex`
; │┌ @ pointer.jl:153 within `unsafe_load`
%pointerref_idx = add i64 %value_phi4, -1
%13 = getelementptr inbounds float, ptr %"src::UnsafeArray.unbox", i64 %pointerref_idx
%pointerref = load float, ptr %13, align 1
; └└
; ┌ @ /Users/mose/.julia/packages/UnsafeArrays/xbma7/src/unsafe_array.jl:60 within `setindex!`
; │┌ @ pointer.jl:180 within `unsafe_store!`
%14 = getelementptr inbounds float, ptr %"dest::UnsafeArray.unbox", i64 %pointerref_idx
store float %pointerref, ptr %14, align 1
; └└
; @ REPL[4]:6 within `mycopy`
; ┌ @ range.jl:921 within `iterate`
; │┌ @ promotion.jl:632 within `==`
%.not23 = icmp eq i64 %value_phi4, %"iter::UnitRange.stop_ptr.unbox"
; │└
%15 = add i64 %value_phi4, 1
; └
br i1 %.not23, label %L64, label %L14
L64: ; preds = %L14, %middle.block, %top
; @ REPL[4]:7 within `mycopy`
call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(16) %sret_return, ptr noundef nonnull align 8 dereferenceable(16) %"dest::UnsafeArray", i64 16, i1 false)
ret void
} That looks vectorised to me. |
Ah, wait, the problem is between different types: julia> code_llvm(mycopy, (Array{Float32, 1}, UnsafeArray{Float32, 1}, UnitRange{Int64})) ; Function Signature: mycopy(Array{Float32, 1}, UnsafeArrays.UnsafeArray{Float32, 1}, Base.UnitRange{Int64})
; @ REPL[4]:1 within `mycopy`
define nonnull ptr @julia_mycopy_2852(ptr noundef nonnull align 8 dereferenceable(24) %"dest::Array", ptr nocapture noundef nonnull readonly align 8 dereferenceable(16) %"src::UnsafeArray", ptr nocapture noundef nonnull readonly align 8 dereferenceable(16) %"iter::UnitRange") local_unnamed_addr #0 {
top:
; @ REPL[4]:4 within `mycopy`
; ┌ @ range.jl:917 within `iterate`
; │┌ @ range.jl:688 within `isempty`
; ││┌ @ range.jl:859 within `last`
; │││┌ @ Base_compiler.jl:55 within `getproperty`
%"iter::UnitRange.stop_ptr" = getelementptr inbounds i8, ptr %"iter::UnitRange", i64 8
; ││└└
; ││┌ @ operators.jl:425 within `>`
; │││┌ @ int.jl:83 within `<`
%"iter::UnitRange.stop_ptr.unbox" = load i64, ptr %"iter::UnitRange.stop_ptr", align 8
%"iter::UnitRange.unbox" = load i64, ptr %"iter::UnitRange", align 8
%.not = icmp slt i64 %"iter::UnitRange.stop_ptr.unbox", %"iter::UnitRange.unbox"
; │└└└
br i1 %.not, label %L65, label %L9
L9: ; preds = %top
%"src::UnsafeArray.unbox" = load ptr, ptr %"src::UnsafeArray", align 8
%memoryref_data = load ptr, ptr %"dest::Array", align 8
; └
br label %L14
L14: ; preds = %L14, %L9
%value_phi4 = phi i64 [ %"iter::UnitRange.unbox", %L9 ], [ %1, %L14 ]
; @ REPL[4]:5 within `mycopy`
; ┌ @ /Users/mose/.julia/packages/UnsafeArrays/xbma7/src/unsafe_array.jl:55 within `getindex`
; │┌ @ pointer.jl:153 within `unsafe_load`
%pointerref_idx = add i64 %value_phi4, -1
%0 = getelementptr inbounds float, ptr %"src::UnsafeArray.unbox", i64 %pointerref_idx
%pointerref = load float, ptr %0, align 1
; └└
; ┌ @ array.jl:986 within `setindex!`
; │┌ @ array.jl:991 within `_setindex!`
%memoryref_data6 = getelementptr inbounds float, ptr %memoryref_data, i64 %pointerref_idx
store float %pointerref, ptr %memoryref_data6, align 4
; └└
; @ REPL[4]:6 within `mycopy`
; ┌ @ range.jl:921 within `iterate`
; │┌ @ promotion.jl:632 within `==`
%.not24 = icmp eq i64 %value_phi4, %"iter::UnitRange.stop_ptr.unbox"
; │└
%1 = add i64 %value_phi4, 1
; └
br i1 %.not24, label %L65, label %L14
L65: ; preds = %L14, %top
; @ REPL[4]:7 within `mycopy`
ret ptr %"dest::Array"
} |
That feature is also present in @Octogonapus's original example, where e.g. using AllocArrays, BenchmarkTools, UnsafeArrays
function mycopy(dest, src, iter)
# precondition: dest and src do not alias
# precondition: the iterators of dest and src are equal
for i in iter
@inbounds dest[i] = src[i]
end
return dest
end
b = BumperAllocator(2^30);
arr = rand(Float32, 50000000);
arr2 = similar(arr);
a = AllocArray(arr);
println("With mycopy(arr2, arr, eachindex(arr2)):")
display(@benchmark mycopy(arr2, arr, eachindex(arr2)))
println("With a = AllocArray(a)")
display(@benchmark with_allocator(b) do
c = similar(a)
mycopy(c, a, eachindex(c))
end setup=(reset!(b)) evals=1)
println("With a_unsafe = AllocArray(UnsafeArray(pointer(a), size(a)))")
a_unsafe = AllocArray(UnsafeArray(pointer(a), size(a)))
display(@benchmark with_allocator(b) do
c = similar(a_unsafe)
mycopy(c, a_unsafe, eachindex(c))
end setup=(reset!(b)) evals=1)
gives mycopy(arr2, arr, eachindex(arr2))
BenchmarkTools.Trial: 645 samples with 1 evaluation per sample.
Range (min … max): 6.647 ms … 39.328 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.839 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.743 ms ± 2.048 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅▃▁ ▁ ▁▁
███████▇███▇▇█▆▇▄▇▆▇███▇▇▆▇▇▆█▇▇▇█▇▆▆▅▁▅▅▄▆▅▄▄▄▅▁▁▁▁▁▁▁▁▄▄ ▇
6.65 ms Histogram: log(frequency) by time 13.1 ms <
Memory estimate: 16 bytes, allocs estimate: 1.
With a = AllocArray(a)
BenchmarkTools.Trial: 310 samples with 1 evaluation per sample.
Range (min … max): 15.977 ms … 17.831 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 16.065 ms ┊ GC (median): 0.00%
Time (mean ± σ): 16.156 ms ± 233.459 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▅▅▃
██████▅▅█▄▆▆▅▃▃▅▃▃▃▁▃▃▂▃▂▁▂▁▃▁▃▃▃▁▃▃▃▁▃▁▁▃▁▁▁▂▂▂▁▃▁▂▁▁▁▂▁▁▂▂ ▃
16 ms Histogram: frequency by time 17 ms <
Memory estimate: 368 bytes, allocs estimate: 13.
With a_unsafe = AllocArray(UnsafeArray(pointer(a), size(a)))
BenchmarkTools.Trial: 716 samples with 1 evaluation per sample.
Range (min … max): 6.656 ms … 51.063 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.763 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.973 ms ± 1.706 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
██▇▆▅▄▃▄▃▃▄▂▁▁ ▁
███████████████▇▇▆▆▇▅██▇▇▇▇▇█▇▇▆▅▇▇▆▅▇██▆▄▅▄▅▄▄▄▅▄▄▄▄▁▁▄▁▄ █
6.66 ms Histogram: log(frequency) by time 8.14 ms <
Memory estimate: 384 bytes, allocs estimate: 13. |
The issue here is that LLVM can't legally emit a runtime alias check between a addrspace(13) pointer (the memory pointer) and a regular pointer (the UnsafeArrays.jl pointer). And because it can't prove that they don't alias ( |
an update here: my workaround to use unsafearrays isn't very good bc it's quite tricky to manually manage the lifetime of the backing array correctly. However, I don't think this bug is a big performance issue in the AllocArrays+ObjectDetector case. |
On the latest master (as of 3e7cf64), autovectorization fails when using AllocArrays.jl v0.1.1 (latest) for an array copy:
We can observe the effect on performance:
Here is the LLVM code for the base case: https://pastebin.com/gCbp1uEX
and for the AllocArrays case: https://pastebin.com/bKYmswnd
The text was updated successfully, but these errors were encountered: