shuffle! is as much as 2x slower than a naive implementation #57771
Relevant comments from @jakobnissen on Slack:
> How do speeds compare on smaller lists? I suspect our current implementation may be better for small sizes, while for large sizes it will be entirely memory-latency constrained.
I don't think that's quite it; I see a ~2x perf difference across a wide range of sizes (the numbers are a little different from earlier since I'm using a different machine):

```julia
julia> @b rand(10) shuffle!, myshuf!
(57.758 ns, 34.702 ns)

julia> @b rand(100) shuffle!, myshuf!
(575.500 ns, 336.250 ns)

julia> @b rand(1000) shuffle!, myshuf!
(5.420 μs, 3.256 μs)

julia> @b rand(10000) shuffle!, myshuf!
(62.638 μs, 33.163 μs)

julia> @b rand(100000) shuffle!, myshuf!
(650.934 μs, 384.339 μs)
```
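The single-argument `myshuf!` used in these calls is not shown in the thread; a minimal naive Fisher-Yates sketch consistent with the two-argument version quoted further down (the loop form here is an assumption) would be:

```julia
using Random

# Hypothetical one-argument method matching the benchmarks above:
# a forward Fisher-Yates pass that swaps each position with a uniformly
# chosen position from i:length(vec), using the task-local default RNG.
# Assumes 1-based indexing, as does the two-argument version below.
function myshuf!(vec::AbstractVector)
    for i in eachindex(vec)
        j = rand(i:length(vec))
        vec[i], vec[j] = vec[j], vec[i]
    end
    return vec
end
```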
Continuing your benchmarks, it seems like for very big sizes the gap narrows and eventually reverses:

```julia
julia> @b rand(1000000) shuffle!, myshuf!
(5.386 ms, 4.846 ms)

julia> @b rand(10000000) shuffle!, myshuf!
(320.834 ms (without a warmup), 358.417 ms (without a warmup))

julia> @b rand(100000000) shuffle!, myshuf!
(2.577 s (without a warmup), 3.672 s (without a warmup))
```
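For context, the `@b` comparison syntax and the `(without a warmup)` annotations in these timings look like Chairmarks.jl output; assuming that is the harness in use, a minimal setup reproducing this style of benchmark would be:

```julia
# Assumed setup: `@b` from Chairmarks.jl, `shuffle!` from Random.
using Random, Chairmarks

# `@b setup f, g` evaluates `setup` fresh for each sample and reports one
# timing per comma-separated function; `_` in a stage refers to the setup
# value, and `$x` interpolates a pre-built value such as an RNG.
@b rand(1000) shuffle!, myshuf!
```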
If you set the rng explicitly, the naive version is harmed at big sizes with both Xoshiro and MersenneTwister:

```julia
julia> function myshuf!(rng, vec)
           for i in eachindex(vec)
               j = rand(rng, i:length(vec))
               vec[i], vec[j] = vec[j], vec[i]
           end
           vec
       end
myshuf! (generic function with 2 methods)

julia> rng = Xoshiro(44)
Xoshiro(0x9b2cf7b0c54d4332, 0x9be7d4a5da243c86, 0x9f468cb373bd439c, 0x327b6b993dd15fb6, 0xeca526544725e8ca)

julia> @b rand(10) shuffle!($rng, _), myshuf!($rng, _)
(37.881 ns, 37.456 ns)

julia> @b rand(100000) shuffle!($rng, _), myshuf!($rng, _)
(389.189 μs, 389.811 μs)

julia> @b rand(10000000) shuffle!($rng, _), myshuf!($rng, _)
(211.260 ms (without a warmup), 364.364 ms (without a warmup))

julia> rng = MersenneTwister(42)
MersenneTwister(42)

julia> @b rand(10) shuffle!($rng, _), myshuf!($rng, _)
(53.326 ns, 48.015 ns)

julia> @b rand(100000) shuffle!($rng, _), myshuf!($rng, _)
(648.531 μs, 505.550 μs)

julia> @b rand(10000000) shuffle!($rng, _), myshuf!($rng, _)
(212.348 ms (without a warmup), 376.028 ms (without a warmup))
```
I suspect this may be machine-dependent; e.g. on my end:

```julia
julia> @b rand(100000000) shuffle!, myshuf!
(3.003 s (without a warmup), 2.861 s (without a warmup))
```
Interesting, here's what I get:

```julia
julia> rng = Random.default_rng()
TaskLocalRNG()

julia> @b rand(10) shuffle!($rng, _), myshuf!($rng, _)
(57.634 ns, 35.446 ns)

julia> @b rand(100000) shuffle!($rng, _), myshuf!($rng, _)
(662.496 μs, 384.037 μs)

julia> rng = Xoshiro(123)
Xoshiro(0xfefa8d41b8f5dca5, 0xf80cc98e147960c1, 0x20e2ccc17662fc1d, 0xea7a7dcb2e787c01, 0xf4e85a418b9c4f80)

julia> @b rand(10) shuffle!($rng, _), myshuf!($rng, _)
(40.902 ns, 41.639 ns)

julia> @b rand(100000) shuffle!($rng, _), myshuf!($rng, _)
(435.856 μs, 427.771 μs)

julia> @b rand(10000000) shuffle!($rng, _), myshuf!($rng, _)
(193.804 ms (without a warmup), 254.046 ms (without a warmup))
```

I'm not sure that there's a clear overall conclusion here. I thought that …
So, using the following function (which awkwardly inlines everything by hand):

```julia
function myshuf2!(rng::AbstractRNG, vec::AbstractVector)
    Base.require_one_based_indexing(vec)
    for i in 2:length(vec)
        # Make sure to inline all the indirection in sampling. Sampling from a UInt range
        # is slightly faster.
        j = @inline rand(rng, Random.Sampler(rng, UInt(0):(i-1)%UInt, Val(1))) % Int + 1
        vec[i], vec[j] = vec[j], vec[i]
    end
    vec
end
```

I get the following timings:

Removing the …
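As a side note on the `UInt`-range trick in `myshuf2!`: the sampled value lies in `0:i-1` as a `UInt`, so `% Int + 1` maps it back to a valid index in `1:i`. A small check of that property (the seed and sample count are arbitrary, chosen only for this sketch) might look like:

```julia
using Random

# Verify that the UInt-range sampling used in myshuf2! always yields an
# index in 1:i, i.e. it is interchangeable with rand(rng, 1:i).
rng = Xoshiro(1)   # arbitrary seed, just for this check
i = 7
js = [rand(rng, Random.Sampler(rng, UInt(0):(i - 1) % UInt, Val(1))) % Int + 1 for _ in 1:10_000]
@assert all(in(1:i), js)
```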
On a related note, it seems like we can squeeze some more performance out of the range sampling itself:

```julia
julia> using Random, BenchmarkTools

julia> rng = Xoshiro(42)

julia> f(rng, i) = @inline rand(rng, Random.Sampler(rng, 1:i, Val(1)))
f (generic function with 2 methods)

julia> g(rng, i) = rand(rng, 1:i)
g (generic function with 1 method)

julia> @benchmark f($rng, $10)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  4.017 ns … 49.573 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.028 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.059 ns ±  0.529 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

          █
  ▂▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▅▅▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂ ▂
  4.02 ns        Histogram: frequency by time        4.09 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark g($rng, $10)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  4.969 ns … 33.442 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.980 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.011 ns ±  0.386 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █     ▃           ▂
  █▁▁▁▁▁█▅▁▁▁▁▂▂▁▁▁▁▁█▁▁▁▁▁▆▄▁▁▁▁▃▃▁▁▁▁▁▃▁▁▁▁▁▂▂▁▁▁▁▃▂▁▁▁▁▁▃ ▂
  4.97 ns        Histogram: frequency by time        5.06 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Yeah, that's a good idea. Probably the solution here is not to write a custom method, but instead to make sure all these helper functions (e.g. instantiating …
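As a rough illustration of that direction (a sketch only, not the actual stdlib change and not code from this thread): if the generic range-sampling helpers inline cleanly, the readable `rand(rng, 1:i)` form could stay in the shuffle loop, e.g.:

```julia
using Random

# Hypothetical variant: keep the plain 1:i range sampling but force-inline
# the call so the Sampler construction does not go through an opaque call.
function inline_shuffle!(rng::AbstractRNG, vec::AbstractVector)
    Base.require_one_based_indexing(vec)
    for i in 2:length(vec)
        j = @inline rand(rng, 1:i)   # call-site @inline requires Julia ≥ 1.8
        vec[i], vec[j] = vec[j], vec[i]
    end
    return vec
end
```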
I suspect that changes in Julia's random number generation have caused micro-optimisations in the implementation of `shuffle!` to harm rather than help performance.

julia/stdlib/Random/src/misc.jl, lines 208 to 221 at 4db8c1b

On my machine, I observe that a naive Fisher-Yates shuffle outperforms `shuffle!` on an array of 10,000 floats by a factor of ~2 (x86_64 Linux, Julia 1.11.4).
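A minimal way to reproduce this comparison, reusing the naive `myshuf!` from the thread above (the BenchmarkTools calls here are illustrative, and exact numbers are machine-dependent):

```julia
using Random, BenchmarkTools

x = rand(10_000)

# evals=1 so the setup copy is remade before every evaluation,
# since both functions mutate their argument in place.
@btime shuffle!(y) setup = (y = copy($x)) evals = 1;
@btime myshuf!(y) setup = (y = copy($x)) evals = 1;
```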