shuffle! is as much as 2x slower than a naive implementation #57771

Open
tecosaur opened this issue Mar 14, 2025 · 9 comments
Labels: performance (Must go faster), randomness (Random number generation and the Random stdlib)

Comments

@tecosaur
Contributor

I suspect that changes in Julia's random number generation have caused the micro-optimisations in the implementation of shuffle! to harm rather than help performance.

function shuffle!(r::AbstractRNG, a::AbstractArray)
    # keep it consistent with `randperm!` and `randcycle!` if possible
    require_one_based_indexing(a)
    n = length(a)
    @assert n <= Int64(2)^52
    n == 0 && return a
    mask = 3
    @inbounds for i = 2:n
        j = 1 + rand(r, ltm52(i, mask))
        a[i], a[j] = a[j], a[i]
        i == 1 + mask && (mask = 2 * mask + 1)
    end
    return a
end
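
For context, ltm52(i, mask) builds a sampler that draws 52 raw random bits, masks them down to a power-of-two-sized range covering 0:i-1, and rejects anything >= i. A simplified sketch of that idea (illustrative only, not the actual Random internals, which go through Random.LessThan and Random.Masked; assumes `using Random` for AbstractRNG):

function ltm52_sketch(r::AbstractRNG, i::Integer, mask::Integer)
    # mask is 2^k - 1 with 2^k >= i, so `x & mask` cheaply reduces x to 0:mask;
    # anything >= i is rejected and redrawn, giving a uniform draw from 0:i-1.
    while true
        x = rand(r, UInt64) & 0x000fffffffffffff  # 52 raw bits
        x &= UInt64(mask)
        x < i && return Int(x)
    end
end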

On my machine, a naive Fisher-Yates shuffle outperforms shuffle! on an array of 10,000 floats by a factor of ~2 (x86_64 Linux, Julia 1.11.4):

julia> function myshuf!(vec)
           for i in eachindex(vec)
               j = rand(i:length(vec))
               vec[i], vec[j] = vec[j], vec[i]
           end
           vec
       end
myshuf! (generic function with 1 method)

julia> vec = rand(10000);

julia> @b shuffle!(vec)
40.325 μs

julia> @b myshuf!(vec)
20.085 μs
tecosaur added the performance and randomness labels on Mar 14, 2025
@tecosaur
Contributor Author

Relevant comments from @jakobnissen on Slack:

Ah, I think I know why. This Random.ltm52 is an optimisation that only makes sense for MersenneTwister since it produces 52 bits at a time.

Timothy will you make an issue? Basically search for uses of Random.ltm52 and check if it slows things down, and if it does, delete it entirely.

Urgh, no, your function is still faster on Julia 1.6 before Xoshiro. 🤷
Even weirder, just sampling from this Random.LessThan is slower than sampling from a range, which makes me wonder why it exists at all.
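
That last observation is easy to check in isolation. A minimal sketch, assuming BenchmarkTools (used later in this thread) and calling the unexported Random.ltm52 helper directly with the same two-argument form shuffle! uses:

using Random, BenchmarkTools

rng = Xoshiro(1)
# the masked 52-bit rejection sampler used inside shuffle! (1023 = nextpow(2, 1000) - 1)
@btime rand($rng, Random.ltm52(1000, 1023))
# plain range sampling, for comparison
@btime rand($rng, 1:1000)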

@oscardssmith
Member

How do speeds compare on smaller lists? I suspect our current implementation may be better for small sizes, while for large sizes it will be entirely memory-latency constrained.

@tecosaur
Contributor Author

tecosaur commented Mar 14, 2025

I don't think that's quite it; I see a ~2x performance difference across a wide range of sizes (the numbers are a little different from earlier since I'm using a different machine):

julia> @b rand(10) shuffle!, myshuf!
(57.758 ns, 34.702 ns)

julia> @b rand(100) shuffle!, myshuf!
(575.500 ns, 336.250 ns)

julia> @b rand(1000) shuffle!, myshuf!
(5.420 μs, 3.256 μs)

julia> @b rand(10000) shuffle!, myshuf!
(62.638 μs, 33.163 μs)

julia> @b rand(100000) shuffle!, myshuf!
(650.934 μs, 384.339 μs)

@Tortar
Contributor

Tortar commented Mar 14, 2025

Continuing your benchmarks, it seems that for very big sizes myshuf! becomes slower at a certain point:

julia> @b rand(1000000) shuffle!, myshuf!
(5.386 ms, 4.846 ms)

julia> @b rand(10000000) shuffle!, myshuf!
(320.834 ms (without a warmup), 358.417 ms (without a warmup))

julia> @b rand(100000000) shuffle!, myshuf!
(2.577 s (without a warmup), 3.672 s (without a warmup))

@Tortar
Contributor

Tortar commented Mar 14, 2025

If you pass the RNG explicitly, the naive version loses at big sizes with both Xoshiro and MersenneTwister, and it only wins with MersenneTwister at small sizes on my machine:

julia> function myshuf!(rng, vec)
           for i in eachindex(vec)
               j = rand(rng, i:length(vec))
               vec[i], vec[j] = vec[j], vec[i]
           end
           vec
       end
myshuf! (generic function with 2 methods)

julia> rng  = Xoshiro(44)
Xoshiro(0x9b2cf7b0c54d4332, 0x9be7d4a5da243c86, 0x9f468cb373bd439c, 0x327b6b993dd15fb6, 0xeca526544725e8ca)

julia> @b rand(10) shuffle!($rng, _), myshuf!($rng, _)
(37.881 ns, 37.456 ns)

julia> @b rand(100000) shuffle!($rng, _), myshuf!($rng, _)
(389.189 μs, 389.811 μs)

julia> @b rand(10000000) shuffle!($rng, _), myshuf!($rng, _)
(211.260 ms (without a warmup), 364.364 ms (without a warmup))

julia> rng = MersenneTwister(42)
MersenneTwister(42)

julia> @b rand(10) shuffle!($rng, _), myshuf!($rng, _)
(53.326 ns, 48.015 ns)

julia> @b rand(100000) shuffle!($rng, _), myshuf!($rng, _)
(648.531 μs, 505.550 μs)

julia> @b rand(10000000) shuffle!($rng, _), myshuf!($rng, _)
(212.348 ms (without a warmup), 376.028 ms (without a warmup))

@tecosaur
Contributor Author

> Continuing your benchmarks, it seems that for very big sizes myshuf! becomes slower at a certain point:

I suspect this may be machine dependent, e.g. on my end:

julia> @b rand(100000000) shuffle!, myshuf!
(3.003 s (without a warmup), 2.861 s (without a warmup))

> If you pass the RNG explicitly, the naive version loses at big sizes with both Xoshiro and MersenneTwister, and it only wins with MersenneTwister at small sizes on my machine:

Interesting, here's what I get:

julia> rng = Random.default_rng()
TaskLocalRNG()

julia> @b rand(10) shuffle!($rng, _), myshuf!($rng, _)
(57.634 ns, 35.446 ns)

julia> @b rand(100000) shuffle!($rng, _), myshuf!($rng, _)
(662.496 μs, 384.037 μs)

julia> rng = Xoshiro(123)
Xoshiro(0xfefa8d41b8f5dca5, 0xf80cc98e147960c1, 0x20e2ccc17662fc1d, 0xea7a7dcb2e787c01, 0xf4e85a418b9c4f80)

julia> @b rand(10) shuffle!($rng, _), myshuf!($rng, _)
(40.902 ns, 41.639 ns)

julia> @b rand(100000) shuffle!($rng, _), myshuf!($rng, _)
(435.856 μs, 427.771 μs)

julia> @b rand(10000000) shuffle!($rng, _), myshuf!($rng, _)
(193.804 ms (without a warmup), 254.046 ms (without a warmup))

I'm not sure that there's a clear overall conclusion here. I thought that TaskLocalRNG was a Xoshiro instance, but an explicit Xoshiro RNG is faster/slower depending on the shuffle implementation??

@jakobnissen
Member

jakobnissen commented Mar 17, 2025

TaskLocalRNG is a Xoshiro, but its state is stored inline in the current task, so there is a little extra overhead in getting the current task and loading the RNG state from it.
From a little investigation, I find that:

  • The use of Random.ltm52 consistently makes things slower compared to sampling from a range. This object should be deleted.
  • For long vectors, the iteration order matters. Sampling j from 1:i is better than from i:end, since the former is more likely to be a cache hit, and it is what shuffle! does. I don't know whether what shuffle! does is statistically unbiased; the Fisher-Yates algorithm as usually written does what myshuf! does, not what shuffle! does. From a quick glance at some test values, it doesn't look like there is a bias (a quick frequency check is sketched at the end of this comment).
  • Inlining matters. On my computer on Julia master, the RNG call in shuffle! inlines, but the one in myshuf! does not. Part of this is due to manual inlining annotations on Random.ltm52, which we can also use when sampling from a UnitRange. Be careful of this when benchmarking sampling from a Random.ltm52 versus a UnitRange.
  • shuffle! uses @inbounds (probably wrongly).

So, using the following function (which awkwardly manually inlines):

function myshuf2!(rng::AbstractRNG, vec::AbstractVector)
    Base.require_one_based_indexing(vec)
    for i in 2:length(vec)
        # Make sure to inline all the indirection in sampling. Sampling from UInt range
        # is slightly faster.
        j = @inline rand(rng, Random.Sampler(rng, UInt(0):(i-1)%UInt, Val(1))) % Int + 1
        vec[i], vec[j] = vec[j], vec[i]
    end
    vec
end

I get the following timings:

julia> @b rand(10) shuffle!($rng, _), myshuf!($rng, _), myshuf2!($rng, _)
(45.813 ns, 45.847 ns, 17.282 ns)

julia> @b rand(10_000) shuffle!($rng, _), myshuf!($rng, _), myshuf2!($rng, _)
(47.852 μs, 40.247 μs, 12.404 μs)

julia> @b rand(10_000_000) shuffle!($rng, _), myshuf!($rng, _), myshuf2!($rng, _)
(157.598 ms, 263.611 ms, 93.257 ms)

Removing the @inbounds annotations doesn't make a big difference, even for small vectors, so we probably should remove them.
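
As a rough empirical check of the bias question above, one can count how often each permutation of a tiny vector shows up (a sketch only; a frequency count is evidence of uniformity, not a proof):

using Random

function perm_counts(shufflefn!, n::Int, trials::Int)
    # Shuffle a fresh 1:n vector `trials` times and tally each resulting permutation.
    rng = Xoshiro(1)
    counts = Dict{Vector{Int},Int}()
    for _ in 1:trials
        v = shufflefn!(rng, collect(1:n))
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

# Each of the 3! = 6 permutations should occur roughly trials/6 times,
# for shuffle!, myshuf!, and myshuf2! alike:
perm_counts(myshuf2!, 3, 6_000_000)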

@Tortar
Contributor

Tortar commented Mar 17, 2025

On a related note, it seems like we can squeeze some more performance out of rand(::UnitRange); should this be fixed too?

julia> using Random, BenchmarkTools

julia> rng = Xoshiro(42)

julia> f(rng, i) = @inline rand(rng, Random.Sampler(rng, 1:i, Val(1)))
f (generic function with 2 methods)

julia> g(rng, i) = rand(rng, 1:i)
g (generic function with 1 method)

julia> @benchmark f($rng, $10)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  4.017 ns … 49.573 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.028 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.059 ns ±  0.529 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

          █                                                   
  ▂▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▅▅▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂ ▂
  4.02 ns        Histogram: frequency by time        4.09 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark g($rng, $10)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  4.969 ns … 33.442 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.980 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.011 ns ±  0.386 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █     ▃            ▂                                        
  █▁▁▁▁▁█▅▁▁▁▁▂▂▁▁▁▁▁█▁▁▁▁▁▆▄▁▁▁▁▃▃▁▁▁▁▁▃▁▁▁▁▁▂▂▁▁▁▁▃▂▁▁▁▁▁▃ ▂
  4.97 ns        Histogram: frequency by time        5.06 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@jakobnissen
Member

Yeah that's a good idea. Probably the solution here is not to write a custom method, but instead to make sure all these helper functions (e.g. instantiating Sampler etc) inline.
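
One way to see whether those helpers currently inline (a suggestion, not something tried in this thread) is to inspect the optimized typed code of the plain range-sampling call and look for calls into Random that were not inlined:

using Random, InteractiveUtils

g(rng, i) = rand(rng, 1:i)   # same definition as above
# Non-inlined calls (e.g. to Random.Sampler or rand on a Sampler) left in the output
# are candidates for @inline annotations in the Random stdlib.
@code_typed g(Xoshiro(42), 10)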
