Add mapreduce. #134

Draft
maleadt wants to merge 2 commits into main from tb/mapreduce

@maleadt (Member) commented Mar 21, 2026

Initial implementation. It already performs a little better than CUDA.jl, at least on the simple cases I tested (5.22 ms versus 7.9 ms of GPU time for sum over a 1024×1024×1024 array):

julia> A = CUDA.rand(1024, 1024, 1024);


julia> CUDA.@profile trace=true sum(A)

Device-side activity: GPU was busy for 7.9 ms (33.76% of the trace)
┌─────────┬─────────┬────────┬──────┬──────────────────┬───────────────────────┐
│    Time │ Threads │ Blocks │ Regs │       Shared Mem │ Name                  │
├─────────┼─────────┼────────┼──────┼──────────────────┼───────────────────────┤
│  7.9 ms │    1024 │     84 │   60 │ 128 bytes static │ simt_reduce_kernel    │
│ 2.38 µs │      96 │      1 │   70 │ 128 bytes static │ simt_reduce_kernel    │
└─────────┴─────────┴────────┴──────┴──────────────────┴───────────────────────┘


julia> CUDA.@profile trace=true sum(ct.Tiled(A))

Device-side activity: GPU was busy for 5.22 ms (2.55% of the trace)
┌───────────┬─────────┬─────────┬──────┬───────────────────┬─────────────────────┐
│      Time │ Threads │  Blocks │ Regs │        Shared Mem │ Name                │
├───────────┼─────────┼─────────┼──────┼───────────────────┼─────────────────────┤
│   5.22 ms │     128 │ 1×1×128 │   76 │ 32.035 KiB static │ tiled_reduce_kernel │
│   1.67 µs │     128 │       1 │   21 │   28 bytes static │ tiled_reduce_kernel │
└───────────┴─────────┴─────────┴──────┴───────────────────┴─────────────────────┘

It's a bit unfortunate that there's so much duplication with GPUArrays, but making Tiled <: AbstractGPUArray would probably break more than it is worth.
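To illustrate the trade-off, here is a minimal CPU-only sketch of a wrapper that opts into its own reduction path through dispatch without subtyping any array type. `TiledSketch` is a hypothetical stand-in, not the actual `ct.Tiled` definition, and the forwarding target here is simply the parent array's `mapreduce` where a real backend would launch its kernel. Because the wrapper is not an `AbstractArray`, nothing from the array interface comes for free and every supported operation has to be forwarded by hand, which is one plausible source of the duplication mentioned above.

```julia
# Hypothetical wrapper: holds an array but is deliberately NOT <: AbstractArray.
struct TiledSketch{A<:AbstractArray}
    parent::A
end

# Route reductions on the wrapper to a dedicated code path. In this sketch the
# "dedicated path" just calls the parent's mapreduce.
Base.mapreduce(f, op, t::TiledSketch; kwargs...) =
    mapreduce(f, op, t.parent; kwargs...)

# Convenience reductions must be forwarded explicitly as well.
Base.sum(t::TiledSketch; kwargs...) =
    mapreduce(identity, Base.add_sum, t; kwargs...)

t = TiledSketch([1, 2, 3])
println(sum(t))                  # reduction goes through the wrapper's method
println(mapreduce(abs2, +, t))   # so does a custom map/reduce pair
println(t isa AbstractArray)     # false: no array behavior is inherited
```

The upside of this shape is that existing `AbstractArray` methods can never be picked up by accident; the cost is that each entry point has to be written out.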

The implementation could be much improved, e.g., by using atomic reductions instead of a multi-phase reduction.
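For readers unfamiliar with the two strategies, the following is a rough CPU-thread analogy, not the kernel code from this PR. The function names `multiphase_sum` and `atomic_sum` and the `nchunks` keyword are made up for illustration. In the multi-phase version each chunk writes a partial result that a second pass combines (on a GPU, that second pass is a second kernel launch, as in the two-row profiles above). In the atomic version each chunk folds its partial result directly into a single accumulator, so no second pass is needed.

```julia
using Base.Threads: Atomic, atomic_add!, @threads

# Multi-phase: phase 1 produces one partial per chunk, phase 2 reduces the partials.
function multiphase_sum(xs::AbstractVector{T}; nchunks::Int = 4) where {T}
    isempty(xs) && return zero(T)
    chunks = collect(Iterators.partition(eachindex(xs), cld(length(xs), nchunks)))
    partials = zeros(T, length(chunks))
    @threads for c in eachindex(chunks)
        acc = zero(T)
        for i in chunks[c]
            acc += xs[i]
        end
        partials[c] = acc          # phase 1: store this chunk's partial result
    end
    return sum(partials)           # phase 2: combine the partials
end

# Atomic: each chunk adds its partial result straight into one shared accumulator.
function atomic_sum(xs::AbstractVector{T}; nchunks::Int = 4) where
        {T<:Union{Int32,Int64,Float32,Float64}}
    isempty(xs) && return zero(T)
    total = Atomic{T}(zero(T))
    chunks = collect(Iterators.partition(eachindex(xs), cld(length(xs), nchunks)))
    @threads for c in eachindex(chunks)
        acc = zero(T)
        for i in chunks[c]
            acc += xs[i]
        end
        atomic_add!(total, acc)    # single pass: no separate combine step
    end
    return total[]
end

xs = collect(1:1000)
println(multiphase_sum(xs))
println(atomic_sum(xs))
```

One caveat with the atomic variant: the order in which chunks are combined is not fixed, so floating-point results can differ slightly from run to run, whereas integer sums are unaffected.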
