@VarLad VarLad commented Nov 12, 2025

My attempt at porting the (shmem) mapreduce implementation from CUDA.jl to OpenCL.jl.
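
For context, the core of the shmem strategy is a per-workgroup tree reduction in local memory. Below is a minimal plain-Julia model of that algorithm: it runs sequentially on the CPU and only mirrors the structure of the device loop (work-item intrinsics and barriers are elided), so treat it as a sketch rather than the kernel in this PR:

```julia
# Plain-Julia model of one workgroup's local-memory (shmem) tree reduction.
# On-device, the inner loop runs across work-items in parallel, with a
# barrier between strides; here it is serialized for illustration only.
function workgroup_reduce(op, f, items::AbstractVector)
    shmem = map(f, items)        # each work-item writes f(x) into local memory
    stride = length(shmem) ÷ 2   # assumes a power-of-two workgroup size
    while stride >= 1
        for lid in 1:stride      # on-device: one iteration per work-item
            shmem[lid] = op(shmem[lid], shmem[lid + stride])
        end                      # on-device: a barrier would sit here
        stride ÷= 2
    end
    return shmem[1]              # work-item 1 holds the workgroup's result
end

workgroup_reduce(+, identity, Float32.(1:256))  # 32896.0f0
```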

Some observations from local benchmarks on a 10000 × 10000 Float32 matrix, benchmarking sum(x, dims=1) (see the sketch after this list):

  • On pocl: the serial_mapreduce kernel is 4x faster than this branch's parallel_mapreduce, which is itself 2x faster than the master branch's parallel_mapreduce.
  • On rusticl+zink (RTX 4060M): the parallel_mapreduce kernel reaches half the speed of CUDA.jl's shmem mapreduce, and is generally faster than AcceleratedKernels.sum.
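
For reference, the benchmarks above were of roughly this shape (a sketch, assuming OpenCL.jl's CLArray type and BenchmarkTools; device/platform selection is setup-specific and omitted):

```julia
using OpenCL, BenchmarkTools

# 10000 x 10000 Float32 input, as in the numbers above.
x = CLArray(rand(Float32, 10_000, 10_000))

# Column-wise sum, dispatching to the mapreduce implementation under test.
# Copying back with Array() materializes the result before timing ends.
@btime Array(sum($x; dims=1))
```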

For pocl, it could be that lowering local_size and global_size helps with performance.
For parallel_mapreduce, I noticed that the global_size at kernel launch is quite large, especially since the max_workgroup_size reported by the kernel is 4096 on pocl (for comparison, it is 768 in CUDA.jl and 1024 with the Zink backend in OpenCL.jl). A rough illustration follows below.
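
As a back-of-the-envelope illustration of the launch sizes involved (the one-workgroup-per-output-column mapping below is my assumption for illustration, not the PR's exact heuristic):

```julia
# sum(x, dims=1) on a 10000 x 10000 matrix produces 10000 outputs.
cols = 10_000

# pocl's reported limit, Zink's, CUDA.jl's, and a hypothetical manual cap.
for local_size in (4096, 1024, 768, 256)
    global_size = cols * local_size   # one workgroup per output column
    println("local_size = $local_size  =>  global_size = $global_size work-items")
end
```

Under this mapping, pocl's 4096-wide workgroups would launch ~41M work-items, versus ~7.7M at CUDA.jl's 768, which is why lowering local_size (and with it global_size) seems worth trying.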
I haven't tested values of local_size and global_size thoroughly, because the few values I tried setting manually made the kernel run endlessly (or at least for a substantially long time).
I will investigate this further in the future.

@VarLad VarLad changed the title from "Update mapreduce.jl" to "faster mapreduce.jl" on Nov 12, 2025