@VarLad VarLad commented Nov 12, 2025

My attempt at porting the (shmem) mapreduce implementation from CUDA.jl to OpenCL.jl.
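
For context, the core of the shmem strategy is a per-workgroup tree reduction in local memory. Below is a minimal plain-Julia model of that algorithm: it runs sequentially on the CPU and only mirrors the structure of the device loop (work-item intrinsics and barriers are elided), so treat it as a sketch rather than the kernel in this PR:

```julia
# Plain-Julia model of one workgroup's local-memory (shmem) tree reduction.
# On-device, the inner loop runs across work-items in parallel, with a
# barrier between strides; here it is serialized for illustration only.
function workgroup_reduce(op, f, items::AbstractVector)
    shmem = map(f, items)        # each work-item writes f(x) into local memory
    stride = length(shmem) ÷ 2   # assumes a power-of-two workgroup size
    while stride >= 1
        for lid in 1:stride      # on-device: one iteration per work-item
            shmem[lid] = op(shmem[lid], shmem[lid + stride])
        end                      # on-device: a barrier would sit here
        stride ÷= 2
    end
    return shmem[1]              # work-item 1 holds the workgroup's result
end

workgroup_reduce(+, identity, Float32.(1:256))  # 32896.0f0
```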

Some observations from local benchmarks on a 10000 × 10000 Float32 matrix, benchmarking sum(x, dims=1) (see the sketch after this list):

  • On pocl: the serial_mapreduce kernel is 4x faster than this branch's parallel_mapreduce, which is itself 2x faster than the master branch's parallel_mapreduce.
  • On rusticl+zink (RTX 4060M): the parallel_mapreduce kernel reaches half the speed of CUDA.jl's shmem mapreduce, and is generally faster than AcceleratedKernels.sum.
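
For reference, the benchmarks above were of roughly this shape (a sketch, assuming OpenCL.jl's CLArray type and BenchmarkTools; device/platform selection is setup-specific and omitted):

```julia
using OpenCL, BenchmarkTools

# 10000 x 10000 Float32 input, as in the numbers above.
x = CLArray(rand(Float32, 10_000, 10_000))

# Column-wise sum, dispatching to the mapreduce implementation under test.
# Copying back with Array() materializes the result before timing ends.
@btime Array(sum($x; dims=1))
```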

For pocl, it could be that lowering local_size and global_size helps with performance.
For parallel_mapreduce, I noticed that the global_size at kernel launch is quite large, especially since the max_workgroup_size reported by the kernel is 4096 on pocl (for comparison, it is 768 in CUDA.jl and 1024 with the Zink backend in OpenCL.jl). A rough illustration follows below.
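
As a back-of-the-envelope illustration of the launch sizes involved (the one-workgroup-per-output-column mapping below is my assumption for illustration, not the PR's exact heuristic):

```julia
# sum(x, dims=1) on a 10000 x 10000 matrix produces 10000 outputs.
cols = 10_000

# pocl's reported limit, Zink's, CUDA.jl's, and a hypothetical manual cap.
for local_size in (4096, 1024, 768, 256)
    global_size = cols * local_size   # one workgroup per output column
    println("local_size = $local_size  =>  global_size = $global_size work-items")
end
```

Under this mapping, pocl's 4096-wide workgroups would launch ~41M work-items, versus ~7.7M at CUDA.jl's 768, which is why lowering local_size (and with it global_size) seems worth trying.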
I haven't tested values of local_size and global_size thoroughly, because the few values I tried setting manually made the kernel run endlessly (or at least for a substantially long time).
I will investigate this further in the future.

@VarLad VarLad changed the title from "Update mapreduce.jl" to "faster mapreduce.jl" on Nov 12, 2025