faster mapreduce.jl #3
Draft
My attempt at porting the (shmem) mapreduce from CUDA.jl to OpenCL.jl.
Some observations from local benchmarks on a 10000 x 10000 Float32 matrix, comparing sum(x, dims=1) against AcceleratedKernels.sum. For pocl, it could be the case that lowering the local_size and global_size helps with performance.
For parallel_mapreduce, I noticed that the global_size in the kernel launch is quite large, especially since the max_workgroup_size reported for the kernel is 4096 under pocl (for contrast, it is 768 in CUDA.jl and 1024 with the Zink backend in OpenCL.jl).
I haven't explored the space of local_size and global_size values thoroughly, because the few values I tried setting manually made the kernel run endlessly (or at least for a substantially long time).
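To make those size experiments less ad hoc, one option is a small helper that derives a launch configuration from the device-reported maximum workgroup size, capping the number of workgroups so global_size stays bounded even on backends like pocl that report a large max_workgroup_size. This is a hypothetical sketch, not part of the PR; launch_config and the max_groups cap of 1024 are assumptions for illustration:

```julia
# Hypothetical helper: pick a 1-D launch configuration for a reduction.
# `max_wg` is the max_workgroup_size reported by the backend (e.g. 4096
# for pocl, 768 for CUDA.jl). Capping `groups` keeps global_size from
# exploding to one work-item per element on huge inputs.
function launch_config(n::Integer, max_wg::Integer; max_groups::Integer = 1024)
    local_size  = min(nextpow(2, n), max_wg)     # power-of-two workgroup size
    groups      = min(cld(n, local_size), max_groups)
    global_size = groups * local_size            # must be a multiple of local_size
    return (; local_size, global_size)
end
```

For a 10000 x 10000 Float32 matrix (10^8 elements) with max_wg = 4096, this caps the launch at 1024 workgroups instead of scaling global_size with the element count, which is one concrete knob to sweep when testing whether smaller sizes help pocl.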
Will investigate this further in the future.