Faster (still slow) fallback matrix multiplication #590


Draft: christiangnrd wants to merge 6 commits into master from fastmatmul
Conversation

christiangnrd (Member) commented Apr 13, 2025

Taken from the KernelAbstractions.jl performant matmul example.

I had to make a few changes, such as using `unsafe_indices`, since the algorithm itself does the bounds checking; I was getting wrong results until I added that.

I also made it so `I` and `J` are only fetched once. I'm not sure if the old way was outdated or meant to prevent a bug I didn't encounter. Edit: I guess I found out why that was there. Why is it only necessary for some backends, and why does the other way work on nightly?

Finally, I made the tile size 16 instead of 32, since it cannot be set dynamically and Metal does not always have 1024 (32×32) threads per threadgroup available.
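For context, here is a minimal sketch of the tiling scheme (hypothetical and simplified, not the exact kernel in this PR; `tiled_matmul!` and its names are made up for illustration). Each workgroup stages one tile of `A` and `B` in shared memory, and since `unsafe_indices=true` removes KernelAbstractions' implicit masking of out-of-range workitems, every global load and store is guarded by hand:

```julia
using KernelAbstractions

# Hypothetical sketch: each (TILE × TILE) workgroup computes one tile of C.
# With unsafe_indices=true, KA does not mask out-of-range workitems, so
# all bounds checks below are the kernel's own responsibility.
@kernel unsafe_indices=true function tiled_matmul!(C, @Const(A), @Const(B))
    gi, gj = @index(Group, NTuple)  # which tile of C this group computes
    li, lj = @index(Local, NTuple)  # this workitem's position in the tile

    TILE = @uniform @groupsize()[1]

    tileA = @localmem eltype(C) (TILE, TILE)
    tileB = @localmem eltype(C) (TILE, TILE)

    N, M = size(C)
    K = size(A, 2)

    # Global output coordinates, fetched once up front (on some backends
    # values read after @synchronize may need special care; see below).
    I = (gi - 1) * TILE + li
    J = (gj - 1) * TILE + lj

    # Accumulator in private memory so it survives the barriers.
    acc = @private eltype(C) 1
    @inbounds acc[1] = zero(eltype(C))

    @inbounds for t in 0:(cld(K, TILE) - 1)
        k = t * TILE
        # Stage one tile of A and B, zero-padding out-of-range entries.
        tileA[li, lj] = (I <= N && k + lj <= K) ? A[I, k + lj] : zero(eltype(C))
        tileB[li, lj] = (k + li <= K && J <= M) ? B[k + li, J] : zero(eltype(C))
        @synchronize
        for kk in 1:TILE
            acc[1] += tileA[li, kk] * tileB[kk, lj]
        end
        @synchronize
    end

    # Guarded store: only in-range workitems write to C.
    @inbounds if I <= N && J <= M
        C[I, J] = acc[1]
    end
end
```

Launched with a `(16, 16)` workgroup and an `ndrange` rounded up to a multiple of the tile size, every group is full and the explicit guards handle the ragged edges.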

maleadt (Member) commented Apr 14, 2025

> I had to make a few changes, such as using `unsafe_indices`, since the algorithm itself does the bounds checking; I was getting wrong results until I added that.

Oof, that's bad, and unexpected. cc @vchuravy

> I guess I found out why that was there.

Care to elaborate?

Any performance numbers?

christiangnrd (Member, Author) commented Apr 14, 2025

> > I had to make a few changes, such as using `unsafe_indices`, since the algorithm itself does the bounds checking; I was getting wrong results until I added that.
>
> Oof, that's bad, and unexpected. cc @vchuravy

To reproduce, you can apply this patch:

Patch:

```diff
diff --git a/src/host/linalg.jl b/src/host/linalg.jl
index b59598f..2e51d9f 100644
--- a/src/host/linalg.jl
+++ b/src/host/linalg.jl
@@ -326,7 +326,7 @@ function LinearAlgebra.ldiv!(B::AbstractGPUVecOrMat,
 end
 
 # XXX: figure out how to do dynamically
-MAX_TILE_DIM = 16
+MAX_TILE_DIM = 2 # THIS CHANGE MADE TO SIMPLIFY MWE OUTPUT
 
 ## matrix multiplication
 # legacy method
@@ -346,7 +346,7 @@ function generic_matmatmul!(C::AbstractGPUMatrix{R}, A::AbstractGPUMatrix{T}, B:
         return fill!(C, zero(R))
     end
 
-    @kernel unsafe_indices=true function coalesced_matmul_kernel!(
+    @kernel function coalesced_matmul_kernel!(
             output, @Const(input1), @Const(input2), N, Q, M,
             ::Val{BANK} = Val(1),
         ) where {BANK}
@@ -408,7 +408,7 @@ function generic_matmatmul!(C::AbstractGPUMatrix{R}, A::AbstractGPUMatrix{T}, B:
         end
     end
 
-    coalesced_matmul_kernel!(get_backend(C), (MAX_TILE_DIM, MAX_TILE_DIM))(C, A, B, N, Q, M;ndrange=map(x -> ceil(Int,x/MAX_TILE_DIM)*MAX_TILE_DIM, size(C)))
+    coalesced_matmul_kernel!(get_backend(C), (MAX_TILE_DIM, MAX_TILE_DIM))(C, A, B, N, Q, M;ndrange=size(C))
     C
 end
 function generic_matmatmul!(C::AbstractArray{R}, A::AbstractArray{T}, B::AbstractArray{S}, add::MulAddMul) where {T,S,R}
```

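(As an aside, the `ndrange` expression the second hunk removes is what rounds each output dimension up to a multiple of the tile size, so that edge tiles still get full workgroups; a minimal illustration using the patch's `MAX_TILE_DIM = 2`:)

```julia
MAX_TILE_DIM = 2
# ceil(Int, x / t) * t rounds x up to the next multiple of t, so a 3×3
# output is covered by a 4×4 ndrange of 2×2 tiles.
ndrange = map(x -> ceil(Int, x / MAX_TILE_DIM) * MAX_TILE_DIM, (3, 3))  # (4, 4)
```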
And then when run using Metal/CUDA, it gives:

```julia
julia> using Metal, GPUArrays; a = Metal.ones(3,3); b = Metal.ones(3,3); c = Metal.zeros(3,3); GPUArrays.generic_matmatmul!(c,a,b, true, false)
Precompiling Metal...
  3 dependencies successfully precompiled in 13 seconds. 66 already precompiled.
3×3 MtlMatrix{Float32, Metal.PrivateStorage}:
 3.0  3.0   2.0
 3.0  3.0   2.0
 2.0  2.0  16.6155

# With CUDA
julia> using CUDA, GPUArrays; a = CUDA.ones(3,3); b = CUDA.ones(3,3); c = CUDA.zeros(3,3); GPUArrays.generic_matmatmul!(c,a,b, true, false)
Precompiling CUDA...
  3 dependencies successfully precompiled in 38 seconds. 96 already precompiled.
3×3 CuArray{Float32, 2, CUDA.DeviceMemory}:
 3.0  3.0  2.0
 3.0  3.0  2.0
 2.0  2.0  2.0
```

It does not seem to be broken with JLArrays, and with CUDA it is less broken in that the other three quadrants come out close to integer values. This happens on KA 0.9.34; I haven't tested the master branch.
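A quick way to check the breakage programmatically (a hypothetical snippet along the lines of the reproducer above, not part of the test suite) is to compare against the known result:

```julia
using Metal, GPUArrays, Test

a = Metal.ones(3, 3); b = Metal.ones(3, 3); c = Metal.zeros(3, 3)
GPUArrays.generic_matmatmul!(c, a, b, true, false)

# ones(3,3) * ones(3,3) should be exactly 3 everywhere; with the patch
# applied this test fails on Metal and CUDA.
@test Array(c) ≈ fill(3.0f0, 3, 3)
```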

> > I guess I found out why that was there.
>
> Care to elaborate?

Yes, sorry about that. CI was showing that on some platforms, `I` and `J` were no longer in scope/defined after an `@synchronize` call. Without looking into this case specifically, I assume it has to do with the code transformations the `@kernel` macro performs.
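If that is the cause, the hazard would look roughly like this (a hypothetical demo; `barrier_scope_demo!` is made up): values needed after an `@synchronize` have to be re-derived or kept in `@private` storage on the affected backends.

```julia
using KernelAbstractions

# On some backends @kernel splits the body at @synchronize, so a plain
# local read after the barrier may no longer be defined. Private
# storage (or re-fetching @index) survives the split.
@kernel function barrier_scope_demo!(out)
    i, j = @index(Global, NTuple)
    tmp = @private Int 1
    @inbounds tmp[1] = i + j
    @synchronize
    @inbounds out[i, j] = tmp[1]  # safe: read from private memory
end
```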

> Any performance numbers?

It seems to be at least as fast as the naive algorithm, and up to 4-5× faster.

Linux, Ryzen 3700X with RTX 3060 (note the different y-axes in the bottom row):
[benchmark figure: bench_all_1_3060]

M2 Max (30-core GPU):
[benchmark figure: bench_all_1]
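For anyone who wants to reproduce the numbers, a rough sketch of the measurement (hypothetical sizes and GFLOP/s accounting, assuming CUDA.jl and BenchmarkTools.jl; the plots above come from a fuller harness):

```julia
using CUDA, GPUArrays, BenchmarkTools

for n in (256, 1024, 4096)
    A = CUDA.rand(Float32, n, n)
    B = CUDA.rand(Float32, n, n)
    C = CUDA.zeros(Float32, n, n)
    # CUDA.@sync ensures the timing includes kernel completion.
    t = @belapsed CUDA.@sync GPUArrays.generic_matmatmul!($C, $A, $B, true, false)
    println("n = $n: ", round(2n^3 / t / 1e9; digits = 1), " GFLOP/s")
end
```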

@christiangnrd
Copy link
Member Author

christiangnrd commented Apr 19, 2025

Based on JuliaGPU/KernelAbstractions.jl#590 passing tests, maybe this should wait until GPUArrays supports KA v0.10.

christiangnrd force-pushed the fastmatmul branch 2 times, most recently from 3892d1b to 1499c12 on April 25, 2025 at 15:03.