Pingpong GEMM from scratch

I wrote this kernel to see if I could match CUTLASS's "pingpong" GEMM algorithm using hand-written CUDA. I used https://github.com/pranjalssh/fast.cu by Pranjal Shankhdhar as a starting point, and was heavily inspired by the fantastic blog post "Outperforming cuBLAS on H100".

You can run a quick check of the kernel with:

make gemm && ./gemm

To run a sweep over a range of shapes:

python setup.py develop && python benchmark.py
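
When reading the results of a shape sweep, it can help to convert a measured time into throughput. The snippet below is only a sketch of that arithmetic and is not taken from benchmark.py: the function name `gemm_tflops` and the example shape and timing are made up for illustration. It uses the standard count of 2*M*N*K floating-point operations for an (M x K) by (K x N) matrix multiply.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Return throughput in TFLOP/s for an (m x k) @ (k x n) GEMM.

    Each of the m*n outputs needs k multiplies and k adds, so the
    total work is counted as 2*m*n*k floating-point operations.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical shape and timing, not a measured result.
    m, n, k = 8192, 8192, 8192
    elapsed = 1.5e-3  # seconds per GEMM, made up for illustration
    print(f"{m}x{n}x{k}: {gemm_tflops(m, n, k, elapsed):.1f} TFLOP/s")
```

Comparing this number against the GPU's advertised peak for the same data type gives a rough sense of how close a kernel is to the hardware limit.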

