Added wrapped C CUDA code and runnable examples #1

Merged
GiggleLiu merged 3 commits into main from dev on Jul 5, 2023

Conversation

@xuanzhaogao
Collaborator

No description provided.

@xuanzhaogao
Collaborator Author

I compared the speed of our CUDA code against the Tropical operator from GemmKernels in Julia 1.7; the results are shown below.
[image: benchmark results]
GemmKernels performs much better in Julia 1.7 than in Julia 1.6, but still reaches only about half of the maximum performance.

I wrapped our CUDA code as a .so library and added a Julia interface; related tests have also been created.

Comment thread src/SGEMM/CuTropicalSGEMM_example.jl Outdated
Comment thread src/SGEMM/CuTropicalSGEMM_example.jl Outdated
Comment thread src/SGEMM/TropicalSGemm.cu
Comment thread src/SGEMM/benchmark_GemmKerenls_Tropical.jl Outdated
@xuanzhaogao
Collaborator Author

xuanzhaogao commented Jul 5, 2023

Sorry for the previous chaos; I thought these parts would not be published as part of the package.

The following changes have been made:

  • The .so file is uploaded to a gist as an artifact, so there are no longer any binaries in the repo.
  • I relocated all the files into the src, test, and benchmark folders.
  • Scripts used for the benchmarks are provided, including the fallback implementation in CUDA.jl. However, I found something strange: CUDA.@sync does not seem to work when calling a function from a .so library, so I failed to benchmark our code in Julia.

The new benchmark result is shown here:
[image: benchmark results]

@GiggleLiu
Member

Hi @maleadt, I am mentoring an Open Source Promotion Plan student to implement Tropical GEMM on GPUs. Regarding the recent update in GemmKernels.jl (JuliaGPU/GemmKernels.jl#101), I suggested that he try GemmKernels.jl to make the implementation compatible with the Julia CUDA ecosystem. However, from the benchmark above, we can see its performance is not as good as that of the 600-line C code.

We might need your help to decide which way is technically more feasible:

  1. We can either implement the TropicalGEMM in C and port it to CUDA ecosystem, or
  2. try polishing the GemmKernels implementation.

Also, @ArrogantGao found that CUDA.@sync does not work when calling a function from a .so library. I cannot find any issue discussing how CUDA.@sync interacts with .so libraries, so if you could suggest a direction to investigate, that would be very helpful. If there is an existing sample project for reference, that would be perfect.

NOTE: All the benchmarks and implementations are included in this repo.

Member

@GiggleLiu GiggleLiu left a comment


I think the changes look great, well done!

Comment thread .vscode/settings.json Outdated
@@ -0,0 +1 @@
{} No newline at end of file
Member


vscode configuration files should not be committed.

Comment thread Artifacts.toml
Comment thread src/TropicalGemmFP32.cu
@@ -0,0 +1,627 @@
// This CUDA code is modified based on github repo https://github.com/Yinghan-Li/YHs_Sample, which is under GPL 3.0 License
Member


Holy, the GPL3 license, that is sexy. If we decide to keep this version in our code base, we have to include the GPL-3.0 license.

To “propagate” a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.

@maleadt

maleadt commented Jul 5, 2023

try polishing the GemmKernels implementation.

I would recommend doing so. An all-Julia implementation is always preferable, for so many reasons: support for different datatypes, easier to tune using metaprogramming instead of the hard-coded 128x128x8 here, easier for other people to contribute to, etc. The code generated by GemmKernels.jl is generally pretty good, so it should be possible to compare the generated PTX code of both implementations, and/or use NSight Compute to compare executions. Maybe it's something simple, like GemmKernels.jl not using ldg. It's possible that it's more serious, like how we use 64-bit integers for pointer arithmetic (JuliaGPU/CUDA.jl#1895), but I wouldn't expect so with a memory-bound kernel.

@xuanzhaogao
Collaborator Author

Removed the .vscode file and changed the license to GPL 3.0 (indeed, I also like that better).

@GiggleLiu
Member

GiggleLiu commented Jul 5, 2023

@maleadt Thank you for your prompt reply. @ArrogantGao Let us do some profiling and get some understanding of the performance issues.
This page explains how to profile GPU code with CUDA.jl: https://cuda.juliagpu.org/stable/development/profiling/

Let me merge the PR first and move the discussion to #2, where we can update the profiling results and the generated PTX code.

@GiggleLiu GiggleLiu merged commit 895cfa5 into main Jul 5, 2023
