Adding new cublas flags? #13316

mkgai · 2025-05-05T15:20:28Z


Greetings -

Can anyone point me to the headers I need in order to start experimenting, by trial and error, with various new cuBLAS build variables? Should I be working with the fattn...cuh and related ggml headers, or are there llama-specific headers I should look into?

I appreciate any pointers - novice here, although I'm sure I've made that apparent already.

There are some preexisting variables I want to work with, and also some new ones following recent changes in CUDA Toolkit 12.9.

I want to set and benchmark:
cublasSetEmulationStrategy() cublasEmulationStrategy_t
cublasEmulationStrategy_t CUBLAS_EMULATION_STRATEGY_PERFORMANT
vs
cublasEmulationStrategy_t CUBLAS_EMULATION_STRATEGY_EAGER

Since CUBLAS_TENSOR_OP_MATH is deprecated, I also want to select either CUBLAS_DEFAULT_MATH or CUBLAS_TF32_TENSOR_OP_MATH directly.

It would also be nice to run in bfloat16 occasionally, which I imagine would look something like:
cublasLtMatmul() cublasComputeType_t
cublasComputeType_t CUBLAS_COMPUTE_32F_FAST_16BF

I assume CMake won't know how to handle these build variables without specific corresponding flags, right?

Anyway, if anyone can suggest a starting point, that would be great. Thanks

JohannesGaessler · 2025-05-07T08:17:10Z

Collaborator

In ggml/src/ggml-cuda/ggml-cuda.cu there is a function ggml_cuda_compute_forward that serves as an entrypoint for all ggml tensor operations. Matrix multiplications then go to ggml_cuda_mul_mat where the exact kernel is chosen based on data types and GPUs. The cuBLAS code specifically is in ggml_cuda_mul_mat_batched_cublas and ggml_cuda_op_mul_mat_cublas.

The situation with the latter is a bit complicated because it goes through ggml_cuda_op_mul_mat which assembles the tensors as contiguous buffers containing only a single matrix for ggml_cuda_op_mul_mat_cublas and also handles the case where tensor parallelism is attempted with --split-mode row. One of my current goals is to remove ggml_cuda_op_mul_mat and to only have direct calls to functions such as ggml_cuda_mul_mat_q. For this all matrix multiplications will need full support for non-contiguous inputs/outputs as well as batching. The tensor parallelism should be handled generically via ggml backends instead of inside the CUDA backend.

If you want to just play around with optimizing cuBLAS it's enough to just modify ggml_cuda_op_mul_mat_cublas and ggml_cuda_mul_mat_batched_cublas but a general refactor of the code would also very much be appreciated.
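
To make the last suggestion a bit more concrete: the math modes named in the question are runtime settings on a cuBLAS handle rather than build flags, so a quick experiment might pick one from an environment variable and apply it with cublasSetMathMode. The following is a minimal sketch under that assumption, not code from ggml. The MathChoice enum, the SKETCH_CUBLAS_MATH variable, and all three helper functions are hypothetical names, and the cuBLAS part is only compiled when cublas_v2.h is found.

```cpp
#include <cstdlib>
#include <string>

// Only pull in cuBLAS when the header is available, so the parsing part
// of this sketch still compiles on a machine without the CUDA toolkit.
#if defined(__has_include)
#  if __has_include(<cublas_v2.h>)
#    include <cublas_v2.h>
#    define SKETCH_HAVE_CUBLAS 1
#  endif
#endif

// Hypothetical local enum; not part of cuBLAS or ggml.
enum class MathChoice { Default, Tf32 };

// Parse "default" or "tf32". Returns false (and leaves `out` untouched)
// for any other string.
inline bool parse_math_choice(const std::string & s, MathChoice & out) {
    if (s == "default") { out = MathChoice::Default; return true; }
    if (s == "tf32")    { out = MathChoice::Tf32;    return true; }
    return false;
}

// Read the hypothetical environment variable SKETCH_CUBLAS_MATH.
// Falls back to `fallback` when it is unset or not recognized.
inline MathChoice math_choice_from_env(MathChoice fallback) {
    const char * v = std::getenv("SKETCH_CUBLAS_MATH");
    MathChoice parsed = fallback;
    if (v != nullptr) {
        parse_math_choice(v, parsed);
    }
    return parsed;
}

#ifdef SKETCH_HAVE_CUBLAS
// Apply the choice to an already created cuBLAS handle.
inline cublasStatus_t apply_math_choice(cublasHandle_t handle, MathChoice c) {
    const cublasMath_t mode = (c == MathChoice::Tf32)
        ? CUBLAS_TF32_TENSOR_OP_MATH
        : CUBLAS_DEFAULT_MATH;
    return cublasSetMathMode(handle, mode);
}
#endif
```

A call such as apply_math_choice(handle, math_choice_from_env(MathChoice::Default)) right after the handle is created would then let two benchmark runs differ only in their environment, with no rebuild. The emulation strategy from the question could presumably be switched the same way through cublasSetEmulationStrategy, but its exact signature should be checked against the CUDA 12.9 headers first.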

1 reply

mkgai May 9, 2025
Author

Got it, you have my thanks - I'll post progress if I get anywhere with this.

jukofyork · 2025-05-10T21:43:56Z

Collaborator

I tried using CUBLAS_COMPUTE_32F_FAST_16BF a while back when fiddling with the DeepSeek MLA code, and it seemed pretty much useless (I tried it on both an A6000 Ampere and an RTX 5000 Ada).

