backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels #9921
Conversation
Force-pushed from e7974bb to c9c1afb.
I think this is a step in the right direction, but I am not convinced about the current implementation. Generally, the way changes in tensor layout are intended to be implemented is through the ggml-backend buffer interface: it gives more control to the application over which tensors will be changed, it allows changes to the tensor size, and the conversion would be done at load time. Doing it this way may cause some tensors to be unintentionally converted, such as a quantized KV cache. However, the llama.cpp model loader does not currently have a good way to support this, but I am working on that at the moment. Note that there are AVX implementations for these types as well:
diff --git a/ggml/src/ggml-aarch64.c b/ggml/src/ggml-aarch64.c
index 700e66a0..4060d78e 100644
--- a/ggml/src/ggml-aarch64.c
+++ b/ggml/src/ggml-aarch64.c
@@ -3305,4 +3305,12 @@ void ggml_prepare_optimal_kernel(struct ggml_tensor *cur, uint8_t **pmem, size_t
}
}
#endif
+
+#if defined(__AVX2__) || defined(__AVX512F__)
+ if (cur->type == GGML_TYPE_Q4_0) {
+ if (repack_q4_0_to_q4_0_8_bl(cur, 8, pmem, psize) == 0) {
+ cur->type = GGML_TYPE_Q4_0_8_8;
+ }
+ }
+#endif
}
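For illustration, here is a minimal, self-contained C sketch of the load-time conversion idea described above. It does not use the real ggml-backend API; fake_tensor, fake_buffer_set_tensor, and repack_rows are hypothetical stand-ins. The point is only that the repack happens once, when a weight is stored into a dedicated buffer, so the application controls which tensors are affected (a quantized KV cache would simply use a regular buffer).

/* Sketch only: hypothetical stand-ins, not the ggml-backend API. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum fake_type { TYPE_Q4_0, TYPE_OTHER };

struct fake_tensor {
    enum fake_type type;
    int            n_rows;     /* number of rows in the weight matrix */
    unsigned char *data;       /* backing storage owned by the buffer */
};

/* Hypothetical repack: would interleave groups of `interleave` rows so the
 * optimized kernels can load them contiguously. Here it only copies. */
static void repack_rows(struct fake_tensor *t, const void *src, size_t size, int interleave) {
    (void) interleave;
    memcpy(t->data, src, size);
}

/* set_tensor-style hook of the dedicated buffer type: only Q4_0 weights with a
 * compatible row count are converted; everything else is stored unchanged. */
static void fake_buffer_set_tensor(struct fake_tensor *t, const void *src, size_t size) {
    const int interleave = 4;
    if (t->type == TYPE_Q4_0 && t->n_rows % interleave == 0) {
        repack_rows(t, src, size, interleave);
    } else {
        memcpy(t->data, src, size);
    }
}

int main(void) {
    unsigned char src[64] = {0};
    struct fake_tensor w = { TYPE_Q4_0, 8, malloc(sizeof(src)) };
    fake_buffer_set_tensor(&w, src, sizeof(src));   /* conversion happens here, at load time */
    printf("weight loaded, %d rows\n", w.n_rows);
    free(w.data);
    return 0;
}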
@slaren Thank you for the review and valuable feedback! I understand the direction you're suggesting, particularly with aligning with the ggml-backend buffer interface. Could you provide more details on this approach? Specifically, I’m curious how it would integrate, given that the mulmat tensor is currently constructed during the graph build, which occurs after the model loader. I also wanted to ask about the timeline for the llama.cpp model loader improvements that would support this. If those changes aren’t expected to be completed soon, I suggest we merge the current PR with the necessary updates to ensure functionality in the short term. In parallel, we will start working on a more aligned implementation that integrates with the ggml-backend buffer interface. Please let me know your thoughts on this.
I am working on the llama model loader at the moment. One of the changes that I will make is that it will be able to choose the buffer type used to offload each tensor depending on the operations in which it will be used. This is mainly to prevent offloading tensors with types that are not supported by a backend, but it will also be useful for implementing this. It shouldn't take too long until this is merged.

I think that this approach is too error prone to merge as it is. There are at least two cases that I am aware of that will not work, and both of them will crash with this PR. It may be possible to fix these issues specifically, but fundamentally the problem is that modifying the tensors in the backend breaks the assumptions that applications make about the way ggml uses the tensors. It would be a constant source of problems, and it would be hard for other ggml applications to take advantage of this. In the meantime, llama.cpp users can already take advantage of the performance boost of these types by converting the model beforehand.
@slaren Thank you for your detailed feedback. I'll hold off on this PR and wait for your patch that allows the model loader to choose buffer types based on tensor operations. Once that is in place, I'll refactor my implementation accordingly.
It should be possible now to implement this with a new buffer type as outlined in this comment. Please let me know if you find any issues.
@slaren thank you for your guidance and for providing a clear direction on implementing the changes. I am currently working on refactoring the online flow patch to incorporate the new buffer type as outlined. I’ll make sure to reach out if I encounter any issues during the process.
Force-pushed from c9c1afb to ba21c82.
@slaren I've refactored the patch to add a new buffer type for runtime weight quantization on AARCH64 CPUs. This feature is enabled by the build option GGML_CPU_AARCH64.
Thanks. It looks like something went wrong in the last commit and thousands of files were added; I will try to review this after it is fixed.
Force-pushed from ba21c82 to e44a529.
Hmm... I've tried resetting to HEAD~1 and adding the corresponding files, but it still doesn't seem to be right. Not sure what went wrong. Is it OK if I create a new PR for the patch and close this one?
Up to you. I think the changes to
There are no changes to ggml-cpu.c from this patch.
I see that instead of checking the buffer type of the tensor in the matrix multiplication, it is changing the type of the tensor. This should not be done; it will lead to other problems. Instead, check the buffer type of the tensor in the matrix multiplication and add the necessary logic to pick the correct gemm function there.
@slaren Thanks for the quick review. Could you elaborate further on the issue? From a logical perspective, the weight tensor type changes as part of the weight reshape process.
The goal of using a buffer type to handle the conversion is to hide it from the user. As far as the user is concerned, the tensor type is Q4_0, and if they access the tensor data through the ggml-backend interface they should see Q4_0 data. You can use the buffer type of the tensor in the matrix multiplication to select the appropriate GEMM/GEMV functions.
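As a rough, standalone C illustration of the dispatch described here (all names are hypothetical placeholders; the real code would inspect the tensor's buffer type rather than a boolean flag): the tensor type stays Q4_0, and only the kernel selection depends on where the weight data lives.

/* Sketch of buffer-type based kernel selection; all names are placeholders. */
#include <stdbool.h>
#include <stdio.h>

struct weight {
    bool in_repacked_buffer;   /* stand-in for checking the tensor's buffer type */
};

static void gemv_q4_0(void)     { puts("reference gemv"); }
static void gemm_q4_0(void)     { puts("reference gemm"); }
static void gemv_q4_0_4x4(void) { puts("repacked gemv");  }
static void gemm_q4_0_4x4(void) { puts("repacked gemm");  }

/* The tensor type remains Q4_0 from the user's point of view; only the
 * kernel choice depends on how the weight data is stored. */
static void mul_mat(const struct weight *w, int batch_size) {
    if (w->in_repacked_buffer) {
        if (batch_size == 1) gemv_q4_0_4x4(); else gemm_q4_0_4x4();
    } else {
        if (batch_size == 1) gemv_q4_0(); else gemm_q4_0();
    }
}

int main(void) {
    struct weight w = { true };
    mul_mat(&w, 1);    /* single token: repacked gemv path */
    mul_mat(&w, 32);   /* prompt batch: repacked gemm path */
    return 0;
}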
@slaren thank you for the detailed explanation. I understand the importance of keeping the tensor type consistent to avoid the issues faced with the original implementation. I've updated the code to retain the tensor type as Q4_0 and instead use the buffer type to select the appropriate GEMM/GEMV functions. Please let me know if there are any other adjustments needed.
I have pushed some fixes, please check that everything still works as expected on aarch64 machines.

Question: what happens when the number of rows in the tensors is not a multiple of the interleave size? It looks to me like this would cause a buffer overflow. If so, that needs to be fixed so that the conversion is not used when the number of rows is not compatible.

I also found that the conversion is quite slow, particularly to q4_0_4x8. I imagine this could be improved significantly by improving the memory access pattern and using multiple threads.

@Srihari-mcw tagging you since you wrote the AVX implementation of Q4_0_8_8. Compared to Q4_0, on my machine I found that it improves gemm performance significantly, but gemv performance is much slower. If that were fixed, this could be enabled for x86 machines as well.
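A possible shape for the row-count guard mentioned above, as a small standalone C sketch (illustrative parameters, not actual ggml fields): the conversion is simply skipped whenever the rows do not divide evenly by the interleave size.

/* Sketch of the dimension guard: repack only when the number of rows is a
 * multiple of the interleave size; otherwise keep the plain Q4_0 layout. */
#include <stdbool.h>
#include <stdio.h>

static bool can_repack(long n_rows, int interleave_size) {
    if (interleave_size <= 0) {
        return false;
    }
    /* A partial group of rows would make the kernels read/write past the end
     * of the interleaved buffer, so fall back to the reference path instead. */
    return n_rows % interleave_size == 0;
}

int main(void) {
    printf("%d\n", can_repack(4096, 4));  /* 1: rows divide evenly, safe to repack */
    printf("%d\n", can_repack(4097, 4));  /* 0: skip the conversion */
    return 0;
}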
On a side note, we need a better name for these types than "aarch64", because at this point there are implementations for many other systems. Maybe "interleaved types".
@slaren thank you for the review. Regarding your question: yes, you are correct. If the number of rows in the tensors is not a multiple of the interleave size, it would indeed cause issues, potentially leading to a buffer overflow. I will verify that the patch still works as expected with the latest trunk and your fixes. Additionally, I will prepare a new commit to address the issues you raised, including proper handling for cases where the number of rows is incompatible.
Were you referring to renaming the new buffer type and the related functions? For instance, would renaming the buffer type function along those lines be what you had in mind?
Yes, something along these lines, but more generic. Rename |
I've verified PR #10196 on aarch64 machines, including a Pixel 8 and a MacBook M3.
Hi @slaren, the models we currently use generally go through the GEMM function the majority of the time, and we see good gains there. Could you suggest another model that you might be using at your end so that we can replicate this performance behaviour with the GEMV function and work on it? Thanks
Force-pushed from a0a4646 to 871036d.
@slaren I've pushed a new commit that checks the tensor dimensions. I've also changed the build option name to GGML_RUNTIME_REPACK to make it a bit more precise. There are several options to speed up the loading time. One approach could be optimizing the conversion, as you suggested. Another option could be caching the reshaped weights, which would reduce the loading time for subsequent runs.
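For the conversion-speed point, one possible shape for a threaded repack, as a standalone C sketch assuming the row count is a multiple of the interleave size; repack_group is a hypothetical stand-in for the per-group interleaving work.

/* Sketch: split the repack across threads in chunks of whole interleave
 * groups so no group is shared between threads. Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS  4
#define INTERLEAVE 4

static void repack_group(long first_row) {
    (void) first_row;   /* the real per-group interleaving work would go here */
}

struct job { long begin, end; };   /* row range, aligned to INTERLEAVE */

static void *worker(void *arg) {
    const struct job *j = (const struct job *) arg;
    for (long r = j->begin; r < j->end; r += INTERLEAVE) {
        repack_group(r);
    }
    return NULL;
}

int main(void) {
    const long n_rows = 4096;                  /* assumed multiple of INTERLEAVE */
    const long groups = n_rows / INTERLEAVE;
    pthread_t  tid[N_THREADS];
    struct job jobs[N_THREADS];

    for (int t = 0; t < N_THREADS; t++) {
        jobs[t].begin = (groups *  t      / N_THREADS) * INTERLEAVE;
        jobs[t].end   = (groups * (t + 1) / N_THREADS) * INTERLEAVE;
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(tid[t], NULL);
    }
    printf("repacked %ld rows with %d threads\n", n_rows, N_THREADS);
    return 0;
}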
@Srihari-mcw I don't think it is specific to one model; I generally see lower generation performance with Q4_0_8_8 compared to Q4_0 on my machine. My CPU is an Intel 13900k. I tested with llama 7B, llama 3.1 8B, and tinyllama 1B.
I'm working on a patch to address this. The load-time for phi2 on a Pixel 8 went down from 9400ms to 1500ms. |
ggml/src/ggml-backend.cpp
/* .init_tensor   = */ ggml_backend_cpu_aarch64_buffer_init_tensor,
/* .memset_tensor = */ ggml_backend_cpu_buffer_memset_tensor,
/* .set_tensor    = */ ggml_backend_cpu_aarch64_buffer_set_tensor,
/* .get_tensor    = */ NULL,
It is going to be necessary to implement the get_tensor function to return a copy of the tensor in the original Q4_0 format. This is because some backends such as CUDA and Vulkan will offload computation of large batches to the GPU, and to do that the tensor needs to be copied to VRAM in the Q4_0 format. But it may be better to disable this behavior entirely for repacked tensors by only doing so when the weight is stored in a host buffer, since that ensures that no conversions are required.
It is going to be necessary to implement the get_tensor function to return a copy of the tensor in the original Q4_0 format. This is because some backends such as CUDA and Vulkan will offload computation of large batches to the GPU, and to do that the tensor needs to be copied to VRAM in the Q4_0 format.
I understand the requirement to revert the repacked weight data back to the original Q4_0 format in the get_tensor function to ensure compatibility with CUDA and Vulkan backends. This way, when these backends offload computations for large batches to the GPU, the tensor will be correctly formatted for transfer to VRAM. I will proceed with implementing this conversion logic.
But it may be better to disable this behavior entirely for repacked tensors by only doing so when the weight is stored in a host buffer, since that ensures that no conversions are required.
I would appreciate some additional clarification on this point. If I understand correctly, are you suggesting that we disable support for CUDA and Vulkan for tensors that have been repacked? This would imply that repacked tensors should not be used with GPU backends to avoid the need for conversion. Please confirm if our interpretation is accurate or if you have a different approach in mind.
You are correct in your interpretation. My concern is that if we don't implement the logic to convert the tensors back to Q4_0 in the get_tensor function, then it will result in a performance loss when processing prompts with a GPU backend (even without layers offloaded, since the GPU is still used for processing prompts). However, in that case, the best solution may be to disable the weight repacking entirely, since repacking doesn't help with generation performance anyway, and prompt processing would be done on the GPU, so there is no reason to repack the weights at all. I think this will need to be handled in llama.cpp instead, with an exception added for this case.
Anyway, this is not an issue at the moment since it is only enabled for ARM currently. You don't need to do anything, I will solve the merge conflicts and merge this.
Force-pushed from 1875733 to 749a9e5.
@slaren thank you for your guidance and support in merging this PR. Your idea to handle the repacking logic more efficiently within llama.cpp makes a lot of sense, especially given the impact on GPU prompt processing. I'm grateful to you for taking care of the merge conflicts as well. Working on this project has been a great learning experience, thanks to your expertise and direction. Looking forward to contributing further as needed.
Added a CPU backend online flow allowing runtime requantization and repacking of Q4_0 weights to enable optimized GEMM and GEMV kernels. This feature can be enabled with the runtime option -rtrp (--runtime-repack).
Example of using the runtime option for a benchmark on Graviton 3:
$ ./llama-bench -m phi-2.Q4_0.gguf -t 4 -rtrp 1,0