
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels #9921

Merged: 8 commits, Nov 15, 2024

Conversation

@chaxu01 (Contributor) commented Oct 17, 2024

Added a CPU backend online flow that allows runtime requantization and repacking of Q4_0 weights, enabling the optimized GEMM and GEMV kernels. The feature is enabled with the runtime option -rtrp (--runtime-repack).

Example of using the runtime option with llama-bench on Graviton 3:

$ ./llama-bench -m phi-2.Q4_0.gguf -t 4 -rtrp 1,0

| model | size | params | backend | threads | repack | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 1 | pp512 | 110.84 ± 0.01 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 1 | tg128 | 39.42 ± 0.02 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 0 | pp512 | 38.03 ± 0.01 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 0 | tg128 | 16.95 ± 0.01 |

The github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 17, 2024.
@slaren (Collaborator) commented Oct 17, 2024

I think this is a step in the right direction, but I am not convinced by the current implementation. Generally, changes in tensor layout are intended to be implemented through the ggml-backend buffer interface: it gives the application more control over which tensors will be changed, it allows changes to the tensor size, and the conversion is done at load time. The approach in this PR may cause some tensors to be unintentionally converted, such as a quantized KV cache. However, the llama.cpp model loader does not currently have a good way to support the buffer-interface approach; I am working on that at the moment.
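To illustrate the idea (this is a sketch, not code from the PR): a dedicated buffer type can perform the repacking inside its set_tensor callback, so the conversion happens as the weights are loaded and only for the tensors the application placed in that buffer. The helper repack_q4_0_rows below is hypothetical; the callback signature follows the usual ggml-backend buffer interface.

static void ggml_backend_cpu_repack_buffer_set_tensor(ggml_backend_buffer_t buffer,
        struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
    // sketch only: repack Q4_0 data into the interleaved layout as it is written,
    // so the rest of the application never deals with the converted format directly
    GGML_ASSERT(offset == 0 && size == ggml_nbytes(tensor)); // whole-tensor writes only
    if (tensor->type == GGML_TYPE_Q4_0) {
        repack_q4_0_rows(tensor, data, size); // hypothetical repacking helper
    } else {
        memcpy(tensor->data, data, size);
    }
    GGML_UNUSED(buffer);
}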

Note that there are AVX implementations for Q4_0_8_8 gemm, so with a small change this can also benefit x86 processors (tested on 13900k):

| model | size | params | backend | threads | repack | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 0 | pp512 | 50.56 ± 0.47 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 0 | tg128 | 20.79 ± 0.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 1 | pp512 | 64.79 ± 0.32 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 1 | tg128 | 15.80 ± 0.05 |
patch
diff --git a/ggml/src/ggml-aarch64.c b/ggml/src/ggml-aarch64.c
index 700e66a0..4060d78e 100644
--- a/ggml/src/ggml-aarch64.c
+++ b/ggml/src/ggml-aarch64.c
@@ -3305,4 +3305,12 @@ void ggml_prepare_optimal_kernel(struct ggml_tensor *cur, uint8_t **pmem, size_t
         }
     }
 #endif
+
+#if defined(__AVX2__) || defined(__AVX512F__)
+    if (cur->type == GGML_TYPE_Q4_0) {
+        if (repack_q4_0_to_q4_0_8_bl(cur, 8, pmem, psize) == 0) {
+            cur->type = GGML_TYPE_Q4_0_8_8;
+        }
+    }
+#endif
 }

@chaxu01 (Contributor, author) commented Oct 18, 2024

@slaren Thank you for the review and valuable feedback!

I understand the direction you're suggesting, particularly aligning with the ggml-backend buffer interface. Could you provide more details on this approach? Specifically, I'm curious how it would integrate, given that the mulmat tensor is currently constructed during the graph build, which occurs after the model loader has run.

I also wanted to ask about the timeline for the llama.cpp model loader improvements that would support this. If those changes aren’t expected to be completed soon, I suggest we merge the current PR with the necessary updates to ensure functionality in the short term. In parallel, we will start working on a more aligned implementation that integrates with the ggml-backend buffer interface.

Please let me know your thoughts on this.

@slaren (Collaborator) commented Oct 18, 2024

I am working on the llama model loader at the moment. One of the changes I will make is that it will be able to choose the buffer type used to offload each tensor, depending on the operations in which that tensor is used. This is mainly to prevent offloading tensors with types that are not supported by a backend, but it will also be useful for implementing this. It shouldn't take too long until this is merged.

I think that this approach is too error prone to merge as it is. There are at least two cases that I am aware of that will not work:

  • Models with shared token embedding and output tensors. For example, try gemma-2-2b-it quantized to Q4_0 with --pure.
  • KV quantization with -ctk q4_0

Both of these will crash with this PR. It may be possible to fix these issues specifically, but the fundamental problem is that modifying the tensors in the backend breaks the assumptions that applications make about the way ggml uses them. It would be a constant source of problems, and it would be hard for other ggml applications to take advantage of this.

In the meantime, llama.cpp users can already take advantage of the performance boost of these types by converting the model beforehand with llama-quantize.

@chaxu01 (Contributor, author) commented Oct 21, 2024

@slaren Thank you for your detailed feedback. I'll hold off on this PR and wait for your patch that allows the model loader to choose buffer types based on tensor operations. Once that is in place, I'll refactor my implementation accordingly.

@slaren (Collaborator) commented Nov 1, 2024

It should be possible now to implement this with a new buffer type as outlined in this comment. Please let me know if you find any issues.

@chaxu01 (Contributor, author) commented Nov 1, 2024

@slaren thank you for your guidance and for providing a clear direction on implementing the changes. I am currently working on refactoring the online flow patch to incorporate the new buffer type as outlined. I’ll make sure to reach out if I encounter any issues during the process.

@chaxu01 (Contributor, author) commented Nov 6, 2024

@slaren I've refactored the patch to add a new buffer type for runtime weight quantization on AARCH64 CPUs. This feature is enabled by the build option GGML_CPU_AARCH64.
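For reference, a typical way to enable a build option like this with CMake (the invocation is assumed, not taken from the PR):

$ cmake -B build -DGGML_CPU_AARCH64=ON
$ cmake --build build --config Release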

@slaren (Collaborator) commented Nov 6, 2024

Thanks. It looks like something went wrong in the last commit and thousands of files were added; I will review this after that is fixed.

@chaxu01 (Contributor, author) commented Nov 6, 2024

Hmm, I've tried resetting to HEAD~1 and adding the corresponding files, but it still doesn't seem right. I'm not sure what went wrong. Is it OK if I create a new PR for the patch and close this one?

@slaren (Collaborator) commented Nov 6, 2024

Up to you. I think the changes to ggml-cpu.c are missing.

@chaxu01 (Contributor, author) commented Nov 6, 2024

> I think the changes to ggml-cpu.c are missing

There are no changes to ggml-cpu.c from this patch.

@slaren (Collaborator) commented Nov 6, 2024

I see that instead of checking the buffer type of the tensor in the matrix multiplication, it is changing the type of the tensor. This should not be done; it will lead to other problems. Instead, check the buffer type of the tensor in the matrix multiplication and add the necessary logic to pick the correct gemm function there.

@chaxu01 (Contributor, author) commented Nov 6, 2024

@slaren Thanks for the quick review. Could you elaborate further on the issue? From a logical perspective, the weight tensor type changes as part of the weight reshape process.

@slaren (Collaborator) commented Nov 6, 2024

The goal of using a buffer type to handle the conversion is to hide it from the user. As far as the user is concerned, the tensor type is Q4_0, and if they use the get_tensor function to read it back, they should get the tensor in the original Q4_0 format. If you change the tensor type, we are back to the same kind of problems as the original implementation.

You can use tensor->extra to store other details about the tensor (eg. the actual converted type) in the init_tensor function of the buffer interface.
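A minimal sketch of that suggestion, with made-up buffer-type and helper names (only tensor->extra and the init_tensor callback come from the real interface):

// sketch: record the converted layout when the tensor is placed in the buffer,
// while tensor->type itself stays GGML_TYPE_Q4_0 from the user's point of view
static void ggml_backend_cpu_repack_buffer_init_tensor(ggml_backend_buffer_t buffer,
        struct ggml_tensor * tensor) {
    if (tensor->type == GGML_TYPE_Q4_0) {
        // a real implementation would more likely store a pointer to a small
        // descriptor struct; an integer cast keeps the sketch short
        tensor->extra = (void *)(intptr_t) GGML_TYPE_Q4_0_8_8;
    }
    GGML_UNUSED(buffer);
}

// later, in the CPU matmul path, the kernel is picked from the stored layout:
//     if (src0->extra != NULL) {
//         enum ggml_type packed = (enum ggml_type)(intptr_t) src0->extra;
//         // dispatch to the gemm/gemv implementation matching `packed`
//     }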

@chaxu01 (Contributor, author) commented Nov 7, 2024

@slaren thank you for the detailed explanation. I understand the importance of keeping the tensor type consistent to avoid the issues faced with the original implementation. I've updated the code to retain the tensor type as Q4_0 and instead use the buffer type to select the appropriate GEMM/GEMV functions. Please let me know if there are any other adjustments needed.

@slaren (Collaborator) commented Nov 7, 2024

I have pushed some fixes; please check that everything still works as expected on aarch64 machines.

Question: what happens when the number of rows in the tensors is not a multiple of the interleave size? It looks to me like this would cause a buffer overflow. If so, that needs to be fixed so that the conversion is not used when the number of rows is not compatible.

I also found that the conversion is quite slow, particularly to q4_0_4x8. I imagine that this could be improved significantly by improving the memory access pattern and using multiple threads.

@Srihari-mcw tagging you since you wrote the AVX implementation of Q4_0_8_8. Compared to Q4_0, on my machine I found that it improves gemm performance significantly, but gemv performance is much slower. If that were fixed, this could be enabled for x86 machines as well.

@slaren (Collaborator) commented Nov 7, 2024

On a side note, we need a better name for these types than "aarch64", because at this point there are implementations for many other systems. Maybe "interleaved types".

@chaxu01 (Contributor, author) commented Nov 8, 2024

@slaren thank you for the review. Regarding your question: yes, you are correct. If the number of rows in the tensors is not a multiple of the interleave size, it would indeed cause issues, potentially leading to a buffer overflow.

I will verify that the patch still works as expected with the latest trunk and your fixes. Additionally, I will prepare a new commit to address the issues you raised, including ensuring proper handling for cases where the number of rows is incompatible.
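For example, a guard along these lines could skip the conversion for incompatible shapes (a sketch; rows_interleaved is whatever the chosen layout uses, e.g. 4 or 8, and additional per-kernel constraints may apply):

// sketch: only repack when the weight shape fits the interleaved layout,
// otherwise fall back to the regular Q4_0 path
static bool can_repack_q4_0(const struct ggml_tensor * t, int64_t rows_interleaved) {
    return t->type == GGML_TYPE_Q4_0 &&
           ggml_n_dims(t) == 2 &&               // weights are 2D matrices
           t->ne[1] % rows_interleaved == 0;    // full groups of interleaved rows
}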

@chaxu01 (Contributor, author) commented Nov 8, 2024

> On a side note, we need a better name for these types than "aarch64", because at this point there are implementations for many other systems. Maybe "interleaved types".

Were you referring to renaming the new buffer type and the related functions? For instance, would that mean updating the following function:

static const char * ggml_backend_cpu_aarch64_buffer_type_get_name(ggml_backend_buffer_type_t buft) {
    return "CPU_AARCH64";
    GGML_UNUSED(buft);
}

to something like this:

static const char * ggml_backend_cpu_q4_0_buffer_type_get_name(ggml_backend_buffer_type_t buft) {
    return "INTERLEAVED_Q4_0";
    GGML_UNUSED(buft);
}

@slaren (Collaborator) commented Nov 8, 2024

Yes, something along these lines, but more generic. Rename ggml-aarch64.c/h to ggml-interleaved.c/h, the buffer type to ggml_backend_cpu_interleaved_buffer_type_get_name, etc. In the future it can be extended to support more types than Q4_0. It doesn't have to be done now.

@chaxu01 (Contributor, author) commented Nov 8, 2024

> I have pushed some fixes; please check that everything still works as expected on aarch64 machines.

I've verified PR #10196 on aarch64 machines, including a Pixel 8 and a MacBook M3.

@Srihari-mcw (Contributor) commented Nov 8, 2024

> @Srihari-mcw tagging you since you wrote the AVX implementation of Q4_0_8_8. Compared to Q4_0, on my machine I found that it improves gemm performance significantly, but gemv performance is much slower. If that were fixed, this could be enabled for x86 machines as well.

Hi @slaren, the models we currently use mostly go through the GEMM function, and there we see good gains. Could you suggest a model that you are using at your end, so we can replicate this performance behaviour with the GEMV function and work on it? Thanks.

@chaxu01 (Contributor, author) commented Nov 8, 2024

@slaren I've pushed a new commit that checks the tensor dimensions. I've also changed the build option name to GGML_RUNTIME_REPACK to make it a bit more precise. There are several options to speed up the loading time: one approach is optimizing the conversion, as you suggested; another is caching the reshaped weights, which would reduce the loading time for subsequent runs.

@slaren (Collaborator) commented Nov 8, 2024

@chaxu01 the changes that you have pushed have removed the changes that I had made in a0a4646. Please try to rebase your changes on top of that commit; it is important to keep those changes.

@slaren (Collaborator) commented Nov 8, 2024

@Srihari-mcw I don't think it is specific to one model; I generally see lower generation performance with Q4_0_8_8 compared to Q4_0 on my machine. My CPU is an Intel 13900k. Here are some results with llama 7B, llama 3.1 8B, and tinyllama 1B:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp128 | 43.14 ± 0.05 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg32 | 20.47 ± 0.08 |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 16 | pp128 | 58.48 ± 0.08 |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 16 | tg32 | 17.14 ± 0.06 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 16 | pp128 | 40.49 ± 0.15 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 16 | tg32 | 17.94 ± 0.03 |
| llama 8B Q4_0_8_8 | 4.33 GiB | 8.03 B | CPU | 16 | pp128 | 54.59 ± 0.04 |
| llama 8B Q4_0_8_8 | 4.33 GiB | 8.03 B | CPU | 16 | tg32 | 15.20 ± 0.06 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | CPU | 16 | pp256 | 272.45 ± 0.60 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | CPU | 16 | tg64 | 98.05 ± 0.70 |
| llama 1B Q4_0_8_8 | 606.53 MiB | 1.10 B | CPU | 16 | pp256 | 359.69 ± 0.50 |
| llama 1B Q4_0_8_8 | 606.53 MiB | 1.10 B | CPU | 16 | tg64 | 96.62 ± 0.60 |

@chaxu01 (Contributor, author) commented Nov 10, 2024

@slaren thank you for pointing that out. I have rebased my changes on top of commit a0a4646 to include the fixes you made. Please let me know if anything else is required.

@eddnjjn (Contributor) commented Nov 12, 2024

> I also found that the conversion is quite slow, particularly to q4_0_4x8. I imagine that this could be improved significantly by improving the memory access pattern and using multiple threads.

I'm working on a patch to address this. The load time for phi2 on a Pixel 8 went down from 9400 ms to 1500 ms.

/* .init_tensor = */ ggml_backend_cpu_aarch64_buffer_init_tensor,
/* .memset_tensor = */ ggml_backend_cpu_buffer_memset_tensor,
/* .set_tensor = */ ggml_backend_cpu_aarch64_buffer_set_tensor,
/* .get_tensor = */ NULL,
@slaren (Collaborator) commented Nov 14, 2024
It is going to be necessary to implement the get_tensor function to return a copy of the tensor in the original Q4_0 format. This is because some backends such as CUDA and Vulkan will offload computation of large batches to the GPU, and to do that the tensor needs to be copied to VRAM in the Q4_0 format. But it may be better to disable this behavior entirely for repacked tensors by only offloading when the weight is stored in a host buffer, since that ensures that no conversions are required.
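For context, a rough sketch of what such a get_tensor could look like; unpack_q4_0_interleaved is a hypothetical helper that reverses the repacking, and the signature follows the buffer interface shown above:

// sketch: hand data back in the original Q4_0 layout even though the buffer
// internally stores the repacked/interleaved representation
static void ggml_backend_cpu_aarch64_buffer_get_tensor(ggml_backend_buffer_t buffer,
        const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
    if (tensor->type == GGML_TYPE_Q4_0 && tensor->extra != NULL) {
        unpack_q4_0_interleaved(tensor, data, offset, size); // hypothetical de-interleave helper
    } else {
        memcpy(data, (const char *) tensor->data + offset, size);
    }
    GGML_UNUSED(buffer);
}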

@chaxu01 (Contributor, author) commented Nov 14, 2024

> It is going to be necessary to implement the get_tensor function to return a copy of the tensor in the original Q4_0 format. This is because some backends such as CUDA and Vulkan will offload computation of large batches to the GPU, and to do that the tensor needs to be copied to VRAM in the Q4_0 format.

I understand the requirement to revert the repacked weight data back to the original Q4_0 format in the get_tensor function to ensure compatibility with CUDA and Vulkan backends. This way, when these backends offload computations for large batches to the GPU, the tensor will be correctly formatted for transfer to VRAM. I will proceed with implementing this conversion logic.

> But it may be better to disable this behavior entirely for repacked tensors by only offloading when the weight is stored in a host buffer, since that ensures that no conversions are required.

I would appreciate some additional clarification on this point. If I understand correctly, are you suggesting that we disable support for CUDA and Vulkan for tensors that have been repacked? This would imply that repacked tensors should not be used with GPU backends, to avoid the need for conversion. Please confirm whether my interpretation is accurate or if you have a different approach in mind.

@slaren (Collaborator)

You are correct in your interpretation. My concern is that if we don't implement the logic to convert the tensors back to Q4_0 in the get_tensor function, it will result in a performance loss when processing prompts with a GPU backend (even without layers offloaded, since the GPU is still used for processing prompts). However, in that case, the best solution may be to disable the weight repacking entirely: repacking doesn't help with generation performance anyway, and prompt processing would be done on the GPU, so there is no reason to repack the weights at all. I think this will need to be handled in llama.cpp instead, by adding an exception for this case.

Anyway, this is not an issue at the moment since it is only enabled for ARM currently. You don't need to do anything, I will solve the merge conflicts and merge this.

@slaren merged commit 1607a5e into ggerganov:master on Nov 15, 2024
53 checks passed
@chaxu01 (Contributor, author) commented Nov 15, 2024

@slaren thank you for your guidance and support in merging this PR. Your idea to handle the repacking logic more efficiently within llama.cpp makes a lot of sense, especially given the impact on GPU prompt processing. I'm grateful to you for taking care of the merge conflicts as well. Working on this project has been a great learning experience, thanks to your expertise and direction. Looking forward to contributing further as needed.
