backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels #9921
Conversation
Force-pushed from e7974bb to c9c1afb.
I think this is a step in the right direction, but I am not convinced about the current implementation. Generally, the way changes in tensor layout are intended to be implemented is through the ggml-backend buffer interface: it gives more control to the application over which tensors will be changed, it allows changes to the tensor size, and the conversion would be done at load time. Doing it this way may cause some tensors to be unintentionally converted, such as a quantized KV cache. However, the llama.cpp model loader does not currently have a good way to support this, but I am working on that at the moment. Note that there are AVX implementations for these types as well:
diff --git a/ggml/src/ggml-aarch64.c b/ggml/src/ggml-aarch64.c
index 700e66a0..4060d78e 100644
--- a/ggml/src/ggml-aarch64.c
+++ b/ggml/src/ggml-aarch64.c
@@ -3305,4 +3305,12 @@ void ggml_prepare_optimal_kernel(struct ggml_tensor *cur, uint8_t **pmem, size_t
}
}
#endif
+
+#if defined(__AVX2__) || defined(__AVX512F__)
+ if (cur->type == GGML_TYPE_Q4_0) {
+ if (repack_q4_0_to_q4_0_8_bl(cur, 8, pmem, psize) == 0) {
+ cur->type = GGML_TYPE_Q4_0_8_8;
+ }
+ }
+#endif
}
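For illustration, here is a minimal, self-contained C sketch of the load-time conversion idea described above. It does not use the real ggml-backend API; fake_tensor, fake_buffer_set_tensor, and repack_rows are hypothetical stand-ins. The point is only that the repack happens once, when a weight is stored into a dedicated buffer, so the application controls which tensors are affected (a quantized KV cache would simply use a regular buffer).

/* Sketch only: hypothetical stand-ins, not the ggml-backend API. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum fake_type { TYPE_Q4_0, TYPE_OTHER };

struct fake_tensor {
    enum fake_type type;
    int            n_rows;     /* number of rows in the weight matrix */
    unsigned char *data;       /* backing storage owned by the buffer */
};

/* Hypothetical repack: would interleave groups of `interleave` rows so the
 * optimized kernels can load them contiguously. Here it only copies. */
static void repack_rows(struct fake_tensor *t, const void *src, size_t size, int interleave) {
    (void) interleave;
    memcpy(t->data, src, size);
}

/* set_tensor-style hook of the dedicated buffer type: only Q4_0 weights with a
 * compatible row count are converted; everything else is stored unchanged. */
static void fake_buffer_set_tensor(struct fake_tensor *t, const void *src, size_t size) {
    const int interleave = 4;
    if (t->type == TYPE_Q4_0 && t->n_rows % interleave == 0) {
        repack_rows(t, src, size, interleave);
    } else {
        memcpy(t->data, src, size);
    }
}

int main(void) {
    unsigned char src[64] = {0};
    struct fake_tensor w = { TYPE_Q4_0, 8, malloc(sizeof(src)) };
    fake_buffer_set_tensor(&w, src, sizeof(src));   /* conversion happens here, at load time */
    printf("weight loaded, %d rows\n", w.n_rows);
    free(w.data);
    return 0;
}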
@slaren Thank you for the review and valuable feedback! I understand the direction you're suggesting, particularly with aligning with the ggml-backend buffer interface. Could you provide more details on this approach? Specifically, I’m curious how it would integrate, given that the mulmat tensor is currently constructed during the graph build, which occurs after the model loader. I also wanted to ask about the timeline for the llama.cpp model loader improvements that would support this. If those changes aren’t expected to be completed soon, I suggest we merge the current PR with the necessary updates to ensure functionality in the short term. In parallel, we will start working on a more aligned implementation that integrates with the ggml-backend buffer interface. Please let me know your thoughts on this.
I am working on the llama model loader at the moment. One of the changes that I will make is that it will be able to choose the buffer type used to offload each tensor depending on the operations in which it will be used. This is mainly to prevent offloading tensors with types that are not supported by a backend, but it will also be useful for implementing this. It shouldn't take too long until this is merged.

I think that this approach is too error prone to merge as it is. There are at least two cases that I am aware of that will not work, and both of them will crash with this PR. It may be possible to fix these issues specifically, but fundamentally the problem is that modifying the tensors in the backend breaks the assumptions that applications make about the way ggml uses the tensors. It would be a constant source of problems, and it would be hard for other ggml applications to take advantage of this. In the meantime, llama.cpp users can already take advantage of the performance boost of these types by converting the model beforehand.
@slaren Thank you for your detailed feedback. I'll hold off on this PR and wait for your patch that allows the model loader to choose buffer types based on tensor operations. Once that is in place, I'll refactor my implementation accordingly.
It should be possible now to implement this with a new buffer type as outlined in this comment. Please let me know if you find any issues.
@slaren thank you for your guidance and for providing a clear direction on implementing the changes. I am currently working on refactoring the online flow patch to incorporate the new buffer type as outlined. I’ll make sure to reach out if I encounter any issues during the process.
Force-pushed from c9c1afb to ba21c82.
@slaren I've refactored the patch to add a new buffer type for runtime weight quantization on AARCH64 CPUs. This feature is enabled by the build option GGML_CPU_AARCH64.
Thanks. It looks like something went wrong in the last commit and thousands of files were added; I will try to review this after it is fixed.
Force-pushed from ba21c82 to e44a529.
Hmm... I've tried resetting to HEAD~1 and adding the corresponding files, but it still doesn't seem to be right. Not sure what went wrong. Is it OK if I create a new PR for the patch and close this one?
Up to you. I think the changes to
There are no changes to ggml-cpu.c from this patch.
I see that instead of checking the buffer type of the tensor in the matrix multiplication, it is changing the type of the tensor. This should not be done; it will lead to other problems. Instead, check the buffer type of the tensor in the matrix multiplication and add the necessary logic to pick the correct gemm function there.
@slaren Thanks for the quick review. Could you elaborate further on the issue? From a logical perspective, the weight tensor type changes as part of the weight reshape process.
The goal of using a buffer type to handle the conversion is to hide it from the user. As far as the user is concerned, the tensor type is Q4_0, and if they access the tensor data through the ggml-backend interface they should see Q4_0 data. You can use the buffer type of the tensor in the matrix multiplication to select the appropriate GEMM/GEMV functions.
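As a rough, standalone C illustration of the dispatch described here (all names are hypothetical placeholders; the real code would inspect the tensor's buffer type rather than a boolean flag): the tensor type stays Q4_0, and only the kernel selection depends on where the weight data lives.

/* Sketch of buffer-type based kernel selection; all names are placeholders. */
#include <stdbool.h>
#include <stdio.h>

struct weight {
    bool in_repacked_buffer;   /* stand-in for checking the tensor's buffer type */
};

static void gemv_q4_0(void)     { puts("reference gemv"); }
static void gemm_q4_0(void)     { puts("reference gemm"); }
static void gemv_q4_0_4x4(void) { puts("repacked gemv");  }
static void gemm_q4_0_4x4(void) { puts("repacked gemm");  }

/* The tensor type remains Q4_0 from the user's point of view; only the
 * kernel choice depends on how the weight data is stored. */
static void mul_mat(const struct weight *w, int batch_size) {
    if (w->in_repacked_buffer) {
        if (batch_size == 1) gemv_q4_0_4x4(); else gemm_q4_0_4x4();
    } else {
        if (batch_size == 1) gemv_q4_0(); else gemm_q4_0();
    }
}

int main(void) {
    struct weight w = { true };
    mul_mat(&w, 1);    /* single token: repacked gemv path */
    mul_mat(&w, 32);   /* prompt batch: repacked gemm path */
    return 0;
}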
@slaren thank you for the detailed explanation. I understand the importance of keeping the tensor type consistent to avoid the issues faced with the original implementation. I've updated the code to retain the tensor type as Q4_0 and instead use the buffer type to select the appropriate GEMM/GEMV functions. Please let me know if there are any other adjustments needed.
I have pushed some fixes, please check that everything still works as expected on aarch64 machines.

Question: what happens when the number of rows in the tensors is not a multiple of the interleave size? It looks to me like this would cause a buffer overflow. If so, that needs to be fixed so that the conversion is not used when the number of rows is not compatible.

I also found that the conversion is quite slow, particularly to q4_0_4x8. I imagine this could be improved significantly by improving the memory access pattern and using multiple threads.

@Srihari-mcw tagging you since you wrote the AVX implementation of Q4_0_8_8. Compared to Q4_0, on my machine I found that it improves gemm performance significantly, but gemv performance is much slower. If that were fixed, this could be enabled for x86 machines as well.
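A possible shape for the row-count guard mentioned above, as a small standalone C sketch (illustrative parameters, not actual ggml fields): the conversion is simply skipped whenever the rows do not divide evenly by the interleave size.

/* Sketch of the dimension guard: repack only when the number of rows is a
 * multiple of the interleave size; otherwise keep the plain Q4_0 layout. */
#include <stdbool.h>
#include <stdio.h>

static bool can_repack(long n_rows, int interleave_size) {
    if (interleave_size <= 0) {
        return false;
    }
    /* A partial group of rows would make the kernels read/write past the end
     * of the interleaved buffer, so fall back to the reference path instead. */
    return n_rows % interleave_size == 0;
}

int main(void) {
    printf("%d\n", can_repack(4096, 4));  /* 1: rows divide evenly, safe to repack */
    printf("%d\n", can_repack(4097, 4));  /* 0: skip the conversion */
    return 0;
}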
On a side note, we need a better name for these types than "aarch64", because at this point there are implementations for many other systems. Maybe "interleaved types".
@slaren thank you for the review. Regarding your question: yes, you are correct. If the number of rows in the tensors is not a multiple of the interleave size, it would indeed cause issues, potentially leading to a buffer overflow. I will verify that the patch still works as expected with the latest trunk and your fixes. Additionally, I will prepare a new commit to address the issues you raised, including proper handling for cases where the number of rows is incompatible.
Were you referring to renaming the new buffer type and the related functions? For instance, would renaming the buffer type function along those lines be what you had in mind?
Yes, something along these lines, but more generic. Rename |
I've verified PR #10196 on aarch64 machines, including a Pixel 8 and a MacBook M3.
Hi @slaren, the models we currently use generally go through the GEMM function the majority of the time, and we see good gains there. Could you suggest another model that you might be using at your end so that we can replicate this performance behaviour with the GEMV function and work on it? Thanks
Force-pushed from a0a4646 to 871036d.
@slaren I've pushed a new commit that checks the tensor dimensions. I've also changed the build option name to GGML_RUNTIME_REPACK to make it a bit more precise. There are several options to speed up the loading time. One approach could be optimizing the conversion, as you suggested. Another option could be caching the reshaped weights, which would reduce the loading time for subsequent runs.
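For the conversion-speed point, one possible shape for a threaded repack, as a standalone C sketch assuming the row count is a multiple of the interleave size; repack_group is a hypothetical stand-in for the per-group interleaving work.

/* Sketch: split the repack across threads in chunks of whole interleave
 * groups so no group is shared between threads. Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS  4
#define INTERLEAVE 4

static void repack_group(long first_row) {
    (void) first_row;   /* the real per-group interleaving work would go here */
}

struct job { long begin, end; };   /* row range, aligned to INTERLEAVE */

static void *worker(void *arg) {
    const struct job *j = (const struct job *) arg;
    for (long r = j->begin; r < j->end; r += INTERLEAVE) {
        repack_group(r);
    }
    return NULL;
}

int main(void) {
    const long n_rows = 4096;                  /* assumed multiple of INTERLEAVE */
    const long groups = n_rows / INTERLEAVE;
    pthread_t  tid[N_THREADS];
    struct job jobs[N_THREADS];

    for (int t = 0; t < N_THREADS; t++) {
        jobs[t].begin = (groups *  t      / N_THREADS) * INTERLEAVE;
        jobs[t].end   = (groups * (t + 1) / N_THREADS) * INTERLEAVE;
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(tid[t], NULL);
    }
    printf("repacked %ld rows with %d threads\n", n_rows, N_THREADS);
    return 0;
}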
@Srihari-mcw I don't think it is specific to one model; I generally see lower generation performance with Q4_0_8_8 compared to Q4_0 on my machine. My CPU is an Intel 13900k. I tested with llama 7B, llama 3.1 8B, and tinyllama 1B.
I'm working on a patch to address this. The load-time for phi2 on a Pixel 8 went down from 9400ms to 1500ms. |
ggml/src/ggml-backend.cpp
/* .init_tensor   = */ ggml_backend_cpu_aarch64_buffer_init_tensor,
/* .memset_tensor = */ ggml_backend_cpu_buffer_memset_tensor,
/* .set_tensor    = */ ggml_backend_cpu_aarch64_buffer_set_tensor,
/* .get_tensor    = */ NULL,
It is going to be necessary to implement the get_tensor function to return a copy of the tensor in the original Q4_0 format. This is because some backends such as CUDA and Vulkan will offload computation of large batches to the GPU, and to do that the tensor needs to be copied to VRAM in the Q4_0 format. But it may be better to disable this behavior entirely for repacked tensors by only doing so when the weight is stored in a host buffer, since that ensures that no conversions are required.
It is going to be necessary to implement the get_tensor function to return a copy of the tensor in the original Q4_0 format. This is because some backends such as CUDA and Vulkan will offload computation of large batches to the GPU, and to do that the tensor needs to be copied to VRAM in the Q4_0 format.
I understand the requirement to revert the repacked weight data back to the original Q4_0 format in the get_tensor function to ensure compatibility with CUDA and Vulkan backends. This way, when these backends offload computations for large batches to the GPU, the tensor will be correctly formatted for transfer to VRAM. I will proceed with implementing this conversion logic.
But it may be better to disable this behavior entirely for repacked tensors by only doing so when the weight is stored in a host buffer, since that ensures that no conversions are required.
I would appreciate some additional clarification on this point. If I understand correctly, are you suggesting that we disable support for CUDA and Vulkan for tensors that have been repacked? This would imply that repacked tensors should not be used with GPU backends to avoid the need for conversion. Please confirm if our interpretation is accurate or if you have a different approach in mind.
You are correct in your interpretation. My concern is that if we don't implement the logic to convert the tensors back to Q4_0 in the get_tensor function, then it will result in a performance loss when processing prompts with a GPU backend (even without layers offloaded, since the GPU is still used for processing prompts). However, in that case, the best solution may be to disable the weight repacking entirely, since repacking doesn't help with generation performance anyway, and prompt processing would be done on the GPU, so there is no reason to repack the weights at all. I think this will need to be handled in llama.cpp instead, with an exception added for this case.
Anyway, this is not an issue at the moment since it is only enabled for ARM currently. You don't need to do anything, I will solve the merge conflicts and merge this.
Force-pushed from 1875733 to 749a9e5.
@slaren thank you for your guidance and support in merging this PR. Your idea to handle the repacking logic more efficiently within llama.cpp makes a lot of sense, especially given the impact on GPU prompt processing. I'm grateful to you for taking care of the merge conflicts as well. Working on this project has been a great learning experience, thanks to your expertise and direction. Looking forward to contributing further as needed.
Added a CPU backend online flow allowing runtime requantization and repacking of Q4_0 weights to enable optimized GEMM and GEMV kernels. This feature can be enabled with the runtime option -rtrp (--runtime-repack).
Example of using the runtime option for a benchmark on Graviton 3:
$ ./llama-bench -m phi-2.Q4_0.gguf -t 4 -rtrp 1,0