Introduce IQ4_NL_4_4 format and its neon implementation #10196
Conversation
Feel free to post benchmarks and perplexity results
For perplexity, since Q4_0_X_X is numerically equivalent to Q4_0, it is effectively a comparison between Q4_0 and IQ4_NL. PR #5590, which introduced IQ4_NL, provides some comparisons. My personal everyday experience is that IQ4_NL is more stable than Q4_0 without an imatrix. IQ4_NL_4_4 is slower than Q4_0_4_4, but still much faster than Q4_0/IQ4_NL. I will benchmark it on a Mac M2 and an OrangePi (RK3588) tomorrow, and then post the results.
These are my results tested on a Mac M2. Model: LLAMA 3.2 1B. The first two are tested on the BLAS backend (compiled with …).
For raw results, see https://pastebin.com/z0jDzDLd. NOTE: I tested it on a Mac just for convenience. The format is more useful on Arm devices without a fast GPU or BLAS backend, like a Raspberry Pi or OrangePi.
These are my results tested on an OrangePi. Model: LLAMA 3.2 1B.
For raw results, see https://pastebin.com/rDRJHWBx.
The performance improvement looks very good. We should probably avoid adding new file types because there are too many already, but it could be done via online conversion after #9921 is merged.
These changes are welcome. We should figure out what to do with the assembly code. I like the intrinsics implementation because it's easier to understand and maintain. The performance discrepancy is not massive (I think similar to what was mentioned in the original PR #5780).
Agree.
I'm even interested in intrinsics-based code for Q4_0_4_8 and Q4_0_8_8 as I don't think I've yet seen how to utilize the SVE / MMINT8 instruction sets, apart from the obscure assembly code that we have atm.
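To illustrate what such an intrinsics path could look like, here is a rough sketch of a 2x2 int8 micro-kernel built on the i8mm SMMLA instruction via the vmmlaq_s32 intrinsic. This is not code from this PR or from ggml; the function name, tile size, and the assumption that the tiles are already repacked contiguously are all illustrative.

```c
// Minimal sketch (illustration only, not the actual Q4_0_4_8/Q4_0_8_8 kernels).
// Assumes a and b are already repacked so that each 2x8 int8 tile is contiguous,
// which is the point of the repacked layouts. Build with e.g. -march=armv8.6-a+i8mm.
#include <arm_neon.h>
#include <stdint.h>

// c (2x2, row-major int32) += A(2xK) * B(2xK)^T, with K a multiple of 8.
static void gemm_2x2_i8mm(int32_t c[4], const int8_t *a, const int8_t *b, int k) {
    int32x4_t acc = vld1q_s32(c);              // {c00, c01, c10, c11}
    for (int i = 0; i < k; i += 8) {
        int8x16_t ta = vld1q_s8(a + 2*i);      // one packed 2x8 tile of A
        int8x16_t tb = vld1q_s8(b + 2*i);      // one packed 2x8 tile of B
        // Accumulate the 2x2 block of dot products between the two rows of
        // ta and the two rows of tb (i.e. acc += ta * tb^T).
        acc = vmmlaq_s32(acc, ta, tb);
    }
    vst1q_s32(c, acc);
}
```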
Agreed, too. I implemented it as a new format simply because online conversion is not mature yet; we can switch to it later. My only concern with online conversion is whether it will increase memory usage, or increase peak memory usage during conversion (double at worst). If so, it may not be good for users who run on edge devices like a Raspberry Pi.
It seems people are interested in the intrinsics implementation, so I'd like to explain it a bit more here. I wrote the intrinsic version by studying the asm version. It is not a perfect one-to-one rewrite, but the code structure follows it. For GEMV, the logic is roughly the same, with some minor reordering to make it more readable. Since their performance is pretty close, I feel it's good enough. For GEMM, the intrinsic version is a simplified version without loop unrolling; the asm unrolls by a factor of 4, and that unrolling cannot be reproduced by simply inserting a '#pragma'. My intrinsic version is roughly the same as the tail loop of the asm version. I tried various ways of unrolling to close the performance gap (~20%, which is acceptable but still large), including the same unrolling strategy as the asm version. Unfortunately all my attempts failed: every unrolling method brought less than 5% speed-up, so I posted the version without unrolling since it's much cleaner. I'm still trying to work out why the performance gap exists. I also don't understand why we have only had the asm version from day one. Maybe we could ask the authors whether they are willing to provide the source code, if the asm was generated by a compiler? Decompiling asm by hand is not fun work.
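Roughly, the kind of unrolling I mean looks like the sketch below. It is an illustration only: a plain int8 dot product with hypothetical names, not the actual IQ4_NL_4_4 kernel. The idea is four independent accumulators so the vdotq_s32 chains do not serialize on a single register.

```c
// Illustrative unrolling sketch (not the PR's kernel). Requires the dotprod
// extension, e.g. -march=armv8.2-a+dotprod.
#include <arm_neon.h>
#include <stdint.h>

// Dot product of two int8 vectors of length n, n a multiple of 64.
static int32_t dot_i8_unroll4(const int8_t *x, const int8_t *y, int n) {
    int32x4_t acc0 = vdupq_n_s32(0), acc1 = vdupq_n_s32(0);
    int32x4_t acc2 = vdupq_n_s32(0), acc3 = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 64) {
        // Four independent accumulation chains per iteration.
        acc0 = vdotq_s32(acc0, vld1q_s8(x + i +  0), vld1q_s8(y + i +  0));
        acc1 = vdotq_s32(acc1, vld1q_s8(x + i + 16), vld1q_s8(y + i + 16));
        acc2 = vdotq_s32(acc2, vld1q_s8(x + i + 32), vld1q_s8(y + i + 32));
        acc3 = vdotq_s32(acc3, vld1q_s8(x + i + 48), vld1q_s8(y + i + 48));
    }
    // Horizontal reduction of the four accumulators.
    return vaddvq_s32(vaddq_s32(vaddq_s32(acc0, acc1), vaddq_s32(acc2, acc3)));
}
```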
I notice that we use […]; I worry it won't work as expected if we switch to intrinsics. If the features are not enabled at compile time, the intrinsics won't compile; if they are enabled at compile time, the compiler may introduce SIMD instructions in the base implementation through auto-vectorization. This is why the CI fails, but I currently have no idea how to fix it.
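For context, the pattern in question is roughly the following (hypothetical function names, not the actual ggml code): the intrinsic path only exists when the feature macro is defined, but once the whole file is built with the feature enabled, the "generic" fallback may be auto-vectorized with the same instructions, which undermines runtime CPU-feature dispatch.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
// Intrinsic path: only compiles when the dotprod feature macro is defined.
// n is assumed to be a multiple of 16.
static int32_t dot_i8_dotprod(const int8_t *x, const int8_t *y, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        acc = vdotq_s32(acc, vld1q_s8(x + i), vld1q_s8(y + i));
    }
    return vaddvq_s32(acc);
}
#endif

// "Base" path intended for CPUs without dotprod. If this translation unit is
// compiled with -march=...+dotprod so the code above builds, the compiler is
// also allowed to auto-vectorize this loop with sdot, so a runtime
// CPU-feature check no longer guarantees the fallback is safe.
static int32_t dot_i8_scalar(const int8_t *x, const int8_t *y, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += (int32_t) x[i] * y[i];
    }
    return sum;
}
```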
Tensors are converted one at a time, so the peak memory usage would only increase by the size of the largest tensor (at least with mmap disabled). Even then, other buffers allocated later like the KV cache are likely to be bigger than this, so overall it should not result in higher memory requirements. |
Since #9921 is merged, I'll try to support iq4_nl_4_4 via online conversion. This PR is effectively abandoned, but I'd like to keep it open until my new PR is ready.
The PR is not well polished yet. I'd like to hear the community's feedback to complete it.
Motivation: Q4_0_X_X is very fast, but the accuracy of Q4_0 is not good. IQ4_NL is much more accurate than Q4_0, and the two have a compatible block structure. Therefore, I introduce IQ4_NL_X_X to get the benefits of both.
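To make the "compatible structure" point concrete: both Q4_0 and IQ4_NL pack a block of 32 weights as one fp16 scale plus 16 bytes of 4-bit codes; they differ only in how a code is turned back into a value. A minimal sketch (the codebook values below are placeholders, not ggml's actual kvalues_iq4nl table):

```c
#include <stdint.h>

// Placeholder codebook -- NOT the real kvalues_iq4nl table from ggml, just
// 16 non-uniform levels for illustration.
static const int8_t iq4nl_codebook_example[16] = {
    -127, -100, -80, -60, -45, -32, -20, -9, 1, 12, 24, 37, 52, 68, 88, 113
};

// Q4_0: a 4-bit code maps to a point on a uniform grid around zero.
static float dequant_q4_0_code(float scale, uint8_t code) {
    return scale * ((int8_t) code - 8);
}

// IQ4_NL: the same 4-bit code indexes a non-uniform codebook instead, which
// places more levels where the weights actually concentrate.
static float dequant_iq4_nl_code(float scale, uint8_t code) {
    return scale * iq4nl_codebook_example[code];
}
```

Because the block layout is identical, the same 4-wide repacking used for Q4_0_4_4 can be applied to IQ4_NL blocks, which is what IQ4_NL_4_4 does.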
PR Content: This PR may be reviewed per commit.
Additional comments: