Introduce IQ4_NL_4_4 format and its neon implementation #10196

Open · wants to merge 5 commits into master
Conversation

FanShupei (Contributor)

The PR is not well polished. I'd like to hear the community's feedback to complete it.

Motivation: Q4_0_X_X is very fast, but Q4_0's accuracy is not good. IQ4_NL is much more accurate than Q4_0, and the two have compatible block structures. Therefore, I introduce IQ4_NL_X_X to get the benefits of both.
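
For context, the difference between the two base formats in a minimal sketch (not the PR's code): both store 32 weights per block with one scale, but Q4_0 maps the 4-bit quant back linearly while IQ4_NL looks it up in a non-linear 16-entry codebook. The table values below are copied from ggml's kvalues_iq4nl as I remember them from ggml-common.h; verify against the current source.

```c
#include <stdint.h>

// Non-linear codebook used by IQ4_NL (values as in ggml-common.h; verify
// against the current source).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Q4_0: linear mapping, 16 evenly spaced levels around zero.
static inline float dequant_q4_0(uint8_t q, float d) {
    return ((int8_t)(q & 0x0F) - 8) * d;
}

// IQ4_NL: the same 4-bit index selects a non-linear level, denser near zero.
static inline float dequant_iq4_nl(uint8_t q, float d) {
    return kvalues_iq4nl[q & 0x0F] * d;
}
```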

PR Content: This PR may be reviewed commit by commit.

  • commit 1 & 2: rewrite q4_0_4_4_gemv/gemm NEON implementation by intrinsics. (They are technically not required for this PR)
  • commit 3: Introduce IQ4_NL_4_4 format with unoptimized gemv/gemm code.
  • commit 4: fast iq4_nl_4_4_gemv/gemm NEON implementation. The code is adopted from commit 1 & 2.
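
As a rough illustration of what the intrinsics rewrite looks like (a minimal sketch, not the PR's kernels; it assumes the dotprod extension and the usual Q4_0 nibble layout), the inner loop unpacks 4-bit quants to int8 and accumulates with vdotq_s32 before applying the block scale. The IQ4_NL variant would replace the subtract-by-8 step with a vqtbl1q_s8 lookup into kvalues_iq4nl.

```c
#include <arm_neon.h>
#include <stdint.h>

// Dot product of one 32-weight 4-bit block against 32 int8 activations.
// Requires -march=armv8.2-a+dotprod (or similar).
static float block_dot_q4_s8(const uint8_t *restrict qs,  // 16 bytes = 32 nibbles
                             const int8_t  *restrict y,   // 32 int8 activations
                             float d)                     // block scale
{
    const uint8x16_t q  = vld1q_u8(qs);
    const uint8x16_t m4 = vdupq_n_u8(0x0F);

    // Low and high nibbles -> signed int8 in [-8, 7] (Q4_0-style mapping).
    const int8x16_t q_lo = vsubq_s8(vreinterpretq_s8_u8(vandq_u8(q, m4)),  vdupq_n_s8(8));
    const int8x16_t q_hi = vsubq_s8(vreinterpretq_s8_u8(vshrq_n_u8(q, 4)), vdupq_n_s8(8));

    int32x4_t acc = vdupq_n_s32(0);
    acc = vdotq_s32(acc, q_lo, vld1q_s8(y));        // weights 0..15
    acc = vdotq_s32(acc, q_hi, vld1q_s8(y + 16));   // weights 16..31

    return d * (float)vaddvq_s32(acc);
}
```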

Additional comments:

  1. I only introduce IQ4_NL_4_4 as a PoC; a rough layout sketch follows this list. If requested, I could introduce IQ4_NL_4_8 and IQ4_NL_8_8 as well.
  2. The q4_0_4_4 gemm rewrite is not equivalent to the asm version in terms of code logic. Its speed is only about 90% of the asm version, tested on a Mac M2.
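
For reference, here is a rough sketch of what the "_4_4" interleaving means. The struct names and the exact interleave granularity are illustrative assumptions, not necessarily the PR's definitions: a plain block packs 32 weights of one row, while the x4 variant interleaves the same block position from 4 consecutive rows so a 4-wide gemv/gemm kernel can load them together.

```c
#include <stdint.h>

typedef uint16_t ggml_half;            // fp16 storage

typedef struct {
    ggml_half d;                       // per-block scale
    uint8_t   qs[16];                  // 32 x 4-bit indices into kvalues_iq4nl
} block_iq4_nl;                        // same size and shape as block_q4_0

typedef struct {
    ggml_half d[4];                    // scales of the 4 interleaved rows
    uint8_t   qs[64];                  // 4 x 16 bytes of quants, interleaved in small groups
} block_iq4_nlx4;                      // hypothetical IQ4_NL_4_4 block
```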

github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 6, 2024
@ExtReMLapin (Contributor)

Feel free to post benchmarks and perplexity results

@FanShupei (Contributor, Author)

For perplexity: since Q4_0_X_X is numerically equivalent to Q4_0, the comparison is effectively between Q4_0 and IQ4_NL. PR #5590, which introduced IQ4_NL, provides some comparisons. My personal everyday experience is that IQ4_NL is more stable than Q4_0 without an imatrix.

IQ4_NL_4_4 is slower than Q4_0_4_4, but still much faster than plain Q4_0/IQ4_NL. I will benchmark it on a Mac M2 and an Orange Pi (RK3588) tomorrow, and then post the results.

@FanShupei (Contributor, Author)

These are my results tested on a Mac M2:

Model: Llama 3.2 1B
Threads: 1

The first two rows are tested on the BLAS backend (compiled with -DGGML_METAL=OFF); the other rows are tested on the CPU backend (compiled with -DGGML_METAL=OFF -DGGML_BLAS=OFF).

| method | pp256 (t/s) | tg64 (t/s) |
| --- | --- | --- |
| Q4_0 (BLAS) | 272.61 ± 0.27 | 26.45 ± 0.00 |
| IQ4_NL (BLAS) | 215.83 ± 0.04 | 22.52 ± 0.02 |
| Q4_0 | 60.40 ± 0.54 | 26.58 ± 0.02 |
| IQ4_NL | 30.86 ± 0.02 | 22.53 ± 0.01 |
| Q4_0_4_4 (ASM) | 131.45 ± 0.24 | 40.74 ± 0.08 |
| Q4_0_4_4 (intrinsic) | 118.84 ± 0.04 | 40.83 ± 0.03 |
| IQ4_NL_4_4 (intrinsic) | 105.11 ± 0.65 | 34.85 ± 0.03 |

For raw results: see https://pastebin.com/z0jDzDLd

NOTE: I tested on a Mac just for convenience. The format is more useful on Arm devices without a fast GPU or BLAS backend, like the Raspberry Pi or Orange Pi.

@FanShupei (Contributor, Author)

These are my results tested on an Orange Pi (RK3588):

Model: Llama 3.2 1B
Threads: 1

| method | pp128 (t/s) | tg32 (t/s) |
| --- | --- | --- |
| Q4_0 | 12.11 ± 0.00 | 5.81 ± 0.02 |
| IQ4_NL | 8.68 ± 0.00 | 5.81 ± 0.02 |
| Q4_0_4_4 (ASM) | 36.46 ± 0.01 | 9.60 ± 0.01 |
| Q4_0_4_4 (intrinsic) | 30.57 ± 0.14 | 9.44 ± 0.01 |
| IQ4_NL_4_4 (intrinsic) | 26.56 ± 0.01 | 8.23 ± 0.14 |

For raw results: see https://pastebin.com/rDRJHWBx

@slaren (Collaborator) commented Nov 7, 2024

The performance improvement looks very good. We should probably avoid adding new file types because there are too many already, but it could be done via online conversion after #9921 is merged.

@ggerganov (Owner)

> I'd like to hear the community's feedback to complete it.

These changes are welcome. We should figure out what to do with the assembly code. I like the intrinsics implementation because it's easier to understand and maintain. The performance discrepancy is not massive (I think similar to what was mentioned in the original PR #5780).

> We should probably avoid adding new file types because there are too many already, but it could be done via online conversion after #9921 is merged.

Agree.

> If requested, I could introduce IQ4_NL_4_8 and IQ4_NL_8_8 as well.

I'm even interested in intrinsics-based code for Q4_0_4_8 and Q4_0_8_8 as I don't think I've yet seen how to utilize the SVE / MMINT8 instruction sets, apart from the obscure assembly code that we have atm.
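
For reference, here is a minimal illustration of the int8 matrix-multiply (i8mm / smmla) building block that such intrinsics-based code could use. This is not ggml code, just a sketch assuming the i8mm extension is enabled at compile time.

```c
#include <arm_neon.h>
#include <stdint.h>

// C (2x2, int32) += A (2x8, int8) * B^T (2x8, int8)
// Requires -march=armv8.2-a+i8mm (or later).
static int32x4_t mmla_2x2(int32x4_t c, const int8_t a[16], const int8_t b[16]) {
    // vmmlaq_s32 treats each int8x16_t operand as a row-major 2x8 matrix and
    // accumulates the 2x2 int32 product A * B^T into c.
    return vmmlaq_s32(c, vld1q_s8(a), vld1q_s8(b));
}
```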

@FanShupei (Contributor, Author)

> The performance improvement looks very good. We should probably avoid adding new file types because there are too many already, but it could be done via online conversion after #9921 is merged.

I agree too. I implemented it by adding a new format simply because the online conversion path was not mature yet; we can switch to it later.

My only concern with online conversion is whether it will increase memory usage, or increase peak memory usage during conversion (double at worst). If so, it may not be good for users who run on edge devices like the Raspberry Pi.

> We should figure out what to do with the assembly code. I like the intrinsics implementation because it's easier to understand and maintain.

It seems people are interested in the intrinsics implementation, so I'd like to explain it a bit more here. I wrote the intrinsic version by studying the asm version. Though it is not a perfect one-to-one rewrite, the code structure follows the original.

For GEMV, the code logic is roughly the same, with some minor code reordering to make it more readable. Since their performance is pretty close, I feel it's good enough.

For GEMM, the intrinsic version is a simplified version without loop unrolling; the asm performs unrolling by a factor of 4. The unrolling cannot be achieved by simply inserting a '#pragma'. My intrinsic version is roughly the same as the tail loop of the asm version.

I actually tried various unrolling methods to close the performance gap (~20% is acceptable but still large), including the same unrolling strategy as the asm version; a sketch of the general idea is below. Unfortunately, all my attempts failed: every unrolling method brought less than a 5% speedup. So I just posted the version without unrolling, since it's much cleaner. I'm still trying to figure out why the performance gap exists.
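
The general shape of the unrolling I tried looks roughly like this (illustrative only, shown on a plain int8 dot product rather than the actual quantized kernel; requires the dotprod extension): four independent accumulators so the dot-product instructions can overlap, plus the non-unrolled tail loop my posted version roughly corresponds to.

```c
#include <arm_neon.h>
#include <stdint.h>

static int32_t dot_s8_unrolled(const int8_t *a, const int8_t *b, int n /* multiple of 16 */) {
    int32x4_t acc0 = vdupq_n_s32(0), acc1 = vdupq_n_s32(0);
    int32x4_t acc2 = vdupq_n_s32(0), acc3 = vdupq_n_s32(0);
    int i = 0;
    for (; i + 64 <= n; i += 64) {            // unrolled by 4 x 16 int8 lanes
        acc0 = vdotq_s32(acc0, vld1q_s8(a + i +  0), vld1q_s8(b + i +  0));
        acc1 = vdotq_s32(acc1, vld1q_s8(a + i + 16), vld1q_s8(b + i + 16));
        acc2 = vdotq_s32(acc2, vld1q_s8(a + i + 32), vld1q_s8(b + i + 32));
        acc3 = vdotq_s32(acc3, vld1q_s8(a + i + 48), vld1q_s8(b + i + 48));
    }
    for (; i + 16 <= n; i += 16) {            // tail loop, one vector at a time
        acc0 = vdotq_s32(acc0, vld1q_s8(a + i), vld1q_s8(b + i));
    }
    return vaddvq_s32(vaddq_s32(vaddq_s32(acc0, acc1), vaddq_s32(acc2, acc3)));
}
```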

I don't understand why we have only had the asm version from day one. If the asm was generated by a compiler, could we contact the authors and ask whether they are willing to provide the source code? Decompiling asm by hand is not fun work.

@FanShupei (Contributor, Author)

I notice that we use ggml_cpu_has_neon to dispatch at runtime.

I worry this won't work as expected if we switch to intrinsics. If a feature is not enabled at compile time, the intrinsics won't compile. If it is enabled at compile time, the compiler may introduce SIMD instructions into the base implementation via auto-vectorization. This is why the CI fails, but I currently have no idea how to fix it.
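
To illustrate the problem (the function names other than ggml_cpu_has_neon are made up for this sketch): the intrinsics body has to be guarded at compile time, so the runtime check alone is not sufficient, and once the feature is enabled for the whole translation unit the "base" path may get auto-vectorized as well.

```c
int ggml_cpu_has_neon(void);   // ggml's runtime CPU-feature query

// Hypothetical kernels for illustration only.
void gemv_generic(int n, float *y, const void *x, const float *v);
#if defined(__ARM_NEON)
void gemv_neon_intrinsics(int n, float *y, const void *x, const float *v);
#endif

static void gemv_dispatch(int n, float *y, const void *x, const float *v) {
#if defined(__ARM_NEON)
    // Compiles only when NEON is enabled at build time; the runtime check
    // then decides whether this path actually executes.
    if (ggml_cpu_has_neon()) {
        gemv_neon_intrinsics(n, y, x, v);
        return;
    }
#endif
    // If NEON is enabled globally, the compiler may auto-vectorize this
    // base path too, blurring what the runtime dispatch is comparing.
    gemv_generic(n, y, x, v);
}
```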

@slaren (Collaborator) commented Nov 8, 2024

> My only concern with online conversion is whether it will increase memory usage, or increase peak memory usage during conversion (double at worst). If so, it may not be good for users who run on edge devices like the Raspberry Pi.

Tensors are converted one at a time, so the peak memory usage would only increase by the size of the largest tensor (at least with mmap disabled). Even then, other buffers allocated later like the KV cache are likely to be bigger than this, so overall it should not result in higher memory requirements.

@FanShupei (Contributor, Author)

Since #9921 is merged, I'll try to support iq4_nl_4_4 via online conversion. This PR is effectively abandoned, but I'd like to keep it open until my new PR is ready.
