Introduce IQ4_NL_4_4 format and its neon implementation #10196
Conversation
Feel free to post benchmarks and perplexity results
For perplexity, since Q4_0_X_X is numerically equivalent to Q4_0, it is effectively a comparison between Q4_0 and IQ4_NL. PR #5590, which introduced IQ4_NL, provides some comparisons. My personal everyday experience is that IQ4_NL is more stable than Q4_0 without an imatrix. IQ4_NL_4_4 is slower than Q4_0_4_4, but still much faster than Q4_0/IQ4_NL. I will benchmark it on a Mac M2 and an OrangePi (RK3588) tomorrow, and then post the results.
These are my results tested on a Mac M2. Model: LLAMA 3.2 1B. The first two are tested on the BLAS backend (compiled with …).
For raw results, see https://pastebin.com/z0jDzDLd. NOTE: I tested it on a Mac just for convenience. The format is more useful on Arm devices without a fast GPU or BLAS backend, like a Raspberry Pi or OrangePi.
These are my results tested on an OrangePi. Model: LLAMA 3.2 1B.
For raw results, see https://pastebin.com/rDRJHWBx.
The performance improvement looks very good. We should probably avoid adding new file types because there are too many already, but it could be done via online conversion after #9921 is merged.
These changes are welcome. We should figure out what to do with the assembly code. I like the intrinsics implementation because it's easier to understand and maintain. The performance discrepancy is not massive (I think similar to what was mentioned in the original PR #5780).
Agree.
I'm even interested in intrinsics-based code for Q4_0_4_8 and Q4_0_8_8 as I don't think I've yet seen how to utilize the SVE / MMINT8 instruction sets, apart from the obscure assembly code that we have atm.
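To illustrate what such an intrinsics path could look like, here is a rough sketch of a 2x2 int8 micro-kernel built on the i8mm SMMLA instruction via the vmmlaq_s32 intrinsic. This is not code from this PR or from ggml; the function name, tile size, and the assumption that the tiles are already repacked contiguously are all illustrative.

```c
// Minimal sketch (illustration only, not the actual Q4_0_4_8/Q4_0_8_8 kernels).
// Assumes a and b are already repacked so that each 2x8 int8 tile is contiguous,
// which is the point of the repacked layouts. Build with e.g. -march=armv8.6-a+i8mm.
#include <arm_neon.h>
#include <stdint.h>

// c (2x2, row-major int32) += A(2xK) * B(2xK)^T, with K a multiple of 8.
static void gemm_2x2_i8mm(int32_t c[4], const int8_t *a, const int8_t *b, int k) {
    int32x4_t acc = vld1q_s32(c);              // {c00, c01, c10, c11}
    for (int i = 0; i < k; i += 8) {
        int8x16_t ta = vld1q_s8(a + 2*i);      // one packed 2x8 tile of A
        int8x16_t tb = vld1q_s8(b + 2*i);      // one packed 2x8 tile of B
        // Accumulate the 2x2 block of dot products between the two rows of
        // ta and the two rows of tb (i.e. acc += ta * tb^T).
        acc = vmmlaq_s32(acc, ta, tb);
    }
    vst1q_s32(c, acc);
}
```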
Agreed, too. I implemented it as a new format simply because online conversion is not mature yet; we can switch to it later. My only concern with online conversion is whether it will increase memory usage, or increase peak memory usage during conversion (double at worst). If so, it may not be good for users who run on edge devices like a Raspberry Pi.
It seems people are interested in the intrinsics implementation, so I'd like to explain it a bit more here. I wrote the intrinsic version by studying the asm version. It is not a perfect one-to-one rewrite, but the code structure follows it. For GEMV, the logic is roughly the same, with some minor reordering to make it more readable. Since their performance is pretty close, I feel it's good enough. For GEMM, the intrinsic version is a simplified version without loop unrolling; the asm unrolls by a factor of 4, and that unrolling cannot be reproduced by simply inserting a '#pragma'. My intrinsic version is roughly the same as the tail loop of the asm version. I tried various ways of unrolling to close the performance gap (~20%, which is acceptable but still large), including the same unrolling strategy as the asm version. Unfortunately all my attempts failed: every unrolling method brought less than 5% speed-up, so I posted the version without unrolling since it's much cleaner. I'm still trying to work out why the performance gap exists. I also don't understand why we have only had the asm version from day one. Maybe we could ask the authors whether they are willing to provide the source code, if the asm was generated by a compiler? Decompiling asm by hand is not fun work.
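Roughly, the kind of unrolling I mean looks like the sketch below. It is an illustration only: a plain int8 dot product with hypothetical names, not the actual IQ4_NL_4_4 kernel. The idea is four independent accumulators so the vdotq_s32 chains do not serialize on a single register.

```c
// Illustrative unrolling sketch (not the PR's kernel). Requires the dotprod
// extension, e.g. -march=armv8.2-a+dotprod.
#include <arm_neon.h>
#include <stdint.h>

// Dot product of two int8 vectors of length n, n a multiple of 64.
static int32_t dot_i8_unroll4(const int8_t *x, const int8_t *y, int n) {
    int32x4_t acc0 = vdupq_n_s32(0), acc1 = vdupq_n_s32(0);
    int32x4_t acc2 = vdupq_n_s32(0), acc3 = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 64) {
        // Four independent accumulation chains per iteration.
        acc0 = vdotq_s32(acc0, vld1q_s8(x + i +  0), vld1q_s8(y + i +  0));
        acc1 = vdotq_s32(acc1, vld1q_s8(x + i + 16), vld1q_s8(y + i + 16));
        acc2 = vdotq_s32(acc2, vld1q_s8(x + i + 32), vld1q_s8(y + i + 32));
        acc3 = vdotq_s32(acc3, vld1q_s8(x + i + 48), vld1q_s8(y + i + 48));
    }
    // Horizontal reduction of the four accumulators.
    return vaddvq_s32(vaddq_s32(vaddq_s32(acc0, acc1), vaddq_s32(acc2, acc3)));
}
```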
I notice that we use […]; I worry it won't work as expected if we switch to intrinsics. If the features are not enabled at compile time, the intrinsics won't compile; if they are enabled at compile time, the compiler may introduce SIMD instructions in the base implementation through auto-vectorization. This is why the CI fails, but I currently have no idea how to fix it.
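For context, the pattern in question is roughly the following (hypothetical function names, not the actual ggml code): the intrinsic path only exists when the feature macro is defined, but once the whole file is built with the feature enabled, the "generic" fallback may be auto-vectorized with the same instructions, which undermines runtime CPU-feature dispatch.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
// Intrinsic path: only compiles when the dotprod feature macro is defined.
// n is assumed to be a multiple of 16.
static int32_t dot_i8_dotprod(const int8_t *x, const int8_t *y, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        acc = vdotq_s32(acc, vld1q_s8(x + i), vld1q_s8(y + i));
    }
    return vaddvq_s32(acc);
}
#endif

// "Base" path intended for CPUs without dotprod. If this translation unit is
// compiled with -march=...+dotprod so the code above builds, the compiler is
// also allowed to auto-vectorize this loop with sdot, so a runtime
// CPU-feature check no longer guarantees the fallback is safe.
static int32_t dot_i8_scalar(const int8_t *x, const int8_t *y, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += (int32_t) x[i] * y[i];
    }
    return sum;
}
```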
Tensors are converted one at a time, so the peak memory usage would only increase by the size of the largest tensor (at least with mmap disabled). Even then, other buffers allocated later like the KV cache are likely to be bigger than this, so overall it should not result in higher memory requirements. |
Since #9921 is merged, I'll try to support iq4_nl_4_4 via online conversion. This PR is effectively abandoned, but I'd like to keep it open until my new PR is ready.
The PR is not well polished yet. I'd like to hear the community's feedback to complete it.
Motivation: Q4_0_X_X is very fast, but the accuracy of Q4_0 is not good. IQ4_NL is much more accurate than Q4_0, and the two have a compatible block structure. Therefore, I introduce IQ4_NL_X_X to get the benefits of both.
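To make the "compatible structure" point concrete: both Q4_0 and IQ4_NL pack a block of 32 weights as one fp16 scale plus 16 bytes of 4-bit codes; they differ only in how a code is turned back into a value. A minimal sketch (the codebook values below are placeholders, not ggml's actual kvalues_iq4nl table):

```c
#include <stdint.h>

// Placeholder codebook -- NOT the real kvalues_iq4nl table from ggml, just
// 16 non-uniform levels for illustration.
static const int8_t iq4nl_codebook_example[16] = {
    -127, -100, -80, -60, -45, -32, -20, -9, 1, 12, 24, 37, 52, 68, 88, 113
};

// Q4_0: a 4-bit code maps to a point on a uniform grid around zero.
static float dequant_q4_0_code(float scale, uint8_t code) {
    return scale * ((int8_t) code - 8);
}

// IQ4_NL: the same 4-bit code indexes a non-uniform codebook instead, which
// places more levels where the weights actually concentrate.
static float dequant_iq4_nl_code(float scale, uint8_t code) {
    return scale * iq4nl_codebook_example[code];
}
```

Because the block layout is identical, the same 4-wide repacking used for Q4_0_4_4 can be applied to IQ4_NL blocks, which is what IQ4_NL_4_4 does.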
PR Content: This PR may be reviewed per commit.
Additional comments: