[Perf] [CPU] eliminate redundant memory access in group query attention #13319
Conversation
Signed-off-by: ZelinMa557 <[email protected]>
Just tested this out of curiosity: Qwen 3 degrades in quality (ignores …)
Hi, thanks for your reply! In theory it should not break compatibility with older models, but there might be small bugs in my implementation. In my test it works with Qwen 2.5 7B. Can you tell me which Qwen 3 model size you used for testing? I will test both Qwen 3 and Mistral to debug.
I've tested both the 8B and 4B models in Q6; both worked correctly without this PR. Mistral Small 2 is in Q5_K_L and also works correctly on main.
Thanks, I have reproduced the same problem. I will try to fix it.
Signed-off-by: ZelinMa557 <[email protected]>
I have fixed the bug. Are there any scripts to format the code locally? This PR currently fails the code lint.
Thank you! I've already deleted the Qwen models, unfortunately, but Mistral Small 2 generates text correctly now. I'll test it a bit more with other models, but so far it seems to be fixed. On an i7-8700 with Mistral Small 3 (the 24B one, Q4_K_M) I get 2.08 t/s with this PR vs. 1.97 t/s on current main.
Signed-off-by: ZelinMa557 <[email protected]>
Modern LLMs (Llama 3, Qwen 2.5, etc.) usually use grouped-query attention (GQA), which significantly reduces the memory usage of the KV cache. In GQA, the query rows of neighboring query heads share the KV rows of the same KV head, so the attention loops can be reordered to iterate over each KV head once and reuse its rows across the whole query group (see the sketch below), improving the spatial locality of memory accesses. The original implementation of the CPU flash-attention kernel did not take advantage of this, and this PR improves it.
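As a rough sketch of the idea (toy sizes, plain dot products, and illustrative names; this is not the actual ggml kernel code), hoisting the KV-head loop to the outside lets each K row be read once and reused by every query head in its group while it is still hot in cache:

```cpp
// Toy illustration of the GQA loop reorder (not the real ggml kernel).
// With GQA, n_head query heads share n_head_kv KV heads,
// i.e. n_head / n_head_kv query heads per KV head.
#include <cstdio>
#include <vector>

int main() {
    const int n_head    = 8;   // query heads
    const int n_head_kv = 2;   // KV heads (GQA)
    const int n_group   = n_head / n_head_kv;
    const int n_tokens  = 4;   // cached tokens
    const int head_dim  = 16;

    std::vector<float> q(n_head * head_dim, 1.0f);                // one query row per head
    std::vector<float> k(n_head_kv * n_tokens * head_dim, 0.5f);  // K cache
    std::vector<float> scores(n_head * n_tokens, 0.0f);

    // Reordered loops: the KV head is outermost, so each K row is loaded
    // once and reused by all query heads in its group while still cached.
    for (int kvh = 0; kvh < n_head_kv; ++kvh) {
        for (int t = 0; t < n_tokens; ++t) {
            const float * krow = &k[(kvh * n_tokens + t) * head_dim];
            for (int g = 0; g < n_group; ++g) {
                const int     qh   = kvh * n_group + g;
                const float * qrow = &q[qh * head_dim];
                float s = 0.0f;
                for (int d = 0; d < head_dim; ++d) s += qrow[d] * krow[d];
                scores[qh * n_tokens + t] = s;
            }
        }
    }

    printf("score[head 0][token 0] = %f\n", scores[0]);
    return 0;
}
```

With the query-head loop outermost, as in the original kernel, the same K/V rows are streamed from memory once per query head, i.e. n_group times each.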
This is my test command:
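(The exact command was not captured here; a representative llama-bench invocation, with a hypothetical model path and parameters, might look like this:)

```
# hypothetical model file and parameters
./llama-bench -m ./qwen2.5-7b-instruct-q4_k_m.gguf -p 512 -n 128 -t 8
```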
The master branch result:
With the optimization, the result is:
We can see a slight speed-up in prefill and a 25% speed-up in decode! This matches expectations: decode is typically memory-bandwidth-bound, so reducing redundant KV-cache reads helps it more than the compute-bound prefill phase.
Further work:
My test environment: