-
FA should work with all quants. There haven't been any major updates to the Metal and CPU kernels recently. I'm not sure how relevant the information in the Feature Matrix is today.
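If you want to verify that on your setup, flash attention is just a runtime flag. A minimal sketch (assuming a recent build where `-fa` is a boolean switch, and using the IQ4_XS file from your test; the prompt is just a placeholder):

```sh
# run the IQ4_XS quant with a 32K context and flash attention enabled
./llama-cli -m Qwen2.5-Coder-7B-Instruct.i1-IQ4_XS.gguf -c 32768 -fa \
  -p "Write a hello world program in Rust."
```

If it loads and generates normally, FA is being used with that quant.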
-
In a quant testing post the author linked the Feature Matrix wiki page.
It says that the I-quants are slow on Metal and CPU, but I've just done a bit of testing between
mradermacher/Qwen2.5-Coder-7B-Instruct-GGUF/Qwen2.5-Coder-7B-Instruct.Q4_K_S.gguf
and mradermacher/Qwen2.5-Coder-7B-Instruct-i1-GGUF/Qwen2.5-Coder-7B-Instruct.i1-IQ4_XS.gguf
and the speed difference seemed statistically insignificant. I used mmap and the full 32K context, but didn't use Flash Attention as I wasn't sure whether it's properly supported. Does anyone know if the slowness was fixed, or whether the difference was never significant in the first place?
The wiki doesn't link any sources, so I'm not really sure what to expect.
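For reference, the sort of side-by-side run I was doing (and would repeat with FA on) looks roughly like the sketch below. Paths are shortened to just the filenames, and I'm assuming llama-bench still accepts comma-separated values to sweep a parameter and 0/1 for `-fa`:

```sh
# compare both quants, with and without flash attention, 5 repetitions each
./llama-bench \
  -m Qwen2.5-Coder-7B-Instruct.Q4_K_S.gguf,Qwen2.5-Coder-7B-Instruct.i1-IQ4_XS.gguf \
  -p 512 -n 128 \
  -fa 0,1 \
  -r 5
```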