XQUANT - cache post-norm X, rematerialize K/V on decode #15400
FlorianZimmer started this conversation in Ideas
-
Could be implemented as a new memory module. Storing the quantized X would be tricky: how do we handle batches that are not a multiple of the group size? Doubt this would make it to …
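
One straightforward way to handle the group-size mismatch might be to zero-pad the tail up to a full group and remember the real count, so the padding never affects reads. This is only a hypothetical sketch, not an existing llama.cpp mechanism; `GROUP`, `PaddedBatch`, and `pad_to_group` are made-up names:

```cpp
#include <cstddef>
#include <vector>

constexpr size_t GROUP = 32; // assumed quantization group size

struct PaddedBatch {
    std::vector<float> data;   // length rounded up to a multiple of GROUP
    size_t             n_real; // number of valid elements before padding
};

// Round the batch up to a whole number of groups; the zero padding quantizes
// to zero and is ignored on the way out because only n_real elements are read back.
PaddedBatch pad_to_group(const std::vector<float> & x) {
    PaddedBatch out;
    out.n_real = x.size();
    out.data   = x;
    out.data.resize((x.size() + GROUP - 1) / GROUP * GROUP, 0.0f);
    return out;
}
```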
-
Is this not just a variant of multi-head latent attention without the latent compression?
-
I came across this paper a few days ago: https://arxiv.org/abs/2508.10395. It proposes an alternative to KV caching: quantize and cache the post-norm residual X per layer, then rebuild K and V on the fly with the existing projection GEMMs. Because you store one tensor instead of two, cache memory drops by roughly 2× for MHA. In practice decode is often memory-bandwidth bound (especially in common llama.cpp setups), so this can be a net win: the extra compute tends to stay off the critical path at longer contexts, and the reduced RAM/VRAM pressure can also enable longer context windows.
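
To make the mechanism concrete, here is a minimal, self-contained C++ sketch of the idea, not llama.cpp code: the group size, the `QuantGroup` layout, and the helper names are all assumptions on my part. The point is only that the cache holds one quantized X row per token per layer (seq × d_model, versus seq × d_model each for K and V in MHA, hence the ~2×), and that decode dequantizes the row and reuses the existing K/V projection matmuls:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int GROUP = 32; // assumed quantization group size

struct QuantGroup {
    int8_t q[GROUP]; // quantized values
    float  scale;    // per-group scale
};

// Quantize one row of X of length n (n assumed to be a multiple of GROUP).
std::vector<QuantGroup> quantize_row(const float * x, int n) {
    std::vector<QuantGroup> out(n / GROUP);
    for (int g = 0; g < n / GROUP; ++g) {
        float amax = 0.0f;
        for (int i = 0; i < GROUP; ++i) {
            amax = std::max(amax, std::fabs(x[g*GROUP + i]));
        }
        out[g].scale = amax / 127.0f;
        for (int i = 0; i < GROUP; ++i) {
            out[g].q[i] = out[g].scale > 0.0f
                ? (int8_t) std::lround(x[g*GROUP + i] / out[g].scale)
                : 0;
        }
    }
    return out;
}

// Dequantize a cached row back into x (length = qx.size() * GROUP).
void dequantize_row(const std::vector<QuantGroup> & qx, float * x) {
    for (size_t g = 0; g < qx.size(); ++g) {
        for (int i = 0; i < GROUP; ++i) {
            x[g*GROUP + i] = qx[g].q[i] * qx[g].scale;
        }
    }
}

// Rematerialize K (or V) for one cached token: y = W * x, written here as a
// naive GEMV; in practice this is the existing K/V projection GEMM applied to
// all cached rows. W is [d_out x d_model], row-major.
void project(const float * W, const float * x, float * y, int d_out, int d_model) {
    for (int o = 0; o < d_out; ++o) {
        float acc = 0.0f;
        for (int i = 0; i < d_model; ++i) {
            acc += W[o*d_model + i] * x[i];
        }
        y[o] = acc;
    }
}
```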
MVP proposal (default off):
Expectation: similar perplexity to strong KV-quant at equal memory, plus practical speedups on long contexts and PCIe-offload setups.
Would a small, reversible MVP like this be acceptable? Any preferences on the flag name or file placement before I open a draft PR with unit tests and some numbers? (Future extensions—XQUANT-CL and GQA latent SVD—are in the paper and could come as separate follow-ups.)
This would be my first time contributing to a project publicly, so feel free to give me feedback.