XQUANT - cache post-norm X, rematerialize K/V on decode #15400
FlorianZimmer started this conversation in Ideas
-
Could be implemented as a new memory module. Storing the quantized X would be tricky: how do we handle batches that are not a multiple of the group size? Doubt this would make it to …
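
One straightforward way to handle the group-size mismatch might be to zero-pad the tail up to a full group and remember the real count, so the padding never affects reads. This is only a hypothetical sketch, not an existing llama.cpp mechanism; `GROUP`, `PaddedBatch`, and `pad_to_group` are made-up names:

```cpp
#include <cstddef>
#include <vector>

constexpr size_t GROUP = 32; // assumed quantization group size

struct PaddedBatch {
    std::vector<float> data;   // length rounded up to a multiple of GROUP
    size_t             n_real; // number of valid elements before padding
};

// Round the batch up to a whole number of groups; the zero padding quantizes
// to zero and is ignored on the way out because only n_real elements are read back.
PaddedBatch pad_to_group(const std::vector<float> & x) {
    PaddedBatch out;
    out.n_real = x.size();
    out.data   = x;
    out.data.resize((x.size() + GROUP - 1) / GROUP * GROUP, 0.0f);
    return out;
}
```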
-
Is this not just a variant of multi-head latent attention without the latent compression?
-
I came across this paper a few days ago: https://arxiv.org/abs/2508.10395. It proposes an alternative to KV caching: quantize and cache the post-norm residual X per layer, then rebuild K and V on the fly with the existing projection GEMMs. Because you store one tensor instead of two, cache memory drops by roughly 2× for MHA. In practice decode is often memory-bandwidth bound (especially in common llama.cpp setups), so this can be a net win: the extra compute tends to stay off the critical path at longer contexts, and the reduced RAM/VRAM pressure can also enable longer context windows.
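
To make the mechanism concrete, here is a minimal, self-contained C++ sketch of the idea, not llama.cpp code: the group size, the `QuantGroup` layout, and the helper names are all assumptions on my part. The point is only that the cache holds one quantized X row per token per layer (seq × d_model, versus seq × d_model each for K and V in MHA, hence the ~2×), and that decode dequantizes the row and reuses the existing K/V projection matmuls:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int GROUP = 32; // assumed quantization group size

struct QuantGroup {
    int8_t q[GROUP]; // quantized values
    float  scale;    // per-group scale
};

// Quantize one row of X of length n (n assumed to be a multiple of GROUP).
std::vector<QuantGroup> quantize_row(const float * x, int n) {
    std::vector<QuantGroup> out(n / GROUP);
    for (int g = 0; g < n / GROUP; ++g) {
        float amax = 0.0f;
        for (int i = 0; i < GROUP; ++i) {
            amax = std::max(amax, std::fabs(x[g*GROUP + i]));
        }
        out[g].scale = amax / 127.0f;
        for (int i = 0; i < GROUP; ++i) {
            out[g].q[i] = out[g].scale > 0.0f
                ? (int8_t) std::lround(x[g*GROUP + i] / out[g].scale)
                : 0;
        }
    }
    return out;
}

// Dequantize a cached row back into x (length = qx.size() * GROUP).
void dequantize_row(const std::vector<QuantGroup> & qx, float * x) {
    for (size_t g = 0; g < qx.size(); ++g) {
        for (int i = 0; i < GROUP; ++i) {
            x[g*GROUP + i] = qx[g].q[i] * qx[g].scale;
        }
    }
}

// Rematerialize K (or V) for one cached token: y = W * x, written here as a
// naive GEMV; in practice this is the existing K/V projection GEMM applied to
// all cached rows. W is [d_out x d_model], row-major.
void project(const float * W, const float * x, float * y, int d_out, int d_model) {
    for (int o = 0; o < d_out; ++o) {
        float acc = 0.0f;
        for (int i = 0; i < d_model; ++i) {
            acc += W[o*d_model + i] * x[i];
        }
        y[o] = acc;
    }
}
```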
MVP proposal (default off):
Expectation: similar perplexity to strong KV-quant at equal memory, plus practical speedups on long contexts and PCIe-offload setups.
Would a small, reversible MVP like this be acceptable? Any preferences on the flag name or file placement before I open a draft PR with unit tests and some numbers? (Future extensions—XQUANT-CL and GQA latent SVD—are in the paper and could come as separate follow-ups.)
This would be my first time contributing to a project publicly, so feel free to give me feedback.