fix(cache): compute codebooks on CPU at fp64 for MPS compatibility #5

Open
synapticode-ai wants to merge 1 commit into OnlyTerp:master from gamma-seeds:fix/lloyd-max-codebook-mps-fp64

@synapticode-ai

Summary

compute_lloyd_max_codebook and compute_online_codebook both hardcode dtype=torch.float64 for their optimization grids. The MPS framework doesn't support float64, so both functions fail with "TypeError: Cannot convert a MPS Tensor to float64 dtype" when callers pass device=torch.device('mps').

This PR fixes both functions by forcing the internal computation onto CPU at fp64 (preserving the algorithms' literature-standard precision for codebook centroid optimization), then moving the final centroids/boundaries to the caller's target device when constructing the returned Codebook dataclass.

Repro

import torch
from turboquant.cache import compute_lloyd_max_codebook  # via the working `from src.cache import ...`
compute_lloyd_max_codebook(d=64, b=4, device=torch.device('mps'))
# TypeError: Cannot convert a MPS Tensor to float64 dtype
#   at cache.py:284 in torch.linspace(..., device=device, dtype=torch.float64)
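
For reviewers without the repo checked out, the underlying limitation can be probed in isolation. The helper below is a hypothetical sketch (the name `float64_supported` is not part of this codebase); it only checks whether a float64 tensor can be allocated on a given device. The exception types it catches are an assumption about how an unsupported dtype or an unavailable backend is reported.

```python
import torch


def float64_supported(device: torch.device) -> bool:
    """Return True if a float64 tensor can be allocated on `device`."""
    try:
        torch.zeros(1, device=device, dtype=torch.float64)
        return True
    except (TypeError, RuntimeError):
        # Assumption: MPS reports the unsupported dtype as TypeError, and an
        # unavailable backend may surface as RuntimeError.
        return False


print("cpu:", float64_supported(torch.device("cpu")))

# Only probe MPS when this PyTorch build exposes the backend and it is available.
mps_backend = getattr(torch.backends, "mps", None)
if mps_backend is not None and mps_backend.is_available():
    print("mps:", float64_supported(torch.device("mps")))
```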

Fix architecture

The Codebook dataclass is already designed to be device-portable: its quantize/dequantize methods (lines 214–223) call .to(device=x.device, dtype=x.dtype) on the stored tensors at usage time. The fix sits at the natural device-firewall:

  • Build on CPU at fp64 (where fp64 is supported and Lloyd-Max convergence is numerically stable)
  • Store on caller's device (via .to(device) at Codebook construction)
  • Use on operand's device (via the existing .to(...) calls in quantize/dequantize)

This means _beta_pdf and _solve_lloyd_max need no edits — they inherit CPU automatically through tensor argument propagation once the entry-point functions force CPU on their grid construction.
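
The build/store/use split described above can be sketched roughly as follows. This is an illustrative stand-in, not the code in src/cache.py: `CodebookSketch`, `build_codebook_sketch`, the Gaussian placeholder density, the grid size, and the `dequantize` signature are all assumptions made for the example. The fp32 cast at hand-off is also an assumption, inferred from the note that the CPU path returns fp32 tensors.

```python
from dataclasses import dataclass

import torch


@dataclass
class CodebookSketch:
    """Simplified stand-in for the Codebook dataclass (fields are assumptions)."""

    centroids: torch.Tensor
    boundaries: torch.Tensor

    def quantize(self, x: torch.Tensor) -> torch.Tensor:
        # Use on the operand's device: migrate stored tensors at call time.
        bounds = self.boundaries.to(device=x.device, dtype=x.dtype)
        return torch.bucketize(x, bounds)

    def dequantize(self, idx: torch.Tensor, like: torch.Tensor) -> torch.Tensor:
        cents = self.centroids.to(device=like.device, dtype=like.dtype)
        return cents[idx]


def build_codebook_sketch(b: int, device: torch.device, iters: int = 50) -> CodebookSketch:
    """Hypothetical entry point: optimize on CPU at fp64, then move to `device`."""
    cpu = torch.device("cpu")
    # Build on CPU: float64 is available here even when `device` is MPS.
    grid = torch.linspace(-1.0, 1.0, 2001, device=cpu, dtype=torch.float64)
    pdf = torch.exp(-0.5 * (grid / 0.3) ** 2)  # placeholder density, not _beta_pdf
    centroids = torch.linspace(-0.9, 0.9, 2 ** b, device=cpu, dtype=torch.float64)

    for _ in range(iters):
        # Lloyd-Max style update: boundaries are midpoints between centroids,
        # each centroid becomes the density-weighted mean of its cell.
        boundaries = (centroids[:-1] + centroids[1:]) / 2
        idx = torch.bucketize(grid, boundaries)
        mass = torch.zeros_like(centroids).index_add_(0, idx, pdf)
        moment = torch.zeros_like(centroids).index_add_(0, idx, pdf * grid)
        centroids = torch.where(mass > 0, moment / mass.clamp_min(1e-300), centroids)

    boundaries = (centroids[:-1] + centroids[1:]) / 2
    # Store on the caller's device: the only point where `device` is touched.
    return CodebookSketch(
        centroids=centroids.to(device=device, dtype=torch.float32),
        boundaries=boundaries.to(device=device, dtype=torch.float32),
    )


cb = build_codebook_sketch(b=4, device=torch.device("cpu"))
x = torch.tensor([-0.5, 0.0, 0.5])
print(cb.dequantize(cb.quantize(x), x))
```

Because every intermediate tensor derives from the CPU grid, helper routines called inside the loop never see the caller's device, which is why only the entry points need to change.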

Verification

Verified end-to-end with a downstream consumer's bench harness:

  • TinyLlama-1.1B FP16 on MPS, β1a KV-cache compression hook (uses compute_lloyd_max_codebook(d=64, b=4, device='mps') via TurboQuantConfig)
  • Factory build succeeds; hook applied to MPS-resident KV cache; full forward-pass loop completes
  • Output perplexity is finite (PPL=557.28 for N=1, L=64 at b_mse=4; sanity result, not a quality measurement)
  • 34/34 downstream test suite passes (31 fast unit tests + 3 slow real-model integration tests), including the previously-skipped MPS hook path
  • CPU path unchanged: compute_lloyd_max_codebook(d=64, b=4, device='cpu') continues to return CPU tensors at fp32

Discovery context

Discovered during gamma-seeds tern-core R-track MPS validation (2026-05-16): both make_b_mse_hook and make_b_mse_hook_uniform factories — KV-cache compression hooks used to evaluate TurboQuant under the R12 KV-cache-compression PPL diagnostic — invoke TurboQuantConfig.__init__ with device='mps'. Both factories failed at construction before any hook could be applied. This fix unblocks downstream MPS-resident KV-cache compression benchmarking.

Diff stats

src/cache.py: +31 / −8 (net +23 lines, mostly added comments + minimal .to(device) plumbing)

🤖 Generated with Claude Code

MPS framework doesn't support float64 dtype. compute_lloyd_max_codebook
(line 284) and compute_online_codebook (line 326) both hardcode
dtype=torch.float64 for their optimization grids, failing with TypeError
when callers pass device=torch.device('mps').

Fix: force internal computation onto CPU at fp64 in both functions
(preserving the algorithms' literature-standard precision for codebook
centroid optimization), then move the final centroids/boundaries to the
caller's target device when constructing the returned Codebook dataclass.

This fits the existing Codebook architecture: the dataclass's quantize/
dequantize methods (lines 214-223) already handle device migration at
usage time via .to(device=x.device, dtype=x.dtype). The fix sits at the
natural device-firewall: build on CPU, store on caller's device, use on
operand's device. _beta_pdf and _solve_lloyd_max inherit CPU automatically
through tensor argument propagation; no edits needed there.

Discovered during gamma-seeds tern-core R-track MPS validation
(2026-05-16). Both make_b_mse_hook and make_b_mse_hook_uniform factories
invoke TurboQuantConfig with device='mps' for KV-cache compression hooks;
both fail at TurboQuantConfig.__init__ before any hook is applied.

Verified end-to-end with tern-core's R7-B v1.2 harness on TinyLlama-1.1B
FP16 MPS: β1a hook now produces finite PPL output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>