fix(cache): compute codebooks on CPU at fp64 for MPS compatibility by synapticode-ai · Pull Request #5 · OnlyTerp/turboquant

synapticode-ai · 2026-05-17T05:41:34Z

Summary

compute_lloyd_max_codebook and compute_online_codebook both hardcode dtype=torch.float64 for their optimization grids. PyTorch's MPS backend does not support float64, so both functions fail with TypeError: Cannot convert a MPS Tensor to float64 dtype when callers pass device=torch.device('mps').

This PR fixes both functions by forcing the internal computation onto CPU at fp64 (preserving the algorithms' literature-standard precision for codebook centroid optimization), then moving the final centroids/boundaries to the caller's target device when constructing the returned Codebook dataclass.
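
For readers unfamiliar with the optimization being protected here, the following is a minimal sketch of a grid-based Lloyd-Max iteration run entirely on CPU at fp64. It is not the code in src/cache.py: the function name lloyd_max_sketch, the grid size, the iteration count, and the density (the marginal of one coordinate of a uniform unit vector in R^d, proportional to (1 - x^2)^((d - 3) / 2)) are all assumptions made for illustration.

```python
import torch


def lloyd_max_sketch(d: int, b: int, grid_size: int = 20001, iters: int = 100):
    """Illustrative Lloyd-Max on a fixed grid (hypothetical, assumes d >= 3)."""
    # The grid is pinned to CPU so that float64 is always available.
    x = torch.linspace(-1.0, 1.0, grid_size, device="cpu", dtype=torch.float64)
    # Assumed density; the real _beta_pdf may differ.
    w = (1.0 - x * x).clamp_min(0.0) ** ((d - 3) / 2.0)
    w = w / w.sum()

    levels = 2 ** b
    # Initialise centroids at evenly spaced quantiles of the discrete density.
    cdf = torch.cumsum(w, dim=0)
    targets = (torch.arange(levels, dtype=torch.float64) + 0.5) / levels
    centroids = x[torch.searchsorted(cdf, targets).clamp_max(grid_size - 1)]

    for _ in range(iters):
        # Boundaries are the midpoints between neighbouring centroids.
        boundaries = (centroids[:-1] + centroids[1:]) / 2
        cell = torch.bucketize(x, boundaries)  # cell index in 0..levels-1
        mass = torch.zeros(levels, dtype=torch.float64).index_add_(0, cell, w)
        moment = torch.zeros(levels, dtype=torch.float64).index_add_(0, cell, w * x)
        # Each centroid moves to the conditional mean of its cell;
        # empty cells keep their previous centroid.
        updated = torch.where(mass > 0, moment / mass.clamp_min(1e-300), centroids)
        converged = torch.allclose(updated, centroids, rtol=0.0, atol=1e-12)
        centroids = updated
        if converged:
            break

    boundaries = (centroids[:-1] + centroids[1:]) / 2
    return centroids, boundaries


centroids, boundaries = lloyd_max_sketch(d=64, b=4)
```

Both returned tensors are still fp64 CPU tensors at this point; moving them to the caller's device is a separate, final step.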

Repro

import torch
from turboquant.cache import compute_lloyd_max_codebook  # via the working `from src.cache import ...`
compute_lloyd_max_codebook(d=64, b=4, device=torch.device('mps'))
# TypeError: Cannot convert a MPS Tensor to float64 dtype
#   at cache.py:284 in torch.linspace(..., device=device, dtype=torch.float64)
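
The limitation can also be checked without turboquant. A minimal sketch, assuming only stock PyTorch; the helper name mps_rejects_float64 is hypothetical, and the check returns None on machines without an MPS device:

```python
from typing import Optional

import torch


def mps_rejects_float64() -> Optional[bool]:
    """True if allocating float64 on MPS raises, False if it works, None without MPS."""
    if not torch.backends.mps.is_available():
        return None  # nothing to check on this machine
    try:
        # Same kind of call the repro fails on: a float64 grid created on MPS.
        torch.linspace(0.0, 1.0, 8, device="mps", dtype=torch.float64)
    except (TypeError, RuntimeError):
        return True
    return False


result = mps_rejects_float64()
print(result)
```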

Fix architecture

The Codebook dataclass is already designed to be device-portable: its quantize/dequantize methods (lines 214–223) call .to(device=x.device, dtype=x.dtype) on the stored tensors at usage time. The fix sits at the natural device-firewall:

1. Build on CPU at fp64 (where fp64 is supported and Lloyd-Max convergence is numerically stable)
2. Store on caller's device (via .to(device) at Codebook construction)
3. Use on operand's device (via the existing .to(...) calls in quantize/dequantize)

This means _beta_pdf and _solve_lloyd_max need no edits — they inherit CPU automatically through tensor argument propagation once the entry-point functions force CPU on their grid construction.
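
The build/store/use split can be sketched as follows. This is an illustration under stated assumptions, not the patched code: SketchCodebook and build_codebook_sketch are hypothetical names, the field names are assumed, the uniform bins stand in for the real Lloyd-Max optimization, and storing at fp32 is an assumption based on the CPU path returning fp32 tensors.

```python
from dataclasses import dataclass

import torch


@dataclass
class SketchCodebook:
    """Hypothetical stand-in for the Codebook dataclass; field names are assumed."""

    centroids: torch.Tensor
    boundaries: torch.Tensor

    def quantize(self, x: torch.Tensor) -> torch.Tensor:
        # Use on the operand's device: migrate the stored tensor at call time.
        bounds = self.boundaries.to(device=x.device, dtype=x.dtype)
        return torch.bucketize(x, bounds)

    def dequantize(self, idx: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
        return self.centroids.to(device=idx.device, dtype=dtype)[idx]


def build_codebook_sketch(b: int, device: torch.device) -> SketchCodebook:
    levels = 2 ** b
    # Build on CPU at fp64, regardless of the requested device.
    edges = torch.linspace(-1.0, 1.0, levels + 1, device="cpu", dtype=torch.float64)
    centroids = (edges[:-1] + edges[1:]) / 2  # placeholder for the real optimization
    boundaries = edges[1:-1]
    # Store on the caller's device. The dtype is narrowed on CPU first,
    # so no float64 tensor is ever created on the target device.
    return SketchCodebook(
        centroids=centroids.to(torch.float32).to(device),
        boundaries=boundaries.to(torch.float32).to(device),
    )


cb = build_codebook_sketch(b=4, device=torch.device("cpu"))
x = torch.tensor([-0.9, 0.01, 0.9])
idx = cb.quantize(x)
x_hat = cb.dequantize(idx, dtype=x.dtype)
```

Passing torch.device("mps") instead of the CPU device exercises the same path on Apple hardware; helpers that only receive the CPU grid never see the target device.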

Verification

Verified end-to-end with a downstream consumer's bench harness:

- TinyLlama-1.1B FP16 on MPS, β1a KV-cache compression hook (uses compute_lloyd_max_codebook(d=64, b=4, device='mps') via TurboQuantConfig)
- Factory build succeeds; the hook is applied to the MPS-resident KV cache; the full forward-pass loop completes
- Output perplexity is finite (PPL=557.28 for N=1, L=64 at b_mse=4; a sanity result, not a quality measurement)
- The downstream test suite passes 34/34 (31 fast unit tests + 3 slow real-model integration tests), including the previously skipped MPS hook path
- CPU path unchanged: compute_lloyd_max_codebook(d=64, b=4, device='cpu') continues to return CPU tensors at fp32

Discovery context

Discovered during gamma-seeds tern-core R-track MPS validation (2026-05-16): both make_b_mse_hook and make_b_mse_hook_uniform factories — KV-cache compression hooks used to evaluate TurboQuant under the R12 KV-cache-compression PPL diagnostic — invoke TurboQuantConfig.__init__ with device='mps'. Both factories failed at construction before any hook could be applied. This fix unblocks downstream MPS-resident KV-cache compression benchmarking.

Diff stats

src/cache.py: +31 / −8 (net +23 lines, mostly added comments + minimal .to(device) plumbing)

🤖 Generated with Claude Code

MPS framework doesn't support float64 dtype. compute_lloyd_max_codebook (line 284) and compute_online_codebook (line 326) both hardcode dtype=torch.float64 for their optimization grids, failing with TypeError when callers pass device=torch.device('mps').

Fix: force internal computation onto CPU at fp64 in both functions (preserving the algorithms' literature-standard precision for codebook centroid optimization), then move the final centroids/boundaries to the caller's target device when constructing the returned Codebook dataclass.

This fits the existing Codebook architecture: the dataclass's quantize/dequantize methods (lines 214-223) already handle device migration at usage time via .to(device=x.device, dtype=x.dtype). The fix sits at the natural device-firewall: build on CPU, store on caller's device, use on operand's device. _beta_pdf and _solve_lloyd_max inherit CPU automatically through tensor argument propagation; no edits needed there.

Discovered during gamma-seeds tern-core R-track MPS validation (2026-05-16). Both make_b_mse_hook and make_b_mse_hook_uniform factories invoke TurboQuantConfig with device='mps' for KV-cache compression hooks; both fail at TurboQuantConfig.__init__ before any hook is applied.

Verified end-to-end with tern-core's R7-B v1.2 harness on TinyLlama-1.1B FP16 MPS: β1a hook now produces finite PPL output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

synapticode-ai mentioned this pull request May 17, 2026

feat(r12): sweep wrapper for KV-cache compression PPL headroom diagnostic (A'') gamma-seeds/tern-core#34

Merged

5 tasks

synapticode-ai wants to merge 1 commit into OnlyTerp:master from gamma-seeds:fix/lloyd-max-codebook-mps-fp64
