feat: Shared embedding cache across workspaces with optional Cloudflare sync #101

Description

@bradleat

Problem

When using ck across git worktrees or team environments, identical code content is re-embedded repeatedly:

  • Worktrees: Creating a worktree from main requires full re-indexing (minutes)
  • Team: Each developer computes the same embeddings locally
  • CI/CD: Pipelines start fresh every run
  • Merges: Merging a feature branch re-computes embeddings that already exist

Proposed Solution

A content-addressed embedding cache with:

  1. Local storage: a SQLite database in a "base" directory, shared across workspaces
  2. Optional remote storage: Cloudflare D1 (metadata) + R2 (blobs) + Durable Objects (coordination)
  3. Safe GC: reference counting that checks remote references before deletion
  4. Sync: monotonic sequence numbers, per-item acknowledgment, and streaming for large payloads

Architecture Overview

LOCAL:
  ~/gitspace/project/base/.ck/
    embeddings.db          # SQLite: metadata + blobs + refs
    
  ~/gitspace/project/workspaces/feature-x/.ck/
    config.toml            # base = "../../base"
    manifest.bin.zst       # File → chunk mappings
    ann_index.bin          # HNSW index

CLOUDFLARE (Optional):
  D1: metadata (hash, seq, r2_key, model_id, refs, audit_log)
  R2: embeddings/{hash}.bin (blobs, $0.015/GB/mo, no size limit)
  DO: EmbeddingCoordinator (seq numbers, compute locks, rate limits, heartbeats)
  Worker: /handshake, /embed, /push, /pull, /ref-count, /heartbeat
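As a rough illustration of how a workspace might locate the shared database, the sketch below resolves the `base` value from config.toml against the workspace root. Resolving relative to the workspace root (rather than the `.ck` directory) is an assumption, chosen because it is the reading that matches the layout above; the function name `resolve_cache_db` is hypothetical.

```python
import posixpath


def resolve_cache_db(workspace_root: str, base: str) -> str:
    """Return the path of the shared embeddings.db for a workspace.

    Assumption: `base` is interpreted relative to the workspace root,
    i.e. the directory that contains the workspace's .ck/ folder.
    """
    # Collapse the "../.." segments without touching the filesystem.
    base_dir = posixpath.normpath(posixpath.join(workspace_root, base))
    return posixpath.join(base_dir, ".ck", "embeddings.db")


print(resolve_cache_db("~/gitspace/project/workspaces/feature-x", "../../base"))
```

With the layout shown, this yields `~/gitspace/project/base/.ck/embeddings.db`, so every workspace under `workspaces/` ends up pointing at the same database.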

Key Design Decisions

Content-Addressed Storage

  • Hash = blake3(model_id || model_version || content)
  • Same content = same hash = reuse embedding
  • Model changes = different hash = no accidental mixing
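A minimal sketch of the key derivation, under two assumptions that are not part of the proposal: `hashlib.blake2b` stands in for blake3 (blake3 is not in the Python standard library), and each field is length-prefixed so that the concatenation is unambiguous. The function name `cache_key` is hypothetical.

```python
import hashlib


def cache_key(model_id: str, model_version: str, content: bytes) -> str:
    """Derive a content-addressed key for one chunk's embedding."""
    # Stand-in hash: the proposal specifies blake3; blake2b is used here
    # only because it ships with Python.
    h = hashlib.blake2b(digest_size=32)
    for part in (model_id.encode("utf-8"), model_version.encode("utf-8"), content):
        # Length-prefix every field so ("ab", "c") and ("a", "bc")
        # cannot produce the same input stream.
        h.update(len(part).to_bytes(8, "little"))
        h.update(part)
    return h.hexdigest()


same = cache_key("model-a", "1", b"fn main() {}") == cache_key("model-a", "1", b"fn main() {}")
bumped = cache_key("model-a", "2", b"fn main() {}") == cache_key("model-a", "1", b"fn main() {}")
print(same, bumped)
```

Identical content under the same model maps to the same key, while bumping the model version changes the key, which is what keeps embeddings from different models from being mixed.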

Reference Counting for Safe GC

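-- Step 1: before deleting, ask the remote which candidates are still referenced:
--   POST /api/ref-count { hashes: [...] }
-- Step 2: delete only hashes with no local AND no remote refs
--   (skip any hash the remote reported as referenced)
DELETE FROM embeddings
WHERE hash NOT IN (SELECT hash FROM refs)
  AND created_at < unixepoch() - 3600;  -- 1hr grace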
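The GC rule can be sketched end to end with an in-memory SQLite database. The two-table schema, the function name `collect_garbage`, and the `remote_ref_counts` callback (a stand-in for the `POST /api/ref-count` call) are all assumptions made for illustration; the actual schema lives in the full RFC.

```python
import sqlite3
import time

# Hypothetical minimal schema; the real one is not shown in this issue.
SCHEMA = """
CREATE TABLE embeddings (hash TEXT PRIMARY KEY, created_at INTEGER NOT NULL);
CREATE TABLE refs (hash TEXT NOT NULL, workspace TEXT NOT NULL);
"""


def collect_garbage(conn, remote_ref_counts, now=None, grace_seconds=3600):
    """Delete embeddings that have no local refs, no remote refs,
    and are older than the grace period. Returns the deleted hashes."""
    now = int(time.time()) if now is None else now
    # Local candidates: unreferenced and past the grace period.
    candidates = [row[0] for row in conn.execute(
        "SELECT e.hash FROM embeddings e "
        "WHERE NOT EXISTS (SELECT 1 FROM refs r WHERE r.hash = e.hash) "
        "AND e.created_at < ? ORDER BY e.hash",
        (now - grace_seconds,),
    )]
    if not candidates:
        return []
    # Remote check happens before any delete.
    counts = remote_ref_counts(candidates)
    # A hash missing from the response is treated as referenced (kept).
    doomed = [h for h in candidates if counts.get(h, 1) == 0]
    conn.executemany("DELETE FROM embeddings WHERE hash = ?", [(h,) for h in doomed])
    conn.commit()
    return doomed
```

Treating a hash that is absent from the remote response as still referenced is a deliberate choice in this sketch: an incomplete response then leaks storage rather than deleting data another workspace needs.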

Monotonic Sequence Numbers for Sync

  • Durable Object maintains global sequence
  • Sync by sequence, not timestamp (avoids clock skew issues)
  • Per-item acknowledgment on push (handles partial failures)
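To make the sync model concrete, here is a small in-memory sketch of a coordinator that assigns sequence numbers on push and serves pulls by cursor. The class name `Coordinator`, the status strings, and the empty-hash rejection rule are hypothetical; the real coordinator would be the Durable Object, and the wire protocol is defined in the full RFC rather than here.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """In-memory stand-in for the component that owns the global sequence."""
    seq: int = 0
    log: list = field(default_factory=list)   # (seq, hash) in commit order
    known: set = field(default_factory=set)

    def push(self, hashes):
        """Accept items one at a time and report a status for each,
        so one bad item does not fail the whole batch."""
        results = []
        for h in hashes:
            if not h:                       # hypothetical validation failure
                results.append((h, "rejected"))
            elif h in self.known:
                results.append((h, "duplicate"))
            else:
                self.seq += 1               # sequence only ever increases
                self.known.add(h)
                self.log.append((self.seq, h))
                results.append((h, "ok"))
        return results

    def pull(self, after_seq):
        """Return items committed after the client's cursor and the new cursor."""
        items = [(s, h) for s, h in self.log if s > after_seq]
        new_cursor = items[-1][0] if items else after_seq
        return items, new_cursor
```

A client only has to persist the last cursor it received. Because the cursor is a counter issued by one coordinator rather than a wall-clock time, two machines with skewed clocks still agree on what "newer" means.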

Compute Locks (Cloudflare)

  • A Durable Object lock prevents duplicate computation when two clients need the same embedding
  • The first client acquires the lock, computes the embedding, and stores it
  • The second client waits, then reads the result from the cache

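The lock behaviour can be illustrated in a single process with threads standing in for clients. This is only an analogy: in the proposal the lock lives in a Durable Object and the clients are separate machines. The class name `ComputeCoordinator` and the `compute` callback are hypothetical.

```python
import threading


class ComputeCoordinator:
    """Per-key lock around a cache, so each embedding is computed once."""

    def __init__(self, compute):
        self._compute = compute            # function: content -> embedding
        self._cache = {}
        self._locks = {}
        self._guard = threading.Lock()     # protects the _locks dict itself

    def get(self, key, content):
        # Fetch or create the lock for this key.
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # The first caller computes and stores; later callers
            # block on the lock and then find the cached value.
            if key not in self._cache:
                self._cache[key] = self._compute(content)
            return self._cache[key]
```

Locking per key rather than globally means that clients requesting different embeddings never wait on each other.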
CLI Commands

# Workspace management
ck init --base ../../base         # Link workspace to base
ck workspace list                 # Show registered workspaces

# Cache management  
ck cache stats                    # Size, hit rate, workspaces
ck cache gc                       # Clean unreferenced embeddings
ck cache gc --dry-run             # Preview deletions

# Cloudflare sync (when configured)
ck cache push                     # Push new embeddings
ck cache pull                     # Pull from remote
ck cache sync                     # Bidirectional

# Diagnostics
ck doctor                         # Comprehensive health check

Migration Path

Phase 1: Local Shared Cache

  • SQLite in base directory
  • Reference counting GC
  • Workspace registration
  • Config: base = "../../base"

Phase 2: Cloudflare Sync

  • D1 for metadata, R2 for blobs
  • Durable Object for coordination
  • Push/pull commands
  • Per-user JWT auth

Phase 3: Team Features

  • Onboarding wizard (ck cloudflare join)
  • ck doctor diagnostics
  • Usage analytics

Cost Estimate (Cloudflare, 10 devs, 1M embeddings)

Resource                  Cost
R2 Storage (~1.5GB)       $0.02/mo
D1 Storage (~100MB)       $0.08/mo
D1 + Workers ops          ~$1/mo
Workers AI (optional)     ~$3/mo
Total                     ~$5-20/mo

Open Questions

  1. Embedding format: Float32 vs quantized Int8 (4x smaller)?
  2. Compression: Zstd compress in R2?
  3. Model migration: Tooling when embedding model updates?
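For the first question, the size trade-off can be made concrete with a small sketch. It assumes 384-dimensional vectors (an assumed dimension, chosen because 1M such Float32 vectors come to roughly the ~1.5GB in the cost estimate) and symmetric per-vector Int8 quantization with one Float32 scale per vector. Neither choice comes from the proposal, and the function names are hypothetical.

```python
import struct


def pack_float32(vec):
    """Serialize a vector as little-endian float32 values (4 bytes each)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def quantize_int8(vec):
    """Symmetric per-vector quantization: one float32 scale, then int8 values."""
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return struct.pack("<f", scale) + struct.pack(f"<{len(q)}b", *q)


def dequantize_int8(blob):
    """Recover approximate float values from a quantize_int8 blob."""
    (scale,) = struct.unpack_from("<f", blob, 0)
    n = len(blob) - 4
    return [v * scale for v in struct.unpack_from(f"<{n}b", blob, 4)]


vec = [0.5, -1.0, 0.25, 0.0] * 96          # 384 dimensions
print(len(pack_float32(vec)), len(quantize_int8(vec)))
```

Under these assumptions a vector takes 1536 bytes as Float32 and 388 bytes as Int8, so 1M vectors drop from about 1.5GB to about 0.39GB. The cost is a rounding error of at most half a quantization step per component, and whether that is acceptable for search quality would need to be measured.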

Full RFC

A comprehensive RFC with schemas, Worker code, sync protocol details, and security considerations is available. Happy to share if helpful for discussion.


This would significantly improve the workflow for:

  • Developers using git worktrees
  • Teams sharing codebase understanding
  • CI/CD pipelines with warm caches
