
feat: add persistent embedding daemon to eliminate cold-start latency #6

Open
raoabinav wants to merge 5 commits into main from abinav/issue-166-warmup-LONFb


@raoabinav
Owner

Closes yichuan-w#166

What changed and why:

  • New leann serve command that starts a long-running background process and keeps the embedding model loaded in memory. Without it, the first leann search after boot normally takes 10-30s (model load + tokenizer init). With the daemon pre-warmed, it takes < 100ms.
  • Daemon lifecycle: --foreground, --stop, --status. Writes a PID file to ~/.leann/daemon.pid and state to ~/.leann/daemon.json. --stop sends SIGTERM and cleans up the PID file. --status reads the state file and checks if the process is still alive.
  • Added EmbeddingServerManager.connect_to_daemon(): before starting a new server subprocess, the manager now checks whether a daemon is already running on the expected port and reuses it. This is the key integration point: existing leann search commands transparently benefit from the daemon without any code changes in the search path.
  • Chose a process-level daemon over a thread-level one because the embedding model (especially sentence-transformers with PyTorch) has a GIL-heavy computation path. A separate process avoids GIL contention with the ZMQ server's event loop. The daemon uses the same ZMQ protocol as the subprocess-based server, so the client side needs zero changes.
  • Relationship with PR 1 (cold-start fix): PR 1 fixes the cold-start for one-off queries (retry + no double-check). This PR eliminates cold-start entirely for repeated usage (the daemon stays warm). They're complementary: PR 1 is the safety net, this PR is the performance optimization.
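
The --status check and the daemon-reuse step described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `pid_is_alive`, `port_is_open`, and `daemon_available` are hypothetical helper names, the PID file is assumed to contain a bare integer, and the liveness probe via signal 0 is POSIX-only.

```python
import os
import socket
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence/permissions; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def port_is_open(port: int, host: str = "127.0.0.1", timeout: float = 0.2) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def daemon_available(pid_file: Path, port: int) -> bool:
    """Hypothetical check: PID file is readable, process is alive, port answers."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False
    return pid_is_alive(pid) and port_is_open(port)
```

A caller would try `daemon_available(Path.home() / ".leann" / "daemon.pid", port)` first and fall back to spawning a subprocess server only when it returns False.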

…yichuan-w#166)

Adds `leann serve` command that starts a background embedding server daemon,
keeping the model warm between searches. Reduces first-search latency from
30-60s to near-zero by avoiding repeated model loads.

- New `embedding_daemon.py` with daemon lifecycle management (start/stop/status)
- Heartbeat-based health monitoring with stale state cleanup
- EmbeddingServerManager auto-detects running daemon before spawning new servers
- CLI: `leann serve`, `leann serve --stop`, `leann serve --status`
- 18 unit tests covering state management, integration, and CLI

https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
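
One way the heartbeat-based health monitoring with stale state cleanup might look is sketched below. The names `write_heartbeat` and `read_live_state`, the `last_heartbeat` key, and the 30-second staleness threshold are all assumptions made for illustration; the actual `embedding_daemon.py` may differ.

```python
import json
import time
from pathlib import Path
from typing import Optional

STALE_AFTER_SECONDS = 30.0  # assumed threshold, not taken from the PR


def write_heartbeat(state_file: Path, state: dict) -> None:
    """Daemon side: rewrite the state file with a fresh timestamp."""
    payload = {**state, "last_heartbeat": time.time()}
    tmp = state_file.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload))
    tmp.replace(state_file)  # rename so readers never see a half-written file


def read_live_state(state_file: Path,
                    max_age: float = STALE_AFTER_SECONDS) -> Optional[dict]:
    """Client side: return the state if the heartbeat is fresh, else clean up."""
    try:
        state = json.loads(state_file.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if not isinstance(state, dict):
        return None
    age = time.time() - float(state.get("last_heartbeat", 0.0))
    if age > max_age:
        state_file.unlink(missing_ok=True)  # stale: remove so later callers skip it
        return None
    return state
```

The daemon would call `write_heartbeat` periodically from its main loop; a client that gets None back treats the daemon as absent and starts its own server.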
…aemon

- Remove sys.exit() from signal handler to prevent SystemExit during
  arbitrary code; use shutdown flag instead for clean exit
- Redirect daemon subprocess stderr to ~/.leann/daemon.log instead of
  DEVNULL so startup failures can be diagnosed
- Include log file path in error messages when daemon fails to start

https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
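
The shutdown-flag pattern from the commit above can be sketched as follows. This is a minimal illustration, not the PR's code: `ShutdownFlag` and `serve_until_shutdown` are hypothetical names, and a plain boolean is used so the handler does nothing beyond recording the request.

```python
import signal
import time


class ShutdownFlag:
    """Holds a boolean that the signal handler flips; the main loop polls it."""

    def __init__(self) -> None:
        self.requested = False

    def handle(self, signum, frame) -> None:
        # Only record the request. Calling sys.exit() here would raise
        # SystemExit inside whatever code the signal happened to interrupt.
        self.requested = True

    def install(self) -> None:
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle)
        signal.signal(signal.SIGINT, self.handle)


def serve_until_shutdown(flag: ShutdownFlag, poll_interval: float = 0.1) -> None:
    """Stand-in for the daemon loop: idle until a shutdown is requested."""
    while not flag.requested:
        time.sleep(poll_interval)
    # Cleanup (removing PID/state files, stopping the server) would go here,
    # running in normal control flow rather than inside the handler.
```

In a daemon entry point this would be used as `flag = ShutdownFlag(); flag.install(); serve_until_shutdown(flag)`, after which cleanup runs and the process returns normally.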
The log_fh was opened and passed to subprocess.Popen but the parent
process never closed its copy, leaking a file descriptor.

https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
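
A sketch of the descriptor-leak fix described above: the child process inherits its own duplicate of the log file descriptor, so the parent can close its copy as soon as `subprocess.Popen` returns. The function name `spawn_with_log` is hypothetical; only `log_fh` and the use of `subprocess.Popen` come from the commit message.

```python
import subprocess
from pathlib import Path
from typing import List


def spawn_with_log(cmd: List[str], log_path: Path) -> subprocess.Popen:
    """Start cmd with stderr appended to log_path, without leaking the handle."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "ab") as log_fh:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=log_fh,  # the child gets its own duplicate of this descriptor
        )
    # Leaving the with-block closes the parent's copy; the child keeps writing.
    return proc
```

Using a `with` block rather than an explicit `log_fh.close()` also closes the handle if `Popen` itself raises, for example when the executable is missing.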
The daemon started the embedding server without --passages-file, which
meant recompute mode (HNSW needs to resolve passage IDs during graph
construction) would silently fail when search went through the daemon.

Thread passages_file through:
- run_daemon() → _start_background() / _run_foreground()
- _run_foreground() → manager.start_server(passages_file=...)
- CLI --passages-file arg for python -m leann.embedding_daemon
- Stored in daemon.json state so clients can verify compatibility

https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
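
The compatibility check that storing `passages_file` in daemon.json enables might look like the sketch below. The function name `daemon_matches` and the state keys `port` and `passages_file` are assumptions for illustration, as is the policy that a client needing no passages file accepts any daemon on the right port.

```python
import json
from pathlib import Path
from typing import Optional


def daemon_matches(state_file: Path, port: int,
                   passages_file: Optional[str]) -> bool:
    """Return True if the recorded daemon state fits this client's needs."""
    try:
        state = json.loads(state_file.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return False
    if not isinstance(state, dict) or state.get("port") != port:
        return False
    if passages_file is None:
        return True  # assumed policy: no passages needed, any daemon will do
    recorded = state.get("passages_file")
    if recorded is None:
        return False  # daemon was started without passages; recompute would fail
    # Compare resolved paths so relative and absolute spellings agree.
    return Path(recorded).resolve() == Path(passages_file).resolve()
```

A client that gets False here would skip the daemon and start its own server with the right passages file instead of sending queries the daemon cannot resolve.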
Development

Successfully merging this pull request may close these issues.

[Feature] Better warmup strategy
