
Fix cold-start latency + ZMQ reliability#1

Open
raoabinav wants to merge 92 commits into main from abinav/issue-177-cold-start-LONFb

Conversation

@raoabinav
Owner

Closes yichuan-w#177, yichuan-w#182

What changed and why:

  • Removed redundant _ensure_server_running call inside compute_query_embedding. Previously, every query embedding triggered a second server health check even though the caller (api.py) already called _ensure_server_running. This was the main source of cold-start latency — a double server startup on first query. Now compute_query_embedding trusts the port it receives from the caller, eliminating the redundant check.
  • Added ZMQ retry with exponential backoff (0.5s, 1s, 2s) in _compute_embedding_via_server. The original code did a single ZMQ send/recv with no retry — any transient connection failure (server still loading, socket not yet bound) was a hard crash. Now retries up to 3 times with proper socket teardown between attempts.
  • Set the ZMQ socket options SNDTIMEO=10s and LINGER=0. Without SNDTIMEO, a dead server caused socket.send() to hang indefinitely. Without LINGER=0, socket.close() would block waiting for unsent messages, stalling the retry loop.
  • enable_warmup is now popped from kwargs instead of being forwarded. It was previously passed through to start_server(), which did not use it. Now _ensure_server_running consumes it and fires a dummy embedding request after server start, pre-loading the model into GPU memory before the first real query arrives.
  • Replaced print() with logging.getLogger(__name__) in searcher_base. The original used bare print("⚠️ ...") for error reporting, which is invisible in production and breaks structured logging pipelines.
  • Added timing instrumentation across the entire server lifecycle — start_server, _start_new_server, _ensure_server_running, compute_query_embedding all log elapsed time so you can actually profile where cold-start time goes.
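
The retry behaviour described in the ZMQ bullets can be sketched without any ZMQ dependency. The helper below is a minimal, transport-agnostic illustration and not the code in this PR: `call_with_backoff`, `attempt`, `cleanup`, and `retry_on` are hypothetical names, and it assumes one initial try followed by one retry per delay (0.5s, 1s, 2s). In the ZMQ case, `attempt` would open a socket and do the send/recv, and `cleanup` would close that socket (with LINGER=0 so the close returns immediately).

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_backoff(
    attempt: Callable[[], T],
    cleanup: Callable[[], None] = lambda: None,
    delays: Sequence[float] = (0.5, 1.0, 2.0),
    sleep: Callable[[float], None] = time.sleep,
    retry_on: tuple = (ConnectionError, TimeoutError),
) -> T:
    """Run attempt(); on a transient failure, clean up, wait, and try again."""
    last_error = None
    # Assumption: one initial try, then one retry after each delay.
    for delay in (0.0, *delays):
        if delay:
            sleep(delay)
        try:
            return attempt()
        except retry_on as exc:
            last_error = exc
            cleanup()  # e.g. close the socket so the next try starts fresh
    raise RuntimeError(f"gave up after {len(delays) + 1} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a caller that wants the default behaviour simply omits it.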

ASuresh0524 and others added 30 commits November 10, 2025 13:31
- Add ColQwenRAG class with easy-to-use CLI for multimodal PDF retrieval
- Support for both ColQwen2 and ColPali models with automatic device selection
- MPS optimization for Apple Silicon with memory-efficient loading
- Complete pipeline: PDF→images→embeddings→HNSW index→search
- Multi-vector indexing for fine-grained document matching
- Comprehensive user guide and reproduction test script
- Resolves yichuan-w#119: ColQwen Doc and Support Management

Features:
- python -m apps.colqwen_rag build --pdfs ./pdfs/ --index my_index
- python -m apps.colqwen_rag search my_index "query text"
- python -m apps.colqwen_rag ask my_index --interactive
- Automatic CPU fallback for memory constraints
- Robust error handling and progress tracking
- Add noqa comments for E402 errors (imports after sys.path modifications)
- Remove unused variable assignment in colqwen_rag.py
- Use importlib.util.find_spec for dependency checks instead of unused imports
- Fix import ordering in test_colqwen_reproduction.py
- Add apps/image_rag.py for indexing and searching images using CLIP embeddings
- Supports text-based image search queries
- Uses CLIP ViT-L/14 model via sentence-transformers
- Follows the same pattern as other RAG apps in the apps directory
- Addresses feature request for CLIP support in apps (issue yichuan-w#94)
…ichuan-w#179)

Fixes yichuan-w#175

Problem:
When --file-types .pdf is specified, PDFs were being processed twice:
1. Separately with PyMuPDF/pdfplumber extractors
2. Again in the 'other file types' section via SimpleDirectoryReader

This caused duplicate processing and potential conflicts.

Solution:
- Exclude .pdf from other_file_extensions when PDFs are already
  processed separately
- Only load other file types if there are extensions to process
- Prevents duplicate PDF processing

Changes:
- Added logic to filter out .pdf from code_extensions when loading
  other file types if PDFs were processed separately
- Updated SimpleDirectoryReader to use filtered extensions
- Added check to skip loading if no other extensions to process
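
The filtering step described in this commit can be illustrated with a small sketch. The function name and parameters are hypothetical and are not taken from the repository; the sketch only shows the idea of dropping `.pdf` when PDFs were already handled, and returning an empty list so the caller can skip the generic loader.

```python
def extensions_for_generic_loader(requested, pdfs_handled_separately):
    """Return the extensions left for the generic reader ([] means skip it)."""
    normalized = [ext.lower() for ext in requested]
    if pdfs_handled_separately:
        # PDFs already went through the dedicated extractor; do not load them twice.
        normalized = [ext for ext in normalized if ext != ".pdf"]
    return normalized


remaining = extensions_for_generic_loader([".pdf", ".md"], pdfs_handled_separately=True)
if remaining:
    print("generic loader handles:", remaining)
```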
…r multi-vector… (yichuan-w#161)

* Add timing instrumentation and multi-dataset support for multi-vector retrieval

- Add timing measurements for search operations (load and core time)
- Increase embedding batch size from 1 to 32 for better performance
- Add explicit memory cleanup with del all_embeddings
- Support loading and merging multiple datasets with different splits
- Add CLI arguments for search method selection (ann/exact/exact-all)
- Auto-detect image field names across different dataset structures
- Print candidate doc counts for performance monitoring

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* update vidore

* reproduce docvqa results

* reproduce docvqa results and add debug file

---------

Co-authored-by: Claude <noreply@anthropic.com>
* Add timing instrumentation and multi-dataset support for multi-vector retrieval

- Add timing measurements for search operations (load and core time)
- Increase embedding batch size from 1 to 32 for better performance
- Add explicit memory cleanup with del all_embeddings
- Support loading and merging multiple datasets with different splits
- Add CLI arguments for search method selection (ann/exact/exact-all)
- Auto-detect image field names across different dataset structures
- Print candidate doc counts for performance monitoring

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* update vidore

* reproduce docvqa results

* reproduce docvqa results and add debug file

* fix: format colqwen_forward.py to pass pre-commit checks

---------

Co-authored-by: Claude <noreply@anthropic.com>
Reset faiss submodule to match main branch to avoid unnecessary changes
- Add ColQwen2.5 and ColQwen2_5_Processor imports
- Implement smart model type detection for colqwen2, colqwen2.5, and colpali
- Add task name aliases for easier benchmark invocation
- Add safe model name handling for file paths and index naming
- Support custom model paths including LoRA adapters
- Improve model choice validation and error handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>
Add brief introduction and usage guide for ColQwen integration,
similar to other RAG application sections in the README.

- Quick start examples for building, searching, and interactive Q&A
- Setup instructions with prerequisites
- Model options (ColQwen2 vs ColPali)
- Link to detailed ColQwen guide
Add COLQWEN_GUIDE.md to docs/ directory for proper documentation structure.
This file is referenced in the README and needs to be tracked in git.
Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
* Add Anthropic LLM support

Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>

* Update skypilot link

Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>

* Handle anthropic base_url

Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>

* Address ruff format finding

Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>

---------

Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
yichuan-w#188)

* Add custom folder support and improve image loading for multi-vector retrieval

- Enhanced _load_images_from_dir with recursive search support and better error handling
- Added support for WebP format and RGB conversion for all image modes
- Added custom folder CLI arguments (--custom-folder, --recursive, --rebuild-index)
- Improved documentation and removed completed TODO comment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Format code style in leann_multi_vector.py for better readability

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…ration

add ColQwen multimodal PDF retrieval integration
…w#189)

* Add custom folder support and improve image loading for multi-vector retrieval

- Enhanced _load_images_from_dir with recursive search support and better error handling
- Added support for WebP format and RGB conversion for all image modes
- Added custom folder CLI arguments (--custom-folder, --recursive, --rebuild-index)
- Improved documentation and removed completed TODO comment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Format code style in leann_multi_vector.py for better readability

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* docs: polish README performance tip section

- Fix typo: 'matrilize' -> 'materialize'
- Improve clarity and formatting of --no-recompute flag explanation
- Add code block for better readability

* format

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…huan-w#157)

PR yichuan-w#157 changed create_text_chunks() to return list[dict] instead of
list[str] to preserve metadata, but base_rag_example.py was not updated
to handle the new format. This caused all chunks to fail validation
with "All provided chunks are empty or invalid".
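
A caller that must cope with both return formats could normalize chunks as in the sketch below. This is an illustration only: `chunk_texts` is a hypothetical helper, and the `"text"` key is an assumption about the dict layout rather than something stated in this commit.

```python
from typing import Any


def chunk_texts(chunks: list[Any]) -> list[str]:
    """Extract non-empty text from chunks given as str or as dict."""
    texts = []
    for chunk in chunks:
        # Assumption: dict chunks carry their content under the "text" key.
        text = chunk.get("text", "") if isinstance(chunk, dict) else chunk
        if isinstance(text, str) and text.strip():
            texts.append(text)
    return texts
```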
* Drop Python 3.9 support, require Python 3.10+

Python 3.9 reached end-of-life and the codebase uses PEP 604 union
type syntax (str | None) which requires Python 3.10+.

Changes:
- Remove Python 3.9 from CI build matrix
- Update requires-python to >=3.10 in all pyproject.toml files
- Update classifiers to reflect supported Python versions (3.10-3.13)

* Remove macos-13 from CI build matrix

GitHub Actions deprecated macos-13 runner (brownout started Sept 2025,
fully retired Dec 2025). See: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/

* Add macos-15-large for Intel Mac builds

Replace deprecated macos-13 with macos-15-large (x86_64 Intel)
to continue supporting Intel Mac users.

* Set MACOSX_DEPLOYMENT_TARGET=13.x for Intel builds

Intel Mac wheels (macos-15-large) now target macOS 13.0/13.3 for
backward compatibility, allowing macOS 13/14/15 Intel users to
install pre-built wheels.

* Remove Intel Mac builds (macos-15-large requires paid plan)

Intel Mac users can build from source. This avoids:
- Paid GitHub Actions runners (macos-15-large)
- Complex cross-compilation setup

* Add macos-15-intel for Intel Mac builds (free runner)

Use macos-15-intel (free standard runner) instead of macos-15-large
(paid). This provides Intel Mac wheel support until Aug 2027.

- MACOSX_DEPLOYMENT_TARGET=13.0 for backward compatibility
- Replaces deprecated macos-13 runner

* Add macOS 26 (beta) to build matrix

Add macos-26 (arm64) runner to the build matrix for testing future
macOS compatibility. This is currently a beta runner that helps ensure
wheels work on upcoming macOS versions.

* Fix macos-15-intel deployment target

The macos-15-intel runner runs macOS 15.7, so Homebrew libraries are
built for macOS 14+. Setting MACOSX_DEPLOYMENT_TARGET=13.0 causes
delocate to fail because system libraries require newer macOS.

Fix by setting deployment target to 15.0 for macos-15-intel, matching
the actual OS version. Intel Mac users will need macOS 15+.

* Exclude macos-15-intel + Python 3.13 (no PyTorch wheels available)
…uan-w#157) (yichuan-w#192)

* Add ty type checker to CI and fix type errors

- Add ty (Astral's fast Python type checker) to GitHub CI workflow
- Fix type annotations across all RAG apps:
  - Update load_data return types from list[str] to list[dict[str, Any]]
  - Fix base_rag_example.py to properly handle dict format from create_text_chunks
- Fix type errors in leann-core:
  - chunking_utils.py: Add explicit type annotations
  - cli.py: Fix return type annotations for PDF extraction functions
  - interactive_utils.py: Fix readline import type handling
- Fix type errors in apps:
  - wechat_history.py: Fix return type annotations
  - document_rag.py, code_rag.py: Replace **kwargs with explicit arguments
- Add ty configuration to pyproject.toml

This resolves the bug introduced in PR yichuan-w#157 where create_text_chunks()
changed to return list[dict] but callers were not updated.

* Fix remaining ty type errors

- Fix slack_mcp_reader.py channel parameter can be None
- Fix embedding_compute.py ContextProp type issue
- Fix searcher_base.py method override signatures
- Fix chunking_utils.py chunk_text assignment
- Fix slack_rag.py and twitter_rag.py return types
- Fix email.py and image_rag.py method overrides

* Fix multimodal benchmark scripts type errors

- Fix undefined LeannRetriever -> LeannMultiVector
- Add proper type casts for HuggingFace Dataset iteration
- Cast task config values to correct types
- Add type annotations for dataset row dicts

* Enable ty check for multimodal scripts in CI

All type errors in multimodal scripts have been fixed, so we can now
include them in the CI type checking.

* Fix all test type errors and enable ty check on tests

- Fix test_basic.py: search() takes str not list
- Fix test_cli_prompt_template.py: add type: ignore for Mock assignments
- Fix test_prompt_template_persistence.py: match BaseSearcher.search signature
- Fix test_prompt_template_e2e.py: add type narrowing asserts after skip
- Fix test_readme_examples.py: use explicit kwargs instead of **model_args
- Fix metadata_filter.py: allow Optional[MetadataFilters]
- Update CI to run ty check on tests

* Format code with ruff

* Format searcher_base.py
* Add prompt template feature to README

Highlights performance optimization with task-specific prompt templates. Includes real-world benchmark data showing EmbeddingGemma 300M achieving 4-5x speed improvement over Qwen 600M while maintaining identical search quality.

Per maintainer request to promote this feature in main README for better discoverability.

* Fix typo: --embedding-prompt-template -> --query-prompt-template

---------

Co-authored-by: Andy Lee <andylizf@outlook.com>
…#197)

Thanks for the contribution! This is a nice improvement for better UX. 🎉
- Add Jina AI to the cloud providers table with (Embeddings) label
- Add tip section explaining how to use separate embedding provider
  with --embedding-api-base and --embedding-api-key flags
- Add LEANN_EMBEDDING_DEVICE env var for embedding model GPU selection
- Add LEANN_LLM_DEVICE env var for HFChat LLM GPU selection
- When specific GPU (e.g., cuda:1) is set, use it exclusively
- When set to "cuda" or unset, use device_map="auto" for multi-GPU
- Document env vars in README Common Parameters section
- Add batch_size parameter support in provider_options/embedding_options
- When user specifies batch_size, disable adaptive_optimization
- Keep default Qwen3-Embedding batch_size (32) as fallback
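
The device-selection rule in the LEANN_EMBEDDING_DEVICE and LEANN_LLM_DEVICE items above can be sketched as follows. The function name and the shape of the returned dict are hypothetical; only the environment variable names and the `device_map="auto"` fallback come from the commit message.

```python
import os


def resolve_device(env_var="LEANN_EMBEDDING_DEVICE", environ=None):
    """Map the env var to loader kwargs (hypothetical return shape)."""
    environ = os.environ if environ is None else environ
    value = environ.get(env_var, "").strip()
    if value in ("", "cuda"):
        # Unset or generic "cuda": let the loader spread across available GPUs.
        return {"device_map": "auto"}
    # A specific device such as "cuda:1" is used exclusively.
    return {"device": value}
```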
Content already exists in docs/configuration-guide.md. The section was
too prominent for an advanced feature and cluttered the README structure.
tolgakaratas and others added 29 commits January 26, 2026 03:39
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive documentation for Claude Code instances working
with this repository, including build commands, architecture overview,
testing instructions, and key design patterns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ensure ColQwenRAG always sets the processor when the model
reloads on CPU due to memory constraints, preventing embed-time
attribute errors.

Co-authored-by: Cursor <cursoragent@cursor.com>
Fail fast with a clear error when transformers>=4.46 is
installed, and delay colpali_engine imports until after the
version check to avoid HybridCache import crashes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Clean up unused type suppression comments flagged by ty
across apps and core packages.

Co-authored-by: Cursor <cursoragent@cursor.com>
Normalize torch.compile call formatting after ruff.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensure all index-* CLI commands accept embedding model/mode
arguments to match their builder usage.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ive-ingestion-and-formats

Comprehensive Data Ingestion & New Format Support
… suppressed search output (yichuan-w#242)

- Add `leann watch` CLI command that compares current files against the
  last Merkle tree checkpoint and reports added/removed/modified files
  with their associated chunk IDs.
- Integrate FileSynchronizer into `leann build` to create initial
  snapshots and persist sync config (sync_roots.json).
- Prepend line numbers to code file chunks (e.g. `42|def foo():`) so
  search results display exact line locations for code navigation.
- Trim partial first lines in code chunks caused by character-based
  overlap to ensure every chunk starts at a clean line boundary.
- Fix `suppress_cpp_output` swallowing Python print() along with C++
  output by redirecting sys.stdout/sys.stderr to saved fd copies while
  OS-level fds go to /dev/null.
- Update README with watch command documentation in Quick Start,
  Usage Examples, and Complete CLI Reference sections.
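
The line-number prefix format mentioned above (`42|def foo():`) can be produced with a helper along these lines. `number_lines` is a hypothetical name used for illustration; the repository's own implementation may differ.

```python
def number_lines(code: str, start: int = 1) -> str:
    """Prefix each line with its line number and a '|' separator."""
    return "\n".join(
        f"{number}|{line}" for number, line in enumerate(code.splitlines(), start)
    )


print(number_lines("def foo():\n    pass", start=42))
```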
…yichuan-w#177)

- Fix enable_warmup: pop kwarg in _ensure_server_running and send dummy
  embedding request after server starts (was previously passed to
  start_server as unused kwarg)
- Remove redundant _ensure_server_running call in compute_query_embedding;
  caller (api.py) already ensures server is running before search
- Add retry with exponential backoff (0.5s, 1s, 2s) to ZMQ client in
  _compute_embedding_via_server, with proper socket cleanup between attempts
- Add SNDTIMEO (10s) and LINGER(0) to ZMQ sockets for clean failure
- Add timing instrumentation throughout searcher_base.py and
  embedding_server_manager.py for diagnosing startup and query latency
- Add tests/test_cold_start.py with 13 unit tests

https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
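
One common way to add this kind of timing instrumentation is a small context manager that logs elapsed time. The sketch below is an assumption about how such logging could look, not the code in this commit; `log_elapsed` is a hypothetical name.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def log_elapsed(label, log=logger):
    """Log how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log.info("%s took %.3fs", label, time.perf_counter() - start)
```

Wrapping a call as `with log_elapsed("start_server"): ...` emits one INFO record per call, so the output follows whatever log level the application has configured.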
replace all print("[leann] ...") calls with proper logger.info/warning
so timing output respects LEANN_LOG_LEVEL and doesn't pollute stdout.

https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
…stence check (yichuan-w#245)

- Remove erroneous first-line trimming in create_ast_chunks that stripped
  function signatures (e.g. `def hello():`) by assuming all chunks have
  line-number prefixes starting with digits
- Move line number prepending from before AST chunking to after, so the
  AST parser receives valid source code instead of `1|def hello():` which
  breaks syntax tree parsing and causes fallback to naive text splitting
- Fix index existence check in base_rag_example.py to look for the actual
  .meta.json file instead of just the directory (empty temp dirs always exist)

Co-authored-by: Cursor <cursoragent@cursor.com>
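
The index existence fix above amounts to checking for the metadata file rather than the directory. A minimal sketch follows, assuming the metadata file is named `<index_path>.meta.json`; that naming and the `index_exists` helper are assumptions made for illustration.

```python
from pathlib import Path


def index_exists(index_path: str) -> bool:
    """True only if the metadata file is present, not merely the directory."""
    # Assumption: the metadata lives next to the index as <index_path>.meta.json.
    return Path(f"{index_path}.meta.json").is_file()
```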
- Add zmq_port is not None guard before server path (ty invalid-argument-type)
- Fix _TestSearcher.search override to match interface signature (ty invalid-method-override)
- Put logger.error/logger.warning on single lines (ruff format)

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@SuperPauly

This PR is rather large: it changes 78 files across 92 commits and has been going on for 4 months.

Will it be merged into main any time soon?



Development

Successfully merging this pull request may close these issues.

Search with recompute second level latency for code RAG