Fix cold-start latency + ZMQ reliability #1
Open
raoabinav wants to merge 92 commits into
- Add ColQwenRAG class with easy-to-use CLI for multimodal PDF retrieval
- Support for both ColQwen2 and ColPali models with automatic device selection
- MPS optimization for Apple Silicon with memory-efficient loading
- Complete pipeline: PDF→images→embeddings→HNSW index→search
- Multi-vector indexing for fine-grained document matching
- Comprehensive user guide and reproduction test script
- Resolves yichuan-w#119: ColQwen Doc and Support Management

Features:
- `python -m apps.colqwen_rag build --pdfs ./pdfs/ --index my_index`
- `python -m apps.colqwen_rag search my_index "query text"`
- `python -m apps.colqwen_rag ask my_index --interactive`
- Automatic CPU fallback for memory constraints
- Robust error handling and progress tracking
- Add noqa comments for E402 errors (imports after sys.path modifications)
- Remove unused variable assignment in colqwen_rag.py
- Use importlib.util.find_spec for dependency checks instead of unused imports
- Fix import ordering in test_colqwen_reproduction.py
- Add apps/image_rag.py for indexing and searching images using CLIP embeddings
- Supports text-based image search queries
- Uses CLIP ViT-L/14 model via sentence-transformers
- Follows the same pattern as other RAG apps in the apps directory
- Addresses feature request for CLIP support in apps (issue yichuan-w#94)
…ichuan-w#179)

Fixes yichuan-w#175

Problem: When --file-types .pdf is specified, PDFs were being processed twice:
1. Separately with PyMuPDF/pdfplumber extractors
2. Again in the 'other file types' section via SimpleDirectoryReader

This caused duplicate processing and potential conflicts.

Solution:
- Exclude .pdf from other_file_extensions when PDFs are already processed separately
- Only load other file types if there are extensions to process
- Prevents duplicate PDF processing

Changes:
- Added logic to filter out .pdf from code_extensions when loading other file types if PDFs were processed separately
- Updated SimpleDirectoryReader to use filtered extensions
- Added check to skip loading if no other extensions to process
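Here is a minimal Python sketch of the extension filtering. The helper `other_file_extensions` is hypothetical, not the function from the PR. It only shows two things: `.pdf` is dropped once PDFs have gone through the dedicated extractor, and the generic loader is skipped when no extensions remain.

```python
def other_file_extensions(requested: list[str], pdfs_processed_separately: bool) -> list[str]:
    """Return the extensions the generic loader should still handle."""
    # Normalize to lowercase with a leading dot, e.g. "PDF" -> ".pdf".
    normalized = [e.lower() if e.startswith(".") else f".{e.lower()}" for e in requested]
    if pdfs_processed_separately:
        # PDFs were already extracted by the dedicated PDF path; don't load them twice.
        normalized = [e for e in normalized if e != ".pdf"]
    return normalized


# With --file-types .pdf, nothing is left for the generic loader,
# so the loading step is skipped instead of re-reading the PDFs.
exts = other_file_extensions([".pdf"], pdfs_processed_separately=True)
if exts:
    print("load other files:", exts)
else:
    print("no other file types to load")
```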
…r multi-vector… (yichuan-w#161)

* Add timing instrumentation and multi-dataset support for multi-vector retrieval
  - Add timing measurements for search operations (load and core time)
  - Increase embedding batch size from 1 to 32 for better performance
  - Add explicit memory cleanup with del all_embeddings
  - Support loading and merging multiple datasets with different splits
  - Add CLI arguments for search method selection (ann/exact/exact-all)
  - Auto-detect image field names across different dataset structures
  - Print candidate doc counts for performance monitoring

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <noreply@anthropic.com>
* update vidore
* reproduce docvqa results
* reproduce docvqa results and add debug file

---------

Co-authored-by: Claude <noreply@anthropic.com>
…pport fo…" (yichuan-w#180) This reverts commit 00770ae.
* Add timing instrumentation and multi-dataset support for multi-vector retrieval
  - Add timing measurements for search operations (load and core time)
  - Increase embedding batch size from 1 to 32 for better performance
  - Add explicit memory cleanup with del all_embeddings
  - Support loading and merging multiple datasets with different splits
  - Add CLI arguments for search method selection (ann/exact/exact-all)
  - Auto-detect image field names across different dataset structures
  - Print candidate doc counts for performance monitoring

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <noreply@anthropic.com>
* update vidore
* reproduce docvqa results
* reproduce docvqa results and add debug file
* fix: format colqwen_forward.py to pass pre-commit checks

---------

Co-authored-by: Claude <noreply@anthropic.com>
Reset faiss submodule to match main branch to avoid unnecessary changes
- Add ColQwen2.5 and ColQwen2_5_Processor imports
- Implement smart model type detection for colqwen2, colqwen2.5, and colpali
- Add task name aliases for easier benchmark invocation
- Add safe model name handling for file paths and index naming
- Support custom model paths including LoRA adapters
- Improve model choice validation and error handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>
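Below is a minimal sketch of the model type detection and safe naming, assuming detection works by substring matching on the model name or path. `detect_model_type` and `safe_model_name` are hypothetical helpers, not the PR's code. The order of the checks matters: `colqwen2` is a substring of `colqwen2.5`, so the more specific name must be checked first.

```python
import re

# Ordered so that "colqwen2.5" is matched before "colqwen2", which is a
# substring of it.
_MODEL_FAMILIES = (
    ("colqwen2.5", "colqwen2.5"),
    ("colqwen2_5", "colqwen2.5"),
    ("colqwen2", "colqwen2"),
    ("colpali", "colpali"),
)


def detect_model_type(model_name: str) -> str:
    """Map a Hub ID or local path (e.g. a LoRA adapter dir) to a model family."""
    name = model_name.lower()
    for pattern, family in _MODEL_FAMILIES:
        if pattern in name:
            return family
    raise ValueError(f"Cannot infer model type from {model_name!r}")


def safe_model_name(model_name: str) -> str:
    """Turn a model name into something usable in file paths and index names."""
    # Collapse runs of characters outside [A-Za-z0-9._-] (such as "/") into "_".
    return re.sub(r"[^A-Za-z0-9._-]+", "_", model_name).strip("_")
```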
Add brief introduction and usage guide for ColQwen integration, similar to other RAG application sections in the README.

- Quick start examples for building, searching, and interactive Q&A
- Setup instructions with prerequisites
- Model options (ColQwen2 vs ColPali)
- Link to detailed ColQwen guide
Add COLQWEN_GUIDE.md to docs/ directory for proper documentation structure. This file is referenced in the README and needs to be tracked in git.
Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
* Add Anthropic LLM support

  Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
* Update skypilot link

  Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
* Handle anthropic base_url

  Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
* Address ruff format finding

  Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>

---------

Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
yichuan-w#188)

* Add custom folder support and improve image loading for multi-vector retrieval
  - Enhanced _load_images_from_dir with recursive search support and better error handling
  - Added support for WebP format and RGB conversion for all image modes
  - Added custom folder CLI arguments (--custom-folder, --recursive, --rebuild-index)
  - Improved documentation and removed completed TODO comment

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Format code style in leann_multi_vector.py for better readability

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
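Recursive image discovery can be sketched with `pathlib` alone. `find_images` and the extension set are assumptions for illustration, not the contents of `_load_images_from_dir`. The demo builds a throwaway folder to show what the recursive flag changes.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Assumed extension set; the commit mentions adding WebP.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif", ".tif", ".tiff", ".webp"}


def find_images(folder: str | Path, recursive: bool = False) -> list[Path]:
    """List image files in `folder`, descending into subfolders if `recursive`."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Image folder not found: {root}")
    candidates = root.rglob("*") if recursive else root.glob("*")
    # Case-insensitive suffix match so "page2.WEBP" is picked up too.
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "page1.png").touch()
    (root / "scans").mkdir()
    (root / "scans" / "page2.WEBP").touch()
    (root / "notes.txt").touch()
    top_level = [p.name for p in find_images(root)]
    everything = [p.relative_to(root).as_posix() for p in find_images(root, recursive=True)]
```

Each path would then typically be opened with Pillow and normalized via `Image.open(path).convert("RGB")`, which handles modes such as RGBA or palette images. That step is left out here to keep the sketch dependency-free.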
…ration add ColQwen multimodal PDF retrieval integration
…w#189)

* Add custom folder support and improve image loading for multi-vector retrieval
  - Enhanced _load_images_from_dir with recursive search support and better error handling
  - Added support for WebP format and RGB conversion for all image modes
  - Added custom folder CLI arguments (--custom-folder, --recursive, --rebuild-index)
  - Improved documentation and removed completed TODO comment

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Format code style in leann_multi_vector.py for better readability

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* docs: polish README performance tip section
  - Fix typo: 'matrilize' -> 'materialize'
  - Improve clarity and formatting of --no-recompute flag explanation
  - Add code block for better readability
* format

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…huan-w#157) PR yichuan-w#157 changed create_text_chunks() to return list[dict] instead of list[str] to preserve metadata, but base_rag_example.py was not updated to handle the new format. This caused all chunks to fail validation with "All provided chunks are empty or invalid".
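Below is a minimal sketch of a validator that accepts both the old `list[str]` and the new `list[dict]` chunk formats. It assumes dict chunks store their text under a `"text"` key and their metadata under `"metadata"`. Those key names and the `normalize_chunks` helper are hypothetical. A validator that only accepted `str` would reject every dict chunk, which matches the failure described above.

```python
from __future__ import annotations

from typing import Any


def normalize_chunks(chunks: list[Any]) -> list[dict[str, Any]]:
    """Accept legacy str chunks or new dict chunks; drop empty ones."""
    normalized: list[dict[str, Any]] = []
    for chunk in chunks:
        if isinstance(chunk, str):
            # Legacy format: wrap plain text so downstream code sees one shape.
            chunk = {"text": chunk, "metadata": {}}
        text = chunk.get("text") if isinstance(chunk, dict) else None
        if isinstance(text, str) and text.strip():
            normalized.append(chunk)
    if not normalized:
        raise ValueError("All provided chunks are empty or invalid")
    return normalized


chunks = normalize_chunks(
    ["plain text", "   ", {"text": "def f(): ...", "metadata": {"file_path": "a.py"}}]
)
```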
* Drop Python 3.9 support, require Python 3.10+

  Python 3.9 reached end-of-life and the codebase uses PEP 604 union type syntax (str | None) which requires Python 3.10+.

  Changes:
  - Remove Python 3.9 from CI build matrix
  - Update requires-python to >=3.10 in all pyproject.toml files
  - Update classifiers to reflect supported Python versions (3.10-3.13)
* Remove macos-13 from CI build matrix

  GitHub Actions deprecated macos-13 runner (brownout started Sept 2025, fully retired Dec 2025). See: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/
* Add macos-15-large for Intel Mac builds

  Replace deprecated macos-13 with macos-15-large (x86_64 Intel) to continue supporting Intel Mac users.
* Set MACOSX_DEPLOYMENT_TARGET=13.x for Intel builds

  Intel Mac wheels (macos-15-large) now target macOS 13.0/13.3 for backward compatibility, allowing macOS 13/14/15 Intel users to install pre-built wheels.
* Remove Intel Mac builds (macos-15-large requires paid plan)

  Intel Mac users can build from source. This avoids:
  - Paid GitHub Actions runners (macos-15-large)
  - Complex cross-compilation setup
* Add macos-15-intel for Intel Mac builds (free runner)

  Use macos-15-intel (free standard runner) instead of macos-15-large (paid). This provides Intel Mac wheel support until Aug 2027.
  - MACOSX_DEPLOYMENT_TARGET=13.0 for backward compatibility
  - Replaces deprecated macos-13 runner
* Add macOS 26 (beta) to build matrix

  Add macos-26 (arm64) runner to the build matrix for testing future macOS compatibility. This is currently a beta runner that helps ensure wheels work on upcoming macOS versions.
* Fix macos-15-intel deployment target

  The macos-15-intel runner runs macOS 15.7, so Homebrew libraries are built for macOS 14+. Setting MACOSX_DEPLOYMENT_TARGET=13.0 causes delocate to fail because system libraries require newer macOS. Fix by setting deployment target to 15.0 for macos-15-intel, matching the actual OS version. Intel Mac users will need macOS 15+.
* Exclude macos-15-intel + Python 3.13 (no PyTorch wheels available)
…uan-w#157) (yichuan-w#192)

* Add ty type checker to CI and fix type errors
  - Add ty (Astral's fast Python type checker) to GitHub CI workflow
  - Fix type annotations across all RAG apps:
    - Update load_data return types from list[str] to list[dict[str, Any]]
    - Fix base_rag_example.py to properly handle dict format from create_text_chunks
  - Fix type errors in leann-core:
    - chunking_utils.py: Add explicit type annotations
    - cli.py: Fix return type annotations for PDF extraction functions
    - interactive_utils.py: Fix readline import type handling
  - Fix type errors in apps:
    - wechat_history.py: Fix return type annotations
    - document_rag.py, code_rag.py: Replace **kwargs with explicit arguments
  - Add ty configuration to pyproject.toml

  This resolves the bug introduced in PR yichuan-w#157 where create_text_chunks() changed to return list[dict] but callers were not updated.
* Fix remaining ty type errors
  - Fix slack_mcp_reader.py: channel parameter can be None
  - Fix embedding_compute.py ContextProp type issue
  - Fix searcher_base.py method override signatures
  - Fix chunking_utils.py chunk_text assignment
  - Fix slack_rag.py and twitter_rag.py return types
  - Fix email.py and image_rag.py method overrides
* Fix multimodal benchmark scripts type errors
  - Fix undefined LeannRetriever -> LeannMultiVector
  - Add proper type casts for HuggingFace Dataset iteration
  - Cast task config values to correct types
  - Add type annotations for dataset row dicts
* Enable ty check for multimodal scripts in CI

  All type errors in multimodal scripts have been fixed, so we can now include them in the CI type checking.
* Fix all test type errors and enable ty check on tests
  - Fix test_basic.py: search() takes str not list
  - Fix test_cli_prompt_template.py: add type: ignore for Mock assignments
  - Fix test_prompt_template_persistence.py: match BaseSearcher.search signature
  - Fix test_prompt_template_e2e.py: add type narrowing asserts after skip
  - Fix test_readme_examples.py: use explicit kwargs instead of **model_args
  - Fix metadata_filter.py: allow Optional[MetadataFilters]
  - Update CI to run ty check on tests
* Format code with ruff
* Format searcher_base.py
Thanks for the contribution! 🎉
* Add prompt template feature to README

  Highlights performance optimization with task-specific prompt templates. Includes real-world benchmark data showing EmbeddingGemma 300M achieving 4-5x speed improvement over Qwen 600M while maintaining identical search quality.

  Per maintainer request to promote this feature in main README for better discoverability.
* Fix typo: --embedding-prompt-template -> --query-prompt-template

---------

Co-authored-by: Andy Lee <andylizf@outlook.com>
…#197) Thanks for the contribution! This is a nice improvement for better UX. 🎉
- Add Jina AI to the cloud providers table with (Embeddings) label
- Add tip section explaining how to use separate embedding provider with --embedding-api-base and --embedding-api-key flags
- Add LEANN_EMBEDDING_DEVICE env var for embedding model GPU selection
- Add LEANN_LLM_DEVICE env var for HFChat LLM GPU selection
- When specific GPU (e.g., cuda:1) is set, use it exclusively
- When set to "cuda" or unset, use device_map="auto" for multi-GPU
- Document env vars in README Common Parameters section
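This sketch shows one way the device rules above could map onto a Hugging Face `device_map`. `resolve_device_map` is a hypothetical helper. Treating any non-empty value other than `"cuda"` as a single target device is an assumption. In a `device_map` dict, the empty-string key places the entire model on one device.

```python
from __future__ import annotations

import os


def resolve_device_map(env_var: str = "LEANN_EMBEDDING_DEVICE") -> str | dict[str, str]:
    """Translate a device env var into a transformers-style device_map."""
    value = os.environ.get(env_var, "").strip()
    if value in ("", "cuda"):
        # Unset or generic "cuda": let accelerate spread layers across all visible GPUs.
        return "auto"
    # A specific device such as "cuda:1": the "" key places the whole model there.
    return {"": value}
```

The result could then be passed as `device_map=` to `AutoModelForCausalLM.from_pretrained(...)`. Note that `"auto"` requires the `accelerate` package.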
- Add batch_size parameter support in provider_options/embedding_options
- When user specifies batch_size, disable adaptive_optimization
- Keep default Qwen3-Embedding batch_size (32) as fallback
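Here is a minimal sketch of the batch-size precedence described above. `resolve_batch_settings` is hypothetical. It assumes adaptive optimization stays enabled when the user gives no explicit `batch_size`.

```python
from __future__ import annotations

from typing import Any

DEFAULT_BATCH_SIZE = 32  # Qwen3-Embedding default mentioned in the commit


def resolve_batch_settings(options: dict[str, Any] | None) -> tuple[int, bool]:
    """Return (batch_size, adaptive_optimization) from provider/embedding options."""
    options = options or {}
    user_batch = options.get("batch_size")
    if user_batch is None:
        # No explicit size: keep the default and let adaptive tuning run (assumed).
        return DEFAULT_BATCH_SIZE, True
    batch_size = int(user_batch)
    if batch_size < 1:
        raise ValueError(f"batch_size must be >= 1, got {batch_size}")
    # An explicit size is honored as-is, so adaptive optimization is turned off.
    return batch_size, False
```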
Content already exists in docs/configuration-guide.md. The section was too prominent for an advanced feature and cluttered the README structure.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive documentation for Claude Code instances working with this repository, including build commands, architecture overview, testing instructions, and key design patterns. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ensure ColQwenRAG always sets the processor when the model reloads on CPU due to memory constraints, preventing embed-time attribute errors. Co-authored-by: Cursor <cursoragent@cursor.com>
Fail fast with a clear error when transformers>=4.46 is installed, and delay colpali_engine imports until after the version check to avoid HybridCache import crashes. Co-authored-by: Cursor <cursoragent@cursor.com>
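Below is a minimal sketch of a fail-fast check with deferred imports. The helper names are hypothetical, the `colpali_engine.models` import path is assumed, and treating 4.46 as the first unsupported version follows the commit message. Versions are compared as integer tuples because a plain string comparison would rank `"4.100"` below `"4.46"`.

```python
from __future__ import annotations

import importlib.metadata
import re

UNSUPPORTED_FROM = (4, 46)  # first transformers release this sketch rejects


def _major_minor(version: str) -> tuple[int, int]:
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_transformers_version(version: str | None = None) -> None:
    """Raise a clear error if the installed transformers is too new."""
    if version is None:
        version = importlib.metadata.version("transformers")
    if _major_minor(version) >= UNSUPPORTED_FROM:
        raise RuntimeError(
            f"transformers {version} is not supported here; "
            "install a version below 4.46 (e.g. pip install 'transformers<4.46')"
        )


def load_colqwen_classes():
    check_transformers_version()
    # Imported only after the check, so an incompatible transformers fails with
    # the message above instead of a HybridCache import crash.
    from colpali_engine.models import ColQwen2, ColQwen2Processor

    return ColQwen2, ColQwen2Processor
```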
Clean up unused type suppression comments flagged by ty across apps and core packages. Co-authored-by: Cursor <cursoragent@cursor.com>
Normalize torch.compile call formatting after ruff. Co-authored-by: Cursor <cursoragent@cursor.com>
…eddings Fix/colqwen empty embeddings
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensure all index-* CLI commands accept embedding model/mode arguments to match their builder usage. Co-authored-by: Cursor <cursoragent@cursor.com>
…ive-ingestion-and-formats Comprehensive Data Ingestion & New Format Support
… suppressed search output (yichuan-w#242)

- Add `leann watch` CLI command that compares current files against the last Merkle tree checkpoint and reports added/removed/modified files with their associated chunk IDs.
- Integrate FileSynchronizer into `leann build` to create initial snapshots and persist sync config (sync_roots.json).
- Prepend line numbers to code file chunks (e.g. `42|def foo():`) so search results display exact line locations for code navigation.
- Trim partial first lines in code chunks caused by character-based overlap to ensure every chunk starts at a clean line boundary.
- Fix `suppress_cpp_output` swallowing Python print() along with C++ output by redirecting sys.stdout/sys.stderr to saved fd copies while OS-level fds go to /dev/null.
- Update README with watch command documentation in Quick Start, Usage Examples, and Complete CLI Reference sections.
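The `suppress_cpp_output` fix can be illustrated with file-descriptor redirection. Fds 1 and 2 are pointed at the null device so native code is silenced, while Python's `sys.stdout`/`sys.stderr` write to saved duplicates of the original fds. This is a minimal sketch based on one reading of the commit message, not the PR's code. `suppress_native_output` is a hypothetical name, and `os.write(1, ...)` stands in for C++ output.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def suppress_native_output():
    """Silence writes to OS-level fds 1/2 (e.g. from C++) but keep Python print()."""
    sys.stdout.flush()
    sys.stderr.flush()
    saved_out, saved_err = os.dup(1), os.dup(2)  # copies of the real stdout/stderr targets
    devnull = os.open(os.devnull, os.O_WRONLY)
    old_stdout, old_stderr = sys.stdout, sys.stderr
    # Python-level streams that write to the saved copies, not to fds 1/2.
    py_out = os.fdopen(saved_out, "w", closefd=False)
    py_err = os.fdopen(saved_err, "w", closefd=False)
    try:
        os.dup2(devnull, 1)  # native code writing to fd 1/2 now hits the null device
        os.dup2(devnull, 2)
        sys.stdout, sys.stderr = py_out, py_err
        yield
    finally:
        py_out.flush()
        py_err.flush()
        sys.stdout, sys.stderr = old_stdout, old_stderr
        os.dup2(saved_out, 1)  # restore the original fds
        os.dup2(saved_err, 2)
        py_out.close()  # closefd=False: the underlying fds are closed below
        py_err.close()
        for fd in (devnull, saved_out, saved_err):
            os.close(fd)


original_stdout = sys.stdout
with suppress_native_output():
    os.write(1, b"native noise\n")  # stands in for C++ output: discarded
    print("python output still visible")
restored = sys.stdout is original_stdout
```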
…yichuan-w#177)

- Fix enable_warmup: pop kwarg in _ensure_server_running and send dummy embedding request after server starts (was previously passed to start_server as unused kwarg)
- Remove redundant _ensure_server_running call in compute_query_embedding; caller (api.py) already ensures server is running before search
- Add retry with exponential backoff (0.5s, 1s, 2s) to ZMQ client in _compute_embedding_via_server, with proper socket cleanup between attempts
- Add SNDTIMEO (10s) and LINGER (0) to ZMQ sockets for clean failure
- Add timing instrumentation throughout searcher_base.py and embedding_server_manager.py for diagnosing startup and query latency
- Add tests/test_cold_start.py with 13 unit tests

https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
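Here is a minimal sketch of the `enable_warmup` change. The flag is popped from `kwargs` before the remaining arguments reach `start_server`, and if it was set, a dummy embedding request follows. The function signature, the `["warmup"]` payload, and the default of `False` are assumptions. The demo uses fake callables instead of a real server.

```python
from __future__ import annotations

from typing import Any, Callable


def ensure_server_running(
    start_server: Callable[..., int],
    send_embedding_request: Callable[[int, list[str]], Any],
    **kwargs: Any,
) -> int:
    # Pop the flag so start_server() never receives an argument it doesn't accept.
    enable_warmup = kwargs.pop("enable_warmup", False)
    port = start_server(**kwargs)
    if enable_warmup:
        # One throwaway request makes the server load the model now, so the
        # first real query doesn't pay that cost.
        send_embedding_request(port, ["warmup"])
    return port


calls: list[tuple] = []


def fake_start_server(**kwargs: Any) -> int:
    calls.append(("start", kwargs))
    return 5557


def fake_send(port: int, texts: list[str]) -> None:
    calls.append(("embed", port, texts))


port = ensure_server_running(fake_start_server, fake_send, model_name="some-model", enable_warmup=True)
```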
Replace all print("[leann] ...") calls with proper logger.info/warning calls so timing output respects LEANN_LOG_LEVEL and doesn't pollute stdout.

https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
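This sketch routes timing output through `logging` and takes the level from `LEANN_LOG_LEVEL`. How LEANN actually parses that variable, the `WARNING` default, and the `leann.timing` logger name are all assumptions. `logging.basicConfig` writes to stderr by default, so timing lines stay out of stdout.

```python
import contextlib
import logging
import os
import time

logger = logging.getLogger("leann.timing")  # hypothetical logger name


def resolve_log_level(default: int = logging.WARNING) -> int:
    """Read LEANN_LOG_LEVEL (e.g. "info", "DEBUG"); fall back to `default`."""
    name = os.environ.get("LEANN_LOG_LEVEL", "").strip().upper()
    level = logging.getLevelName(name) if name else default
    # getLevelName returns a "Level X" string for unknown names.
    return level if isinstance(level, int) else default


@contextlib.contextmanager
def log_elapsed(label: str):
    """Log how long the wrapped block took, at INFO level."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.3fs", label, time.perf_counter() - start)


logging.basicConfig(level=resolve_log_level())  # handler writes to stderr
with log_elapsed("server startup"):
    time.sleep(0.01)
```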
…stence check (yichuan-w#245)

- Remove erroneous first-line trimming in create_ast_chunks that stripped function signatures (e.g. `def hello():`) by assuming all chunks have line-number prefixes starting with digits
- Move line number prepending from before AST chunking to after, so the AST parser receives valid source code instead of `1|def hello():` which breaks syntax tree parsing and causes fallback to naive text splitting
- Fix index existence check in base_rag_example.py to look for the actual .meta.json file instead of just the directory (empty temp dirs always exist)

Co-authored-by: Cursor <cursoragent@cursor.com>
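The sketch below shows why line numbers must be added after AST chunking. It swaps in the standard-library `ast` module for the PR's chunker. It parses the untouched source, slices out each top-level function or class, and only then prefixes line numbers. Both helper names are hypothetical. A prefixed line such as `1|def hello():` fails to parse with a `SyntaxError`, which is what forced the fallback to naive text splitting.

```python
import ast


def prepend_line_numbers(text: str, start_line: int) -> str:
    """Prefix each line with its 1-based source line number: "42|def foo():"."""
    return "\n".join(f"{start_line + i}|{line}" for i, line in enumerate(text.split("\n")))


def ast_chunks_with_line_numbers(source: str) -> list[str]:
    """Chunk top-level defs/classes from raw source, then add line numbers."""
    tree = ast.parse(source)  # parse the untouched source, not the prefixed text
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Include decorators, which sit above node.lineno.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno or node.lineno
            chunks.append(prepend_line_numbers("\n".join(lines[start - 1 : end]), start))
    return chunks


source = "import os\n\ndef hello():\n    return 1\n\nclass A:\n    pass\n"
chunks = ast_chunks_with_line_numbers(source)
```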
- Add zmq_port is not None guard before server path (ty invalid-argument-type)
- Fix _TestSearcher.search override to match interface signature (ty invalid-method-override)
- Put logger.error/logger.warning on single lines (ruff format)

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
This PR is rather large: it has changed 78 files across 92 commits and has been open for 4 months. Any plans to merge it into main soon?
Closes yichuan-w#177, yichuan-w#182
What changed and why:
- Removed the redundant `_ensure_server_running` call inside `compute_query_embedding`. Previously, every query embedding triggered a second server health check even though the caller (`api.py`) had already called `_ensure_server_running`. This was the main source of cold-start latency: effectively a double server startup on the first query. `compute_query_embedding` now trusts the port it receives from the caller, eliminating the redundant check.
- Added retry with exponential backoff (0.5s, 1s, 2s) to `_compute_embedding_via_server`. The original code did a single ZMQ send/recv with no retry, so any transient connection failure (server still loading, socket not yet bound) was a hard crash. It now retries up to 3 times, with proper socket teardown between attempts.
- Set `zmq.SNDTIMEO=10s` and `zmq.LINGER=0`. Without SNDTIMEO, a dead server could cause `socket.send()` to hang indefinitely. Without LINGER=0, unsent messages left on a closed socket could block ZMQ context termination, stalling the retry loop.
- `enable_warmup` is now popped from kwargs instead of forwarded. It was being passed through to `start_server()`, which didn't understand it. It is now consumed by `_ensure_server_running`, which fires a dummy embedding request after the server starts. This loads the model into GPU memory before the first real query arrives.
- Replaced bare `print()` with `logging.getLogger(__name__)` in `searcher_base`. The original used bare `print("⚠️ ...")` for error reporting, which is invisible in production and breaks structured logging pipelines.
- Added timing instrumentation. `start_server`, `_start_new_server`, `_ensure_server_running`, and `compute_query_embedding` all log elapsed time, so you can see where cold-start time goes.
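The retry loop can be sketched independently of ZMQ. This is a minimal sketch, not the PR's code. `call_with_retry` and its parameters are hypothetical. It reads the three backoff delays (0.5s, 1s, 2s) as one initial try plus three retries. The demo uses a fake transport that fails twice before succeeding.

```python
from __future__ import annotations

import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_retry(
    attempt: Callable[[], T],
    delays: Sequence[float] = (0.5, 1.0, 2.0),
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `attempt`, retrying after each delay in `delays` (exponential backoff).

    Makes at most len(delays) + 1 calls and re-raises the last error.
    """
    for delay in [*delays, None]:
        try:
            return attempt()
        except retry_on:
            if delay is None:
                raise  # out of retries: surface the last failure
            sleep(delay)
    raise AssertionError("unreachable")


failures_left = 2
slept: list[float] = []


def flaky_request() -> str:
    global failures_left
    if failures_left:
        failures_left -= 1
        raise ConnectionError("server not ready")
    return "embedding"


result = call_with_retry(flaky_request, sleep=slept.append)
```

With pyzmq, `attempt` would create a fresh socket with `context.socket(zmq.REQ)`, set `socket.setsockopt(zmq.SNDTIMEO, 10_000)` (milliseconds) and `socket.setsockopt(zmq.LINGER, 0)`, and close the socket in a `finally` block, so each retry starts clean. A timed-out send raises `zmq.Again`, which would belong in `retry_on`.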