feat: add LlamaIndex integration with hybrid retriever support #5
Open
raoabinav wants to merge 91 commits into
- Add ColQwenRAG class with easy-to-use CLI for multimodal PDF retrieval - Support for both ColQwen2 and ColPali models with automatic device selection - MPS optimization for Apple Silicon with memory-efficient loading - Complete pipeline: PDF→images→embeddings→HNSW index→search - Multi-vector indexing for fine-grained document matching - Comprehensive user guide and reproduction test script - Resolves yichuan-w#119: ColQwen Doc and Support Management Features: - python -m apps.colqwen_rag build --pdfs ./pdfs/ --index my_index - python -m apps.colqwen_rag search my_index "query text" - python -m apps.colqwen_rag ask my_index --interactive - Automatic CPU fallback for memory constraints - Robust error handling and progress tracking
- Add noqa comments for E402 errors (imports after sys.path modifications) - Remove unused variable assignment in colqwen_rag.py - Use importlib.util.find_spec for dependency checks instead of unused imports - Fix import ordering in test_colqwen_reproduction.py
- Add apps/image_rag.py for indexing and searching images using CLIP embeddings - Supports text-based image search queries - Uses CLIP ViT-L/14 model via sentence-transformers - Follows the same pattern as other RAG apps in the apps directory - Addresses feature request for CLIP support in apps (issue yichuan-w#94)
…ichuan-w#179)

Fixes yichuan-w#175

Problem: When --file-types .pdf is specified, PDFs were being processed twice:
1. Separately with PyMuPDF/pdfplumber extractors
2. Again in the 'other file types' section via SimpleDirectoryReader

This caused duplicate processing and potential conflicts.

Solution:
- Exclude .pdf from other_file_extensions when PDFs are already processed separately
- Only load other file types if there are extensions to process
- Prevents duplicate PDF processing

Changes:
- Added logic to filter out .pdf from code_extensions when loading other file types if PDFs were processed separately
- Updated SimpleDirectoryReader to use filtered extensions
- Added check to skip loading if no other extensions to process
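The filtering described in this commit can be sketched roughly as follows. The function name and arguments are hypothetical and not taken from the repository; the sketch only shows the idea of dropping `.pdf` once PDFs have been handled separately and skipping the generic load when nothing is left.

```python
def remaining_extensions(requested, pdfs_handled_separately):
    """Return the extensions the generic reader still has to load.

    Hypothetical helper: the real code filters its own extension list.
    """
    if not pdfs_handled_separately:
        return list(requested)
    # PDFs were already extracted by a dedicated extractor, so drop them here.
    return [ext for ext in requested if ext.lower() != ".pdf"]


requested = [".pdf", ".md", ".txt"]
others = remaining_extensions(requested, pdfs_handled_separately=True)
if others:
    print("load with the generic reader:", others)
else:
    print("no other extensions to process; skipping the generic reader")
```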
…r multi-vector… (yichuan-w#161) * Add timing instrumentation and multi-dataset support for multi-vector retrieval - Add timing measurements for search operations (load and core time) - Increase embedding batch size from 1 to 32 for better performance - Add explicit memory cleanup with del all_embeddings - Support loading and merging multiple datasets with different splits - Add CLI arguments for search method selection (ann/exact/exact-all) - Auto-detect image field names across different dataset structures - Print candidate doc counts for performance monitoring 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * update vidore * reproduce docvqa results * reproduce docvqa results and add debug file --------- Co-authored-by: Claude <noreply@anthropic.com>
…pport fo…" (yichuan-w#180) This reverts commit 00770ae.
* Add timing instrumentation and multi-dataset support for multi-vector retrieval - Add timing measurements for search operations (load and core time) - Increase embedding batch size from 1 to 32 for better performance - Add explicit memory cleanup with del all_embeddings - Support loading and merging multiple datasets with different splits - Add CLI arguments for search method selection (ann/exact/exact-all) - Auto-detect image field names across different dataset structures - Print candidate doc counts for performance monitoring 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * update vidore * reproduce docvqa results * reproduce docvqa results and add debug file * fix: format colqwen_forward.py to pass pre-commit checks --------- Co-authored-by: Claude <noreply@anthropic.com>
Reset faiss submodule to match main branch to avoid unnecessary changes
- Add ColQwen2.5 and ColQwen2_5_Processor imports - Implement smart model type detection for colqwen2, colqwen2.5, and colpali - Add task name aliases for easier benchmark invocation - Add safe model name handling for file paths and index naming - Support custom model paths including LoRA adapters - Improve model choice validation and error handling 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude <noreply@anthropic.com>
Add brief introduction and usage guide for ColQwen integration, similar to other RAG application sections in the README. - Quick start examples for building, searching, and interactive Q&A - Setup instructions with prerequisites - Model options (ColQwen2 vs ColPali) - Link to detailed ColQwen guide
Add COLQWEN_GUIDE.md to docs/ directory for proper documentation structure. This file is referenced in the README and needs to be tracked in git.
Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
* Add Anthropic LLM support Signed-off-by: droctothorpe <mythicalsunlight@gmail.com> * Update skypilot link Signed-off-by: droctothorpe <mythicalsunlight@gmail.com> * Handle anthropic base_url Signed-off-by: droctothorpe <mythicalsunlight@gmail.com> * Address ruff format finding Signed-off-by: droctothorpe <mythicalsunlight@gmail.com> --------- Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
yichuan-w#188) * Add custom folder support and improve image loading for multi-vector retrieval - Enhanced _load_images_from_dir with recursive search support and better error handling - Added support for WebP format and RGB conversion for all image modes - Added custom folder CLI arguments (--custom-folder, --recursive, --rebuild-index) - Improved documentation and removed completed TODO comment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Format code style in leann_multi_vector.py for better readability 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…ration add ColQwen multimodal PDF retrieval integration
…w#189) * Add custom folder support and improve image loading for multi-vector retrieval - Enhanced _load_images_from_dir with recursive search support and better error handling - Added support for WebP format and RGB conversion for all image modes - Added custom folder CLI arguments (--custom-folder, --recursive, --rebuild-index) - Improved documentation and removed completed TODO comment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Format code style in leann_multi_vector.py for better readability 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * docs: polish README performance tip section - Fix typo: 'matrilize' -> 'materialize' - Improve clarity and formatting of --no-recompute flag explanation - Add code block for better readability * format --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…huan-w#157) PR yichuan-w#157 changed create_text_chunks() to return list[dict] instead of list[str] to preserve metadata, but base_rag_example.py was not updated to handle the new format. This caused all chunks to fail validation with "All provided chunks are empty or invalid".
* Drop Python 3.9 support, require Python 3.10+ Python 3.9 reached end-of-life and the codebase uses PEP 604 union type syntax (str | None) which requires Python 3.10+. Changes: - Remove Python 3.9 from CI build matrix - Update requires-python to >=3.10 in all pyproject.toml files - Update classifiers to reflect supported Python versions (3.10-3.13) * Remove macos-13 from CI build matrix GitHub Actions deprecated macos-13 runner (brownout started Sept 2025, fully retired Dec 2025). See: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/ * Add macos-15-large for Intel Mac builds Replace deprecated macos-13 with macos-15-large (x86_64 Intel) to continue supporting Intel Mac users. * Set MACOSX_DEPLOYMENT_TARGET=13.x for Intel builds Intel Mac wheels (macos-15-large) now target macOS 13.0/13.3 for backward compatibility, allowing macOS 13/14/15 Intel users to install pre-built wheels. * Remove Intel Mac builds (macos-15-large requires paid plan) Intel Mac users can build from source. This avoids: - Paid GitHub Actions runners (macos-15-large) - Complex cross-compilation setup * Add macos-15-intel for Intel Mac builds (free runner) Use macos-15-intel (free standard runner) instead of macos-15-large (paid). This provides Intel Mac wheel support until Aug 2027. - MACOSX_DEPLOYMENT_TARGET=13.0 for backward compatibility - Replaces deprecated macos-13 runner * Add macOS 26 (beta) to build matrix Add macos-26 (arm64) runner to the build matrix for testing future macOS compatibility. This is currently a beta runner that helps ensure wheels work on upcoming macOS versions. * Fix macos-15-intel deployment target The macos-15-intel runner runs macOS 15.7, so Homebrew libraries are built for macOS 14+. Setting MACOSX_DEPLOYMENT_TARGET=13.0 causes delocate to fail because system libraries require newer macOS. Fix by setting deployment target to 15.0 for macos-15-intel, matching the actual OS version. Intel Mac users will need macOS 15+. 
* Exclude macos-15-intel + Python 3.13 (no PyTorch wheels available)
…uan-w#157) (yichuan-w#192) * Add ty type checker to CI and fix type errors - Add ty (Astral's fast Python type checker) to GitHub CI workflow - Fix type annotations across all RAG apps: - Update load_data return types from list[str] to list[dict[str, Any]] - Fix base_rag_example.py to properly handle dict format from create_text_chunks - Fix type errors in leann-core: - chunking_utils.py: Add explicit type annotations - cli.py: Fix return type annotations for PDF extraction functions - interactive_utils.py: Fix readline import type handling - Fix type errors in apps: - wechat_history.py: Fix return type annotations - document_rag.py, code_rag.py: Replace **kwargs with explicit arguments - Add ty configuration to pyproject.toml This resolves the bug introduced in PR yichuan-w#157 where create_text_chunks() changed to return list[dict] but callers were not updated. * Fix remaining ty type errors - Fix slack_mcp_reader.py channel parameter can be None - Fix embedding_compute.py ContextProp type issue - Fix searcher_base.py method override signatures - Fix chunking_utils.py chunk_text assignment - Fix slack_rag.py and twitter_rag.py return types - Fix email.py and image_rag.py method overrides * Fix multimodal benchmark scripts type errors - Fix undefined LeannRetriever -> LeannMultiVector - Add proper type casts for HuggingFace Dataset iteration - Cast task config values to correct types - Add type annotations for dataset row dicts * Enable ty check for multimodal scripts in CI All type errors in multimodal scripts have been fixed, so we can now include them in the CI type checking. 
* Fix all test type errors and enable ty check on tests - Fix test_basic.py: search() takes str not list - Fix test_cli_prompt_template.py: add type: ignore for Mock assignments - Fix test_prompt_template_persistence.py: match BaseSearcher.search signature - Fix test_prompt_template_e2e.py: add type narrowing asserts after skip - Fix test_readme_examples.py: use explicit kwargs instead of **model_args - Fix metadata_filter.py: allow Optional[MetadataFilters] - Update CI to run ty check on tests * Format code with ruff * Format searcher_base.py
Thanks for the contribution! 🎉
* Add prompt template feature to README Highlights performance optimization with task-specific prompt templates. Includes real-world benchmark data showing EmbeddingGemma 300M achieving 4-5x speed improvement over Qwen 600M while maintaining identical search quality. Per maintainer request to promote this feature in main README for better discoverability. * Fix typo: --embedding-prompt-template -> --query-prompt-template --------- Co-authored-by: Andy Lee <andylizf@outlook.com>
…#197) Thanks for the contribution! This is a nice improvement for better UX. 🎉
- Add Jina AI to the cloud providers table with (Embeddings) label - Add tip section explaining how to use separate embedding provider with --embedding-api-base and --embedding-api-key flags
- Add LEANN_EMBEDDING_DEVICE env var for embedding model GPU selection - Add LEANN_LLM_DEVICE env var for HFChat LLM GPU selection - When specific GPU (e.g., cuda:1) is set, use it exclusively - When set to "cuda" or unset, use device_map="auto" for multi-GPU - Document env vars in README Common Parameters section
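A minimal sketch of the device-selection rule in this commit, assuming a hypothetical `resolve_device` helper (the actual code may structure this differently). A specific device such as `cuda:1` is used exclusively; `cuda` or an unset variable falls back to `device_map="auto"`.

```python
import os


def resolve_device(env_var, environ=None):
    """Return (device, device_map) for the given environment variable.

    Hypothetical helper; only the env var names come from the commit message.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(env_var, "").strip()
    if value in ("", "cuda"):
        # Unset or generic "cuda": let the loader spread across all GPUs.
        return None, "auto"
    # Specific device such as "cuda:1": pin the model to it.
    return value, None


print(resolve_device("LEANN_LLM_DEVICE", {"LEANN_LLM_DEVICE": "cuda:1"}))
print(resolve_device("LEANN_EMBEDDING_DEVICE", {}))
```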
- Add batch_size parameter support in provider_options/embedding_options - When user specifies batch_size, disable adaptive_optimization - Keep default Qwen3-Embedding batch_size (32) as fallback
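The precedence described here could look roughly like the sketch below. `resolve_batching` is a hypothetical name, and the returned tuple is an illustration rather than the project's actual interface.

```python
DEFAULT_BATCH_SIZE = 32  # fallback named in the commit message


def resolve_batching(provider_options):
    """Return (batch_size, adaptive_optimization) from user options.

    Hypothetical helper: an explicit batch_size wins and turns off
    adaptive optimization; otherwise the default is kept.
    """
    options = provider_options or {}
    user_batch_size = options.get("batch_size")
    if user_batch_size is not None:
        return int(user_batch_size), False
    return DEFAULT_BATCH_SIZE, True


print(resolve_batching({"batch_size": 8}))
print(resolve_batching(None))
```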
Content already exists in docs/configuration-guide.md. The section was too prominent for an advanced feature and cluttered the README structure.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive documentation for Claude Code instances working with this repository, including build commands, architecture overview, testing instructions, and key design patterns. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ensure ColQwenRAG always sets the processor when the model reloads on CPU due to memory constraints, preventing embed-time attribute errors. Co-authored-by: Cursor <cursoragent@cursor.com>
Fail fast with a clear error when transformers>=4.46 is installed, and delay colpali_engine imports until after the version check to avoid HybridCache import crashes. Co-authored-by: Cursor <cursoragent@cursor.com>
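One way to sketch the fail-fast check with a deferred import is shown below. The helper names and the error text are assumptions; only the `transformers>=4.46` threshold and the idea of importing `colpali_engine` after the check come from the commit message.

```python
import importlib
import re
from importlib import metadata

FIRST_UNSUPPORTED = (4, 46)  # transformers>=4.46 is rejected


def major_minor(version):
    """Extract (major, minor) from a version string such as '4.45.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_transformers(installed=None):
    """Raise a clear error if the installed transformers is too new."""
    if installed is None:
        try:
            installed = metadata.version("transformers")
        except metadata.PackageNotFoundError:
            raise RuntimeError("transformers is not installed") from None
    if major_minor(installed) >= FIRST_UNSUPPORTED:
        raise RuntimeError(
            f"transformers {installed} is not supported; install a version below 4.46"
        )


def load_colpali_engine():
    """Import colpali_engine only after the version check has passed."""
    check_transformers()
    return importlib.import_module("colpali_engine")
```

Keeping the import inside `load_colpali_engine()` means an incompatible install surfaces as the explicit `RuntimeError` rather than as an import crash.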
Clean up unused type suppression comments flagged by ty across apps and core packages. Co-authored-by: Cursor <cursoragent@cursor.com>
Normalize torch.compile call formatting after ruff. Co-authored-by: Cursor <cursoragent@cursor.com>
…eddings Fix/colqwen empty embeddings
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensure all index-* CLI commands accept embedding model/mode arguments to match their builder usage. Co-authored-by: Cursor <cursoragent@cursor.com>
…ive-ingestion-and-formats Comprehensive Data Ingestion & New Format Support
… suppressed search output (yichuan-w#242)

- Add `leann watch` CLI command that compares current files against the last Merkle tree checkpoint and reports added/removed/modified files with their associated chunk IDs.
- Integrate FileSynchronizer into `leann build` to create initial snapshots and persist sync config (sync_roots.json).
- Prepend line numbers to code file chunks (e.g. `42|def foo():`) so search results display exact line locations for code navigation.
- Trim partial first lines in code chunks caused by character-based overlap to ensure every chunk starts at a clean line boundary.
- Fix `suppress_cpp_output` swallowing Python print() along with C++ output by redirecting sys.stdout/sys.stderr to saved fd copies while OS-level fds go to /dev/null.
- Update README with watch command documentation in Quick Start, Usage Examples, and Complete CLI Reference sections.
…stence check (yichuan-w#245)

- Remove erroneous first-line trimming in create_ast_chunks that stripped function signatures (e.g. `def hello():`) by assuming all chunks have line-number prefixes starting with digits
- Move line number prepending from before AST chunking to after, so the AST parser receives valid source code instead of `1|def hello():` which breaks syntax tree parsing and causes fallback to naive text splitting
- Fix index existence check in base_rag_example.py to look for the actual .meta.json file instead of just the directory (empty temp dirs always exist)

Co-authored-by: Cursor <cursoragent@cursor.com>
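The ordering issue can be illustrated with a small sketch. `add_line_numbers` is a hypothetical helper, and Python's built-in `ast` module stands in for whatever parser the AST chunker actually uses; the point is only that numbered text no longer parses, so numbering has to happen after chunking.

```python
import ast


def add_line_numbers(chunk_text, start_line):
    """Prefix each line of a chunk with its 1-based source line number."""
    lines = chunk_text.splitlines()
    return "\n".join(f"{start_line + i}|{line}" for i, line in enumerate(lines))


source = "import os\n\ndef hello():\n    return 'hi'\n"

# Numbering first breaks parsing: "1|import os" is not valid Python.
try:
    ast.parse(add_line_numbers(source, 1))
    parse_failed = False
except SyntaxError:
    parse_failed = True

# Chunk the clean source first (stand-in chunks as (text, start_line)),
# then add the prefixes for display.
chunks = [("import os", 1), ("def hello():\n    return 'hi'", 3)]
numbered_chunks = [add_line_numbers(text, start) for text, start in chunks]
for chunk in numbered_chunks:
    print(chunk)
```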
…an-w#217)

Exposes LEANN as a LlamaIndex BaseRetriever so it can plug into any RetrieverQueryEngine, agent pipeline, or existing LlamaIndex RAG workflow.

- LeannRetriever: wraps LeannSearcher as a LlamaIndex retriever
- LeannHybridRetriever: fuses vector + BM25 via weighted score fusion
- Returns NodeWithScore objects with full metadata preservation
- Supports async retrieval interface
- 11 tests covering conversion, delegation, fusion, and imports

https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
BM25Index.search() returns list[tuple[str, float]] but _fuse() accessed results as dicts (r["id"], r["score"]). This would crash with TypeError at runtime when hybrid search was used. https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
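The mismatch is easy to reproduce in isolation. The sample data below is made up; only the `list[tuple[str, float]]` shape comes from the commit message.

```python
# Shape returned by the BM25 search, per the commit message.
bm25_results: list[tuple[str, float]] = [("doc-1", 2.5), ("doc-7", 1.0)]

# Dict-style access on a tuple fails at runtime.
try:
    bm25_results[0]["id"]
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Tuple unpacking matches the actual return type.
scores = {doc_id: score for doc_id, score in bm25_results}
print(scores)
```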
When all vector or BM25 results have the same score, the min-max
normalization span is zero. Previously this defaulted to dividing by
1.0, giving all items a normalized score of 0.0 ("irrelevant"). Now
returns 0.5 ("equal relevance") which correctly represents the
uncertainty rather than penalizing uniform-score sources.
https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
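A minimal sketch of the zero-span case, assuming scores are held in a dict keyed by document ID (the function name is illustrative, not the one used in the repository):

```python
def min_max_normalize(scores):
    """Scale scores to [0, 1]; uniform scores all map to 0.5."""
    if not scores:
        return {}
    lowest = min(scores.values())
    highest = max(scores.values())
    span = highest - lowest
    if span == 0:
        # All scores are equal: report "equal relevance" instead of 0.0.
        return {doc_id: 0.5 for doc_id in scores}
    return {doc_id: (score - lowest) / span for doc_id, score in scores.items()}


print(min_max_normalize({"a": 2.0, "b": 2.0}))
print(min_max_normalize({"a": 1.0, "b": 3.0}))
```

With the earlier behaviour, the first call would have produced 0.0 for both documents, so a source with uniform scores contributed nothing to the fused ranking.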
LeannHybridRetriever had its own _fuse() method that re-implemented the same min-max normalization + weighted score combination already in hybrid.py (via LeannSearcher.search(sparse_score_ratio=...)). This meant fixes to the fusion logic (like the 0.5 normalization fix) had to be applied in two places.

Delete the duplicated _fuse(), _get_bm25(), and bm25_db_path plumbing. The retriever now delegates entirely to LeannSearcher.search() with sparse_score_ratio=bm25_weight, which handles BM25 index loading, overfetch, and fusion internally.

Also fixes the empty-text bug where BM25-only results (not in vector top-K) appeared as empty TextNodes because text_lookup.get(doc_id, "") returned "". LeannSearcher enriches all fused results from the passage store, so this is no longer possible.

https://claude.ai/code/session_01M6abMs1YzF6yhh13YerDPT
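To illustrate the weighted combination and the empty-text issue, here is a rough sketch. It is not the code in hybrid.py: the `fuse` function, the treatment of a missing score as 0.0, and the `passages` dict standing in for the passage store are all assumptions.

```python
def fuse(vector_scores, bm25_scores, sparse_score_ratio, passages):
    """Combine already-normalized scores and attach text for every result.

    Assumption: a document missing from one source contributes 0.0 there.
    """
    weight = sparse_score_ratio
    fused = []
    for doc_id in set(vector_scores) | set(bm25_scores):
        score = (1 - weight) * vector_scores.get(doc_id, 0.0) \
            + weight * bm25_scores.get(doc_id, 0.0)
        # Text comes from the passage store, so BM25-only hits are never empty.
        fused.append({"id": doc_id, "score": score, "text": passages[doc_id]})
    fused.sort(key=lambda result: result["score"], reverse=True)
    return fused


passages = {"a": "alpha text", "b": "beta text", "c": "gamma text"}
results = fuse({"a": 1.0, "b": 0.0}, {"b": 1.0, "c": 0.5}, 0.25, passages)
for result in results:
    print(result)
```

Here document `c` appears only in the BM25 results but still carries its text, which is the case the duplicated `_fuse()` handled incorrectly.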
Closes yichuan-w#217
What changed and why:
- `LeannRetriever(BaseRetriever)`: drops LEANN into any LlamaIndex RAG pipeline. Implements `_retrieve()` and `_aretrieve()` (async just delegates to sync since LEANN's search is CPU-bound, not I/O-bound). Takes `index_path`, `top_k`, `complexity` as constructor args, instantiates `LeannSearcher` internally.
- `LeannHybridRetriever(BaseRetriever)`: vector + BM25 through LlamaIndex. Delegates to `LeannSearcher.search(sparse_score_ratio=...)`, reusing the hybrid search infrastructure from PR 2. If PR 2 isn't merged yet, this still works: `sparse_score_ratio` is ignored when no FTS5 index exists.
- `_results_to_nodes()` maps LEANN's `SearchResult` objects (which use `.text`, `.id`, `.score`, `.metadata`) to LlamaIndex's `NodeWithScore(TextNode(...))`. Metadata is passed through as-is so LlamaIndex's postprocessors can filter on it.
- `leann/integrations/` subpackage to establish a pattern for future framework integrations (LangChain, Haystack, etc.) without polluting the top-level namespace.
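The result-to-node mapping and the sync/async delegation can be sketched without LlamaIndex or LEANN installed. Everything below is a stand-in: the dataclasses only mirror the field names mentioned above, and `FakeSearcher` and `SketchRetriever` are hypothetical, not the classes added in this PR.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SearchResult:  # stand-in for LEANN's result object
    id: str
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


@dataclass
class TextNode:  # stand-in for LlamaIndex's TextNode
    id_: str
    text: str
    metadata: dict


@dataclass
class NodeWithScore:  # stand-in for LlamaIndex's NodeWithScore
    node: TextNode
    score: float


def results_to_nodes(results):
    """Map search results to scored nodes, passing metadata through as-is."""
    return [
        NodeWithScore(
            node=TextNode(id_=r.id, text=r.text, metadata=r.metadata),
            score=r.score,
        )
        for r in results
    ]


class FakeSearcher:  # hypothetical searcher used only for this sketch
    def search(self, query, top_k=5):
        hits = [SearchResult("doc-1", f"match for {query}", 0.9, {"source": "a.md"})]
        return hits[:top_k]


class SketchRetriever:
    def __init__(self, searcher, top_k=5):
        self._searcher = searcher
        self._top_k = top_k

    def retrieve(self, query):
        return results_to_nodes(self._searcher.search(query, top_k=self._top_k))

    async def aretrieve(self, query):
        # Search is CPU-bound, so the async path simply calls the sync one.
        return self.retrieve(query)


nodes = asyncio.run(SketchRetriever(FakeSearcher()).aretrieve("hybrid search"))
print(nodes[0].node.text, nodes[0].score)
```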