Advanced Retrieval-Augmented Generation system with contextualized embeddings, smart batching, and reranking for scientific research papers.
A production-ready Model Context Protocol (MCP) server with a fully contextualized retrieval pipeline, optimized for scientific literature.
Note: The legacy "hybrid mode" (Voyage-3-large embeddings) has been removed. All search paths now use contextualized embeddings + BM25 fusion; function names are preserved for compatibility.
- Contextualized Search (v1.7.0): Powered by Voyage-Context-3 (32k context window) for superior understanding of document structure.
- Smart Batching: Robust handling of massive documents (700k+ tokens) with automatic batching and timeout management.
- Professional TUI: New `ragdoc-menu.py` interface with arrow navigation and real-time indexing feedback.
- Evaluation System: Comprehensive RAG metrics (Recall, Precision, MRR, NDCG) with automated benchmarking.
- Cohere Reranking: v3.5 for intelligent result ranking.
- MCP Integration: Native integration with Claude Desktop and compatible applications.
- Incremental Indexing: MD5-based change detection for efficient updates.
- Installation
- Configuration
- Usage
- Evaluation & Quality Metrics
- Architecture
- Troubleshooting
- Performance
- Contributing
- Python 3.10 or higher
- API Keys: Voyage AI, Cohere (optional)
- 4GB+ RAM recommended
```bash
# 1. Clone the repository
git clone https://github.com/tofunori/Ragdoc.git
cd Ragdoc
# 2. Create virtual environment
python -m venv ragdoc-env
# Windows
ragdoc-env\Scripts\activate
# macOS/Linux
source ragdoc-env/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Configure API keys (see Configuration section)
```

Windows (PowerShell):

```powershell
# Create virtual environment
python -m venv ragdoc-env
.\ragdoc-env\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
# Set environment variables
$env:VOYAGE_API_KEY = "your_voyage_api_key"
$env:COHERE_API_KEY = "your_cohere_api_key"
```

macOS/Linux:

```bash
# Create virtual environment
python3 -m venv ragdoc-env
source ragdoc-env/bin/activate
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export VOYAGE_API_KEY="your_voyage_api_key"
export COHERE_API_KEY="your_cohere_api_key"
```

Alternatively, create a `.env` file in the project root (copy from `.env.example`):

```
VOYAGE_API_KEY=your_voyage_api_key
COHERE_API_KEY=your_cohere_api_key
```

- Voyage AI (required)
  - Sign up: https://voyageai.com/
  - Model used: voyage-context-3 (32k context)
  - Cost: ~$0.06 per 1M tokens (contextualized)
- Cohere (optional, for reranking)
  - Sign up: https://cohere.com/
  - Model used: rerank-v3.5
  - Free tier available
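To confirm the keys are actually visible to the process, a quick sanity check (assuming python-dotenv is installed; RAGDOC itself may load its environment differently):

```python
# Sanity check: are both API keys visible to Python?
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current working directory

for key in ("VOYAGE_API_KEY", "COHERE_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")
```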
- Install Claude Desktop: https://claude.ai/download
- Configure MCP server in Claude settings:
  - Windows: `%APPDATA%\Claude\claude_desktop_config.json`
  - macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
```json
{
"mcpServers": {
"ragdoc": {
"command": "python",
"args": ["src/server.py"],
"cwd": "/path/to/Ragdoc"
}
}
}
```

Configuration files are located in `config/`:
- `models.yaml` - Embedding and reranking models
- `chunking.yaml` - Chunking pipeline settings
- `database.yaml` - ChromaDB and HNSW parameters
See config/README.md for detailed documentation.
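As a rough illustration of how these files can be consumed, a minimal PyYAML sketch (the actual keys and loading code are documented in `config/README.md`):

```python
# Illustrative only: the real schema is described in config/README.md.
import yaml  # assumes PyYAML is installed

with open("config/models.yaml", encoding="utf-8") as f:
    models_cfg = yaml.safe_load(f)

print(models_cfg)  # dict of embedding/reranking settings
```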
Once configured, use directly in Claude conversations:
Search for information about glacier albedo
Find articles about ice mass measurement techniques
What are the remote sensing methods for albedo analysis?
- `semantic_search_hybrid(query, top_k=10, alpha=0.5)` - Contextualized search (BM25 + contextualized embeddings) with reranking
- `search_by_source(query, sources, top_k=10, alpha=0.5)` - Search limited to specific documents
- `list_documents()` - List all indexed documents
- `get_document_content(source, format="markdown", max_length=None)` - Retrieve complete document content
- `get_chunk_with_context(chunk_id, context_size=2, highlight=True)` - Show chunk with surrounding context
- `get_indexation_status()` - Database statistics
```python
# Contextualized search (BM25 + contextualized embeddings) - alpha=0.5 is balanced fusion (default)
semantic_search_hybrid("black carbon impact on glacier albedo", top_k=10, alpha=0.5)
# Adjust semantic/lexical weight (alpha=0.5 = equal weight)
semantic_search_hybrid("remote sensing albedo measurement", alpha=0.5)
# Search in specific documents only
search_by_source("glacier albedo", sources=["1982_RGSP.md"])
search_by_source("ice mass balance", sources=["Warren_1982.md", "Painter_2009.md"], top_k=5)
# Get document list
list_documents()
```

```python
# Read complete document in markdown format
get_document_content("1982_RGSP.md", format="markdown")
# Read document as plain text with length limit
get_document_content("1982_RGSP.md", format="text", max_length=5000)
# View document as individual chunks with metadata
get_document_content("1982_RGSP.md", format="chunks")# Show chunk with 2 surrounding chunks on each side (default)
get_chunk_with_context("1982_RGSP_chunk_042", context_size=2, highlight=True)
# Show more context (5 chunks before and after)
get_chunk_with_context("1982_RGSP_chunk_042", context_size=5)
# Show context without highlighting
get_chunk_with_context("1982_RGSP_chunk_042", context_size=3, highlight=False)# Get database statistics
get_indexation_status()RAGDOC includes a comprehensive evaluation system to measure and optimize retrieval quality:
```bash
# Quick Start: Generate test dataset and evaluate
python scripts/generate_test_dataset.py --n_queries 30
python tests/evaluate_ragdoc.py
# View results
cat tests/results/evaluation_report_latest.md
```

Metrics Measured:
- Recall@K: What % of relevant documents are found in top-K results?
- Precision@K: What % of top-K results are relevant?
- MRR (Mean Reciprocal Rank): How early does first relevant result appear?
- NDCG@K: How well are results ranked?
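These metrics follow their standard information-retrieval definitions. A minimal sketch of Recall@K and reciprocal rank for a single query (illustrative only; see `tests/evaluate_ragdoc.py` for the real evaluator):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result (MRR is the mean over queries)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# First relevant hit at rank 2 -> reciprocal rank = 0.5
print(reciprocal_rank(["d7", "d3", "d9"], {"d3"}))
```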
Typical RAGDOC Performance:
- Recall@10: 96-97% (Outstanding)
- MRR: 91-92% (First result usually relevant)
- NDCG@10: 92-93% (Excellent ranking quality)
Configuration Tuning:
```bash
# Test different alpha values (BM25 vs Semantic weight)
python tests/evaluate_ragdoc.py --alpha 0.3 0.5 0.7 1.0
# Custom dataset
python tests/evaluate_ragdoc.py --dataset tests/test_datasets/my_queries.json
```

Output Files:
- `evaluation_report_latest.md` - Comparison report
- `evaluation_detailed_latest.json` - Full results
- `evaluation_aggregate_latest.csv` - Metrics table
See docs/EVALUATION_GUIDE.md for complete documentation.
```bash
# 1. Add markdown files to articles_markdown/
cp your_paper.md articles_markdown/
# 2. Run the Menu
python ragdoc-menu.py
# Select "Indexation Incrémentale" (Incremental Indexing)
```
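Incremental indexing relies on the MD5-based change detection mentioned in the features above: only new or modified files are re-embedded. A minimal sketch of the idea (illustrative, not RAGDOC's actual code):

```python
# Sketch of MD5-based change detection for incremental indexing.
# Illustrative only; RAGDOC stores a doc_hash in each chunk's metadata.
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """MD5 of the file's raw bytes, used as a cheap change fingerprint."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def needs_reindex(path: Path, stored_hashes: dict[str, str]) -> bool:
    """True if the file is new or its content changed since last indexation."""
    return stored_hashes.get(path.name) != file_md5(path)
```

The full query pipeline that serves searches is shown below.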
```
Query
  ↓
┌──────────────────────────────┐
│ BM25 Search (rank-bm25)      │ → Top 100 candidates (lexical)
│ Voyage-Context-3 Semantic    │ → Top 100 candidates (semantic)
└──────────────────────────────┘
  ↓
┌──────────────────────────────┐
│ Reciprocal Rank Fusion       │ → Top 50 merged results
│ (Weighted RRF)               │
└──────────────────────────────┘
  ↓
┌──────────────────────────────┐
│ Cohere v3.5 Reranking        │ → Top 10 final results
└──────────────────────────────┘
  ↓
┌──────────────────────────────┐
│ Context Window Expansion     │ → Results with adjacent chunks
└──────────────────────────────┘
```
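The fusion stage applies the standard weighted Reciprocal Rank Fusion formula, where each list contributes weight / (k + rank) per document. A minimal sketch, assuming the conventional k = 60 constant (illustrative, not RAGDOC's exact implementation):

```python
def weighted_rrf(bm25: list[str], semantic: list[str],
                 alpha: float = 0.5, k: int = 60) -> list[str]:
    """Fuse two ranked lists of doc ids; alpha weights the semantic list."""
    scores: dict[str, float] = {}
    for weight, ranking in ((1 - alpha, bm25), (alpha, semantic)):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for every doc it ranks
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# alpha=0.5 weighs lexical and semantic candidates equally
print(weighted_rrf(["a", "b", "c"], ["b", "c", "a"])[:2])
```

Here alpha plays the same role as in `semantic_search_hybrid`: 0 is pure BM25, 1 is pure semantic.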
- rank-bm25: BM25 Okapi for lexical search
- Voyage AI: voyage-context-3 embeddings (1024 dimensions, 32k context)
- ChromaDB 0.5.0+: HNSW-optimized vector database
- Cohere v3.5: Intelligent result reranking
- FastMCP: High-performance MCP server
- Rich & Questionary: Professional TUI
- 100+ research papers on glaciology and climate science
- 24,884+ chunks with contextualized indexing
- Rich metadata (source, chunk_index, total_chunks, doc_hash, indexed_date); see the example record below
- Continuous updates with incremental indexing
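For reference, a hypothetical example of one chunk's metadata record (field names come from the list above; the values are invented for illustration):

```python
# Hypothetical example record; values are invented for illustration.
chunk_metadata = {
    "source": "1982_RGSP.md",
    "chunk_index": 42,
    "total_chunks": 180,
    "doc_hash": "d41d8cd98f00b204e9800998ecf8427e",  # MD5 of the source file
    "indexed_date": "2025-01-15",
}
```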
`ERROR: VOYAGE_API_KEY not found`

Solution: Check environment variables or your `.env` file configuration.
`ModuleNotFoundError: No module named 'fastmcp'`

Solution: Reactivate the virtual environment and reinstall:
```bash
source ragdoc-env/bin/activate  # macOS/Linux
# or
.\ragdoc-env\Scripts\activate # Windows
pip install -r requirements.txt
```

`Collection empty or not found`
Solution: Run indexation:
```bash
python ragdoc-menu.py
```

Performance tips:

- Check internet connection (Voyage AI embeddings require API calls)
- Enable GPU if available (CUDA)
- Reduce number of results in searches
- Use local ChromaDB server for faster access
- Logs: Check console output for detailed errors
- Status: Use `get_indexation_status()` for diagnostics
- Reset: Delete `chroma_db_new/` and reindex if necessary
- Search: 2-3s for contextualized + BM25 fusion + reranking (10 results)
- Indexing: ~2min/document with contextualized embeddings
- Index size: ~25k chunks indexed and validated
Contributions are welcome! To contribute:
- Fork the project
- Create a feature branch
- Add your documents to `articles_markdown/`
- Run indexation: `python ragdoc-menu.py`
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Chonkie for advanced chunking
- Powered by Voyage AI embeddings
- Enhanced with Cohere reranking
- Integrated with Claude Desktop via MCP
Developed for the scientific research community 🔬
For questions or issues, please open an issue on GitHub.