Summary
Implement advanced chunking strategies to improve document segmentation and retrieval for RAG pipelines. These extend current strategies (fixed-size, sliding-window, paragraph, recursive-character) by leveraging hierarchical document structure and deep semantic cues.
Hierarchical Chunking
- Build multi-level chunk hierarchies: e.g., split on Markdown headings (h1/h2/h3) → sections → paragraphs → sentences.
- API: hierarchical_chunk(text, levels=['section', 'paragraph', 'sentence']) in src/chunker.py.
- Each chunk should include metadata: id, parent_id, level, start_char, end_char, token_count, source_path.
- CLI: select via --strategy hierarchical, with tunable levels and fallback to paragraph splitting.
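A minimal sketch of how the hierarchy and metadata described above could fit together. It is not the project's implementation: the Chunk dataclass, the regex separators, the extra source_path argument, and the whitespace-based token_count are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Hypothetical separators, one per level. "section" uses a zero-width match
# placed just before an h1/h2/h3 Markdown heading.
_SEPARATORS = {
    "section": re.compile(r"(?m)^(?=#{1,3} )"),
    "paragraph": re.compile(r"\n\s*\n"),
    "sentence": re.compile(r"(?<=[.!?])\s+"),
}


@dataclass
class Chunk:
    id: int
    parent_id: Optional[int]
    level: str
    start_char: int
    end_char: int
    token_count: int  # whitespace tokens, a stand-in for a real tokenizer
    source_path: Optional[str]
    text: str


def _spans(text: str, level: str) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of the non-empty pieces of text at this level."""
    bounds, pos = [], 0
    for match in _SEPARATORS[level].finditer(text):
        bounds.append((pos, match.start()))
        pos = match.end()
    bounds.append((pos, len(text)))

    spans = []
    for start, end in bounds:
        piece = text[start:end]
        stripped = piece.strip()
        if stripped:
            lead = len(piece) - len(piece.lstrip())
            spans.append((start + lead, start + lead + len(stripped)))
    return spans


def hierarchical_chunk(
    text: str,
    levels: Sequence[str] = ("section", "paragraph", "sentence"),
    source_path: Optional[str] = None,
) -> List[Chunk]:
    """Split text level by level, linking each chunk to its parent via parent_id."""
    chunks: List[Chunk] = []

    def recurse(start: int, end: int, depth: int, parent_id: Optional[int]) -> None:
        if depth == len(levels):
            return
        level = levels[depth]
        for rel_start, rel_end in _spans(text[start:end], level):
            abs_start, abs_end = start + rel_start, start + rel_end
            body = text[abs_start:abs_end]
            chunk = Chunk(
                id=len(chunks),
                parent_id=parent_id,
                level=level,
                start_char=abs_start,
                end_char=abs_end,
                token_count=len(body.split()),
                source_path=source_path,
                text=body,
            )
            chunks.append(chunk)
            recurse(abs_start, abs_end, depth + 1, chunk.id)

    recurse(0, len(text), 0, None)
    return chunks
```

In this sketch a document without headings yields a single section covering the whole text, so the paragraph level still applies underneath it. That is one possible reading of the paragraph fallback mentioned above.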
Semantic Similarity-based Splitting
- Detect topic boundaries via semantic embeddings (using sentence-transformers or LangChain embeddings).
- API: semantic_split(text, model='all-MiniLM-L6-v2', threshold=0.7) in src/chunker.py.
- Split at points where the similarity between neighboring sentences drops below the threshold (changepoint detection).
- CLI: select via --strategy semantic-embedding, with options for model and threshold.
- Compare results with the current recursive-character strategy (LangChain integration).
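The threshold rule above can be sketched as follows. To stay runnable without extra dependencies, this version takes a hypothetical embed callable instead of the model argument from the proposed signature and defaults to a toy bag-of-words embedding. In practice the callable would presumably wrap something like SentenceTransformer(model).encode from sentence-transformers; the toy embedder and the sentence regex are assumptions for illustration only.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Sequence


def _bow_embed(sentences: Sequence[str]) -> List[List[float]]:
    """Toy bag-of-words vectors; a placeholder for real sentence embeddings."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    vocab = {word: i for i, word in enumerate(sorted({w for toks in tokenized for w in toks}))}
    vectors = []
    for tokens in tokenized:
        vec = [0.0] * len(vocab)
        for word, count in Counter(tokens).items():
            vec[vocab[word]] = float(count)
        vectors.append(vec)
    return vectors


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_split(
    text: str,
    embed: Callable[[Sequence[str]], Sequence[Sequence[float]]] = _bow_embed,
    threshold: float = 0.7,
) -> List[str]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences

    vectors = embed(sentences)
    chunks, current = [], [sentences[0]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if _cosine(prev_vec, cur_vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Comparing only adjacent sentences is the simplest form of the idea. A windowed or smoothed similarity signal would be less sensitive to a single off-topic sentence, at the cost of more tuning.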
Implementation Plan
- Add new strategy registrations to STRATEGIES in src/chunker.py.
- Write modular functions for hierarchical and semantic chunking.
- Integrate into CLI and allow strategy selection.
- Add optional dependencies for sentence-transformers (rag-chunk[embeddings]).
- Add output and chunk metadata to improve traceability for hierarchical chunks.
- Extend existing tests in tests/ and add example files under examples/ for notebook demonstrations.
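One way the registration, CLI wiring, and optional dependency could fit together is sketched below. The shape of STRATEGIES (a dict from name to callable), the register_strategy decorator, the placeholder strategy bodies, the rag-chunk program name, and the --levels, --model, and --threshold flag names are all assumptions. Only the --strategy flag, the strategy names, and the rag-chunk[embeddings] extra come from the plan above.

```python
import argparse
from typing import Callable, Dict, List

# Assumed shape: strategy name -> callable returning a list of chunk texts.
STRATEGIES: Dict[str, Callable[..., List[str]]] = {}


def register_strategy(name: str):
    """Hypothetical decorator that adds a function to STRATEGIES."""
    def decorator(func):
        STRATEGIES[name] = func
        return func
    return decorator


@register_strategy("paragraph")
def paragraph_chunk(text: str, **_) -> List[str]:
    return [p.strip() for p in text.split("\n\n") if p.strip()]


@register_strategy("hierarchical")
def hierarchical_placeholder(text: str, **_) -> List[str]:
    # Placeholder body: stands in for the real hierarchical function and
    # simply falls back to paragraph splitting.
    return paragraph_chunk(text)


def load_embedder(model_name: str):
    """Import sentence-transformers lazily so it stays an optional dependency."""
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError as exc:
        raise SystemExit(
            "The semantic-embedding strategy needs the optional extra: "
            "pip install 'rag-chunk[embeddings]'"
        ) from exc
    return SentenceTransformer(model_name)


@register_strategy("semantic-embedding")
def semantic_placeholder(text: str, model: str = "all-MiniLM-L6-v2", **_) -> List[str]:
    load_embedder(model)  # fails with a clear message if the extra is missing
    return paragraph_chunk(text)  # placeholder: real splitting would go here


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="rag-chunk")
    parser.add_argument("--strategy", choices=sorted(STRATEGIES), default="paragraph")
    parser.add_argument("--levels", default="section,paragraph,sentence")
    parser.add_argument("--model", default="all-MiniLM-L6-v2")
    parser.add_argument("--threshold", type=float, default=0.7)
    return parser
```

Deferring the sentence-transformers import to call time keeps the base install light: users who never select the semantic strategy do not need the extra.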
Motivation
- Enable multi-resolution document retrieval and indexing.
- Improve semantic relevance and reduce context fragmentation for RAG applications.
Acceptance Criteria
- Hierarchical and semantic strategies available via CLI (--strategy flag)
- New functions in src/chunker.py with clear API signatures
- Tests and example notebooks for both strategies
- Benchmark/evaluation harness extended to new strategies
- Optional dependencies well-documented in README
Labels: enhancement, strategy, roadmap, semantic, hierarchical