Advanced Strategies: Hierarchical chunking, semantic similarity-based splitting #4

@messkan

Summary

Implement advanced chunking strategies to improve document segmentation and retrieval for RAG pipelines. These extend current strategies (fixed-size, sliding-window, paragraph, recursive-character) by leveraging hierarchical document structure and deep semantic cues.

Hierarchical Chunking

  • Build multi-level chunk hierarchies: e.g., split on Markdown headings (h1/h2/h3) into sections, then paragraphs, then sentences.
  • API: hierarchical_chunk(text, levels=['section', 'paragraph', 'sentence']) in src/chunker.py.
  • Each chunk should include metadata: id, parent_id, level, start_char, end_char, token_count, source_path.
  • CLI: select via --strategy hierarchical, with tunable levels and fallback to paragraph splitting.
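As a rough illustration of the shape this could take, here is a minimal standard-library sketch. It is not the project's implementation: the Chunk dataclass, the SEPARATORS table, the regexes, the source_path keyword, and the whitespace word count used for token_count are all assumptions made for the example. Only the function name, the levels argument, and the metadata field names come from the list above.

```python
import re
from dataclasses import dataclass
from itertools import count
from typing import List, Optional, Sequence


@dataclass
class Chunk:
    """Metadata record with the fields listed in this issue (hypothetical layout)."""
    id: int
    parent_id: Optional[int]
    level: str
    start_char: int
    end_char: int
    token_count: int
    source_path: Optional[str]


# Assumed separators: a zero-width cut before h1-h3 headings, blank lines
# between paragraphs, and whitespace after sentence-ending punctuation.
SEPARATORS = {
    "section": re.compile(r"^(?=#{1,3} )", re.MULTILINE),
    "paragraph": re.compile(r"\n\s*\n"),
    "sentence": re.compile(r"(?<=[.!?])\s+"),
}


def _trim(text: str, start: int, end: int):
    """Shrink a span so it excludes leading and trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


def hierarchical_chunk(
    text: str,
    levels: Sequence[str] = ("section", "paragraph", "sentence"),
    source_path: Optional[str] = None,
) -> List[Chunk]:
    chunks: List[Chunk] = []
    ids = count()

    def split(start: int, end: int, depth: int, parent_id: Optional[int]) -> None:
        if depth == len(levels):
            return
        level = levels[depth]
        cuts = [m.span() for m in SEPARATORS[level].finditer(text, start, end)]
        pos = start
        for sep_start, sep_end in cuts + [(end, end)]:
            s, e = _trim(text, pos, sep_start)
            pos = sep_end
            if s == e:
                continue  # nothing but whitespace between two separators
            chunk = Chunk(
                id=next(ids),
                parent_id=parent_id,
                level=level,
                start_char=s,
                end_char=e,
                # Whitespace word count as a stand-in for a real tokenizer.
                token_count=len(text[s:e].split()),
                source_path=source_path,
            )
            chunks.append(chunk)
            split(s, e, depth + 1, chunk.id)

    split(0, len(text), 0, None)
    return chunks
```

In this sketch a document without headings becomes a single section, so paragraph and sentence splitting still apply beneath it. Heading lines are kept inside their section and surface as their own paragraph chunk; a real implementation might store them as a title field instead.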

Semantic Similarity-based Splitting

  • Detect topic boundaries via semantic embeddings (using sentence-transformers or LangChain embeddings).
  • API: semantic_split(text, model='all-MiniLM-L6-v2', threshold=0.7) in src/chunker.py.
  • Split at points where similarity with neighboring sentences drops below the threshold (changepoint detection).
  • CLI: select via --strategy semantic-embedding, with options for model and threshold.
  • Compare results with current recursive-character strategy (LangChain integration).
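One possible reading of the threshold rule, sketched below, compares each sentence only with the one directly before it and starts a new chunk when their cosine similarity falls under the threshold. This is a simplification of full changepoint detection, not the proposed implementation. The embed parameter is a hypothetical addition that lets the splitting logic run without the optional dependency; when it is omitted, the sketch assumes sentence-transformers is installed and uses SentenceTransformer(model).encode.

```python
import math
import re
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def _load_encoder(model: str) -> Callable[[List[str]], Sequence[Vector]]:
    # Imported lazily so the dependency stays optional.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model).encode


def semantic_split(
    text: str,
    model: str = "all-MiniLM-L6-v2",
    threshold: float = 0.7,
    embed: Optional[Callable[[List[str]], Sequence[Vector]]] = None,  # hypothetical hook
) -> List[str]:
    # Naive sentence segmentation on ., ! and ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences

    vectors = (embed or _load_encoder(model))(sentences)

    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if _cosine(prev_vec, cur_vec) < threshold:
            groups.append([sentence])  # similarity dropped: start a new chunk
        else:
            groups[-1].append(sentence)
    return [" ".join(group) for group in groups]
```

Passing a cheap embed callable makes the boundary logic easy to unit-test. Comparing against a window of neighbors rather than a single predecessor would likely be less sensitive to one off-topic sentence.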

Implementation Plan

  • Add new strategy registrations to STRATEGIES in src/chunker.py.
  • Write modular functions for hierarchical and semantic chunking.
  • Integrate into CLI and allow strategy selection.
  • Add optional dependencies for sentence-transformers (rag-chunk[embeddings]).
  • Include chunk metadata in the output so hierarchical chunks can be traced back to their parent chunks and source positions.
  • Extend existing tests in tests/ and add example notebooks under examples/.
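The registration and CLI steps above could look roughly like the following. The actual structure of STRATEGIES in src/chunker.py is not shown in this issue, so the plain dict, the register decorator, the stub strategy bodies, the program name, and the --levels, --model and --threshold flag names are all assumptions for illustration. Only --strategy and the two strategy names come from the text.

```python
import argparse
from typing import Callable, Dict, List, Sequence

# Assumed shape: strategy name -> callable taking text plus keyword options.
STRATEGIES: Dict[str, Callable[..., List[str]]] = {}


def register(name: str):
    """Hypothetical decorator that adds a function to STRATEGIES."""
    def decorator(func):
        STRATEGIES[name] = func
        return func
    return decorator


@register("paragraph")
def paragraph_chunk(text: str, **_) -> List[str]:
    return [p.strip() for p in text.split("\n\n") if p.strip()]


@register("hierarchical")
def hierarchical_stub(text: str, levels=("section", "paragraph", "sentence"), **_):
    # Placeholder body: stands in for the real hierarchical_chunk.
    return paragraph_chunk(text)


@register("semantic-embedding")
def semantic_stub(text: str, model="all-MiniLM-L6-v2", threshold=0.7, **_):
    # Placeholder body: stands in for the real semantic_split.
    return paragraph_chunk(text)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="rag-chunk")  # program name assumed
    parser.add_argument("--strategy", choices=sorted(STRATEGIES), default="paragraph")
    parser.add_argument("--levels", default="section,paragraph,sentence")
    parser.add_argument("--model", default="all-MiniLM-L6-v2")
    parser.add_argument("--threshold", type=float, default=0.7)
    return parser


def run(argv: Sequence[str], text: str) -> List[str]:
    args = build_parser().parse_args(argv)
    return STRATEGIES[args.strategy](
        text,
        levels=tuple(args.levels.split(",")),
        model=args.model,
        threshold=args.threshold,
    )
```

Building the parser after registration means the --strategy choices always reflect whatever is in the registry, so adding a strategy does not require touching the CLI code.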

Motivation

  • Enable multi-resolution document retrieval and indexing.
  • Improve semantic relevance and reduce context fragmentation for RAG applications.

Acceptance Criteria

  • Hierarchical and semantic strategies available via CLI (--strategy flag)
  • New functions in src/chunker.py with clear API signatures
  • Tests and example notebooks for both strategies
  • Benchmark/evaluation harness extended to new strategies
  • Optional dependencies well-documented in README

Labels: enhancement, strategy, roadmap, semantic, hierarchical
