Advanced Strategies: Hierarchical chunking, semantic similarity-based splitting

## Summary
Implement advanced chunking strategies to improve document segmentation and retrieval for RAG pipelines. These extend the current strategies (fixed-size, sliding-window, paragraph, recursive-character) by using hierarchical document structure and embedding-based semantic cues.

### Hierarchical Chunking
- Build multi-level chunk hierarchies: e.g., split on Markdown headings (h1/h2/h3) into sections, then paragraphs, then sentences.
- API: `hierarchical_chunk(text, levels=['section', 'paragraph', 'sentence'])` in `src/chunker.py`.
- Each chunk should include metadata: `id`, `parent_id`, `level`, `start_char`, `end_char`, `token_count`, `source_path`.
- CLI: select via `--strategy hierarchical`, with tunable levels and fallback to paragraph splitting.
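
A minimal sketch of how the hierarchy and metadata described above could fit together. It is not the project's implementation: the `Chunk` dataclass, integer ids, the `source_path` parameter, the regex-based splitters, and whitespace-based `token_count` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Chunk:
    id: int                    # assumed: sequential integer ids
    parent_id: Optional[int]   # None for top-level chunks
    level: str
    start_char: int
    end_char: int
    token_count: int           # assumed: whitespace-delimited tokens
    source_path: Optional[str]


_HEADING = re.compile(r"^#{1,3} ", re.MULTILINE)       # h1/h2/h3 start a section
_SEPARATORS = {
    "paragraph": re.compile(r"\n\s*\n"),               # blank line
    "sentence": re.compile(r"(?<=[.!?])\s+"),          # naive sentence boundary
}


def _spans(text: str, start: int, end: int, level: str) -> List[Tuple[int, int]]:
    """Return trimmed, non-empty (start, end) spans for `level` inside [start, end)."""
    if level == "section":
        cuts = [m.start() for m in _HEADING.finditer(text, start, end)]
        bounds = sorted({start, *cuts, end})
        raw = list(zip(bounds, bounds[1:]))
    else:
        raw, pos = [], start
        for m in _SEPARATORS[level].finditer(text, start, end):
            raw.append((pos, m.start()))
            pos = m.end()
        raw.append((pos, end))
    spans = []
    for s, e in raw:
        while s < e and text[s].isspace():
            s += 1
        while e > s and text[e - 1].isspace():
            e -= 1
        if s < e:
            spans.append((s, e))
    return spans


def hierarchical_chunk(
    text: str,
    levels: Sequence[str] = ("section", "paragraph", "sentence"),
    source_path: Optional[str] = None,
) -> List[Chunk]:
    """Split `text` level by level, linking each chunk to its parent."""
    chunks: List[Chunk] = []

    def walk(start: int, end: int, depth: int, parent_id: Optional[int]) -> None:
        if depth == len(levels):
            return
        for s, e in _spans(text, start, end, levels[depth]):
            chunk = Chunk(len(chunks), parent_id, levels[depth], s, e,
                          len(text[s:e].split()), source_path)
            chunks.append(chunk)
            walk(s, e, depth + 1, chunk.id)

    walk(0, len(text), 0, None)
    return chunks
```

Because every chunk stores character offsets into the original text, `text[c.start_char:c.end_char]` recovers its content, and `parent_id` lets a retriever expand a matched sentence to its enclosing paragraph or section. Note that this sketch would also treat `#` lines inside fenced code blocks as headings.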

### Semantic Similarity-based Splitting
- Detect topic boundaries via semantic embeddings (using sentence-transformers or LangChain embeddings).
- API: `semantic_split(text, model='all-MiniLM-L6-v2', threshold=0.7)` in `src/chunker.py`.
- Split at points where similarity with neighboring sentences drops below threshold (changepoint detection).
- CLI: select via `--strategy semantic-embedding`, with options for model and threshold.
- Compare results with the current recursive-character strategy (LangChain integration).
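
The threshold rule above can be sketched as follows. This is an illustration under stated assumptions, not the final design: it compares only adjacent sentences (a simple stand-in for fuller changepoint detection), uses a naive regex sentence splitter, and adds a hypothetical `embed` parameter so the embedding backend can be swapped or stubbed. When `embed` is omitted it falls back to `SentenceTransformer(model).encode(...)` from the optional `sentence-transformers` package.

```python
import math
import re
from typing import Callable, List, Optional, Sequence

# Hypothetical hook: any callable mapping sentences to one vector per sentence.
Embedder = Callable[[List[str]], Sequence[Sequence[float]]]


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def _default_embedder(model: str) -> Embedder:
    # Imported lazily so the core package works without the optional extra.
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer(model)
    return lambda sentences: encoder.encode(sentences)


def semantic_split(
    text: str,
    model: str = "all-MiniLM-L6-v2",
    threshold: float = 0.7,
    embed: Optional[Embedder] = None,
) -> List[str]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    embed = embed or _default_embedder(model)
    vectors = embed(sentences)

    chunks: List[str] = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        if _cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append(" ".join(current))   # topic boundary: close the chunk
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

Injecting the embedder keeps unit tests fast and offline, since a stub can replace the model download. A production version would likely smooth similarities over a window of sentences rather than a single neighbor.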

## Implementation Plan
- Add new strategy registrations to `STRATEGIES` in `src/chunker.py`.
- Write modular functions for hierarchical and semantic chunking.
- Integrate into CLI and allow strategy selection.
- Add optional dependencies for `sentence-transformers` (rag-chunk[embeddings]).
- Add output and chunk metadata to improve traceability for hierarchical chunks.
- Extend existing tests in tests/ and add example files under examples/ for notebook demonstrations.
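
One possible shape for the registration and CLI steps in this plan is sketched below. The real structure of `STRATEGIES` in `src/chunker.py` may differ; here it is assumed to be a dict from strategy name to callable. The `register` decorator, the placeholder strategy bodies, and the `--levels`, `--model`, and `--threshold` flags are hypothetical. Only `--strategy` and the `rag-chunk[embeddings]` extra come from this issue.

```python
import argparse
import importlib.util
from typing import Callable, Dict, List

# Assumed shape: strategy name -> callable taking text and returning chunks.
STRATEGIES: Dict[str, Callable[..., List[str]]] = {}


def register(name: str):
    """Decorator that adds a chunking function to STRATEGIES under `name`."""
    def decorator(func):
        STRATEGIES[name] = func
        return func
    return decorator


@register("paragraph")
def paragraph_chunk(text: str, **_options) -> List[str]:
    """Stand-in for the existing paragraph strategy (also the fallback)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


@register("hierarchical")
def hierarchical_strategy(text: str, **options) -> List[str]:
    # Placeholder: a real version would delegate to hierarchical_chunk().
    return paragraph_chunk(text)


@register("semantic-embedding")
def semantic_strategy(text: str, **options) -> List[str]:
    # Fail with an actionable message when the optional extra is missing.
    if importlib.util.find_spec("sentence_transformers") is None:
        raise RuntimeError(
            "Semantic splitting needs the optional extra: "
            "pip install 'rag-chunk[embeddings]'"
        )
    # Placeholder: a real version would delegate to semantic_split().
    return paragraph_chunk(text)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--strategy", choices=sorted(STRATEGIES), default="paragraph")
    # The three flags below are hypothetical names for the tunable options.
    parser.add_argument("--levels", default="section,paragraph,sentence")
    parser.add_argument("--model", default="all-MiniLM-L6-v2")
    parser.add_argument("--threshold", type=float, default=0.7)
    return parser
```

Deriving `choices` from the registry means a newly registered strategy becomes selectable from the CLI without further parser changes.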

## Motivation
- Enable multi-resolution document retrieval and indexing.
- Improve semantic relevance and reduce context fragmentation for RAG applications.

## Acceptance Criteria
- Hierarchical and semantic strategies available via CLI (`--strategy` flag)
- New functions in `src/chunker.py` with clear API signatures
- Tests and example notebooks for both strategies
- Benchmark/evaluation harness extended to new strategies
- Optional dependencies well-documented in README

## References
- See roadmap in [README.md](https://github.com/messkan/rag-chunk/blob/main/README.md)
- Related code: [src/chunker.py](https://github.com/messkan/rag-chunk/blob/main/src/chunker.py), [tests](https://github.com/messkan/rag-chunk/tree/main/tests), [examples](https://github.com/messkan/rag-chunk/tree/main/examples)

---
Labels: `enhancement`, `strategy`, `roadmap`, `semantic`, `hierarchical`
