Summary
Add optional integration with trueno-rag (v0.1.3) to enhance the text/ module with RAG pipeline capabilities.
Motivation
aprender's text/ module currently provides:
- Tokenization (whitespace, word, char)
- Stop words filtering
- Porter stemming
trueno-rag would add document-level processing capabilities useful for ML pipelines.
Proposed Integration
[dependencies]
trueno-rag = { version = "0.1.3", optional = true }
[features]
rag = ["trueno-rag"]
Features to integrate
| Feature | Description | Use Case |
|---|---|---|
| 6 chunking strategies | Recursive, semantic, fixed, sentence, paragraph, markdown | Document preprocessing for training |
| Hybrid retrieval | Dense + BM25 | Training data retrieval |
| Reranking | Cross-encoder support | Result quality improvement |
| Metrics | Recall, MRR, NDCG | Retrieval evaluation |
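To make the Metrics row concrete, here is a minimal sketch of the three retrieval metrics using binary relevance. The function names, the `u32` document IDs, and the signatures are assumptions for illustration only; they are not trueno-rag's actual API.

```rust
use std::collections::HashSet;

/// Recall@k: fraction of all relevant documents found in the top-k results.
fn recall_at_k(ranked: &[u32], relevant: &HashSet<u32>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let mut hits = 0usize;
    for id in ranked.iter().take(k) {
        if relevant.contains(id) {
            hits += 1;
        }
    }
    hits as f64 / relevant.len() as f64
}

/// Reciprocal rank of the first relevant result (0.0 if none is found).
/// MRR is the mean of this value over a set of queries.
fn reciprocal_rank(ranked: &[u32], relevant: &HashSet<u32>) -> f64 {
    for (i, id) in ranked.iter().enumerate() {
        if relevant.contains(id) {
            return 1.0 / (i as f64 + 1.0);
        }
    }
    0.0
}

/// NDCG@k with binary relevance: DCG of the ranking divided by the DCG
/// of an ideal ranking that places all relevant documents first.
fn ndcg_at_k(ranked: &[u32], relevant: &HashSet<u32>, k: usize) -> f64 {
    let mut dcg = 0.0_f64;
    for (i, id) in ranked.iter().take(k).enumerate() {
        if relevant.contains(id) {
            dcg += 1.0 / (i as f64 + 2.0).log2();
        }
    }
    let ideal_hits = relevant.len().min(k);
    let mut idcg = 0.0_f64;
    for i in 0..ideal_hits {
        idcg += 1.0 / (i as f64 + 2.0).log2();
    }
    if idcg == 0.0 {
        0.0
    } else {
        dcg / idcg
    }
}
```

For example, with the ranking `[3, 1, 7, 5]` and relevant set `{1, 5, 9}`, Recall@3 is 1/3 (only document 1 appears in the top three) and the reciprocal rank is 0.5 (the first relevant hit is at rank 2).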
Example API
use aprender::text::DocumentChunker;

// Proposed constructor: chunk_size = 512, overlap = 64
let chunker = DocumentChunker::recursive(512, 64);
let chunks = chunker.chunk(&document);

// Use chunks for training data preparation
for chunk in chunks {
    let features = extract_features(&chunk);
    model.train(&features);
}
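To illustrate how `chunk_size` and `overlap` interact, the sketch below implements the simplest strategy from the table, fixed-size chunking, as a sliding window over characters. The `DocumentChunker` struct here is a hypothetical stand-in, not the proposed aprender type or trueno-rag's implementation, and measuring sizes in characters is an assumption; the recursive strategy would additionally split on a hierarchy of separators.

```rust
/// Hypothetical stand-in for the proposed chunker (illustration only).
struct DocumentChunker {
    chunk_size: usize,
    overlap: usize,
}

impl DocumentChunker {
    /// Fixed-size chunks measured in characters.
    /// `overlap` must be smaller than `chunk_size` so the window advances.
    fn fixed(chunk_size: usize, overlap: usize) -> Self {
        assert!(chunk_size > 0, "chunk_size must be positive");
        assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
        Self { chunk_size, overlap }
    }

    /// Slide a window of `chunk_size` characters over the document,
    /// advancing by `chunk_size - overlap` each step.
    fn chunk(&self, document: &str) -> Vec<String> {
        // Collect chars so that slicing never splits a multi-byte character.
        let chars: Vec<char> = document.chars().collect();
        let step = self.chunk_size - self.overlap;
        let mut chunks: Vec<String> = Vec::new();
        let mut start = 0;
        while start < chars.len() {
            let end = (start + self.chunk_size).min(chars.len());
            chunks.push(chars[start..end].iter().collect::<String>());
            if end == chars.len() {
                break; // the final window reached the end of the document
            }
            start += step;
        }
        chunks
    }
}
```

With `DocumentChunker::fixed(4, 1)`, the input `"abcdefghij"` yields `"abcd"`, `"defg"`, and `"ghij"`: each chunk repeats the last character of the previous one, which is the context-preserving effect the overlap parameter is meant to provide.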
Potential Use Cases
- Training data preparation - Chunk large documents for sequence models
- Semantic code search - Enhance CITL module with retrieval
- Model documentation search - Search model zoo documentation
Priority
MEDIUM - Enhances text processing capabilities for document-based ML.
References
docs/specifications/include-latest-trueno-features.md