Integrate trueno-rag for enhanced text/document ML #125

@noahgift

Summary

Add optional integration with trueno-rag (v0.1.3) to enhance the text/ module with RAG pipeline capabilities.

Motivation

aprender's text/ module currently provides:

  • Tokenization (whitespace, word, char)
  • Stop-word filtering
  • Porter stemming

trueno-rag would add document-level processing (chunking, hybrid retrieval, reranking, and retrieval metrics) that is useful for ML pipelines.

Proposed Integration

[dependencies]
trueno-rag = { version = "0.1.3", optional = true }
[features]
rag = ["dep:trueno-rag"]

Features to integrate

| Feature | Description | Use Case |
|---|---|---|
| 6 chunking strategies | Recursive, semantic, fixed, sentence, paragraph, markdown | Document preprocessing for training |
| Hybrid retrieval | Dense + BM25 | Training data retrieval |
| Reranking | Cross-encoder support | Result quality improvement |
| Metrics | Recall, MRR, NDCG | Retrieval evaluation |
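
To make the evaluation metrics in the table concrete, here is a minimal std-only sketch of Recall@k, reciprocal rank (the per-query term of MRR), and NDCG@k with binary relevance. The function names and the use of `u32` document ids are assumptions for illustration; they are not taken from trueno-rag, whose actual metric API may look different.

```rust
use std::collections::HashSet;

/// Recall@k: fraction of the relevant documents that appear in the top-k results.
fn recall_at_k(ranked: &[u32], relevant: &HashSet<u32>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = ranked
        .iter()
        .take(k)
        .filter(|id| relevant.contains(*id))
        .count();
    hits as f64 / relevant.len() as f64
}

/// Reciprocal rank for one query: 1 / (1-based rank of the first relevant hit),
/// or 0.0 if nothing relevant was retrieved. MRR is the mean over all queries.
fn reciprocal_rank(ranked: &[u32], relevant: &HashSet<u32>) -> f64 {
    ranked
        .iter()
        .position(|id| relevant.contains(id))
        .map_or(0.0, |pos| 1.0 / (pos as f64 + 1.0))
}

/// NDCG@k with binary relevance: DCG of this ranking divided by the DCG of an
/// ideal ranking that places every relevant document first.
fn ndcg_at_k(ranked: &[u32], relevant: &HashSet<u32>, k: usize) -> f64 {
    let dcg: f64 = ranked
        .iter()
        .take(k)
        .enumerate()
        .filter(|(_, id)| relevant.contains(*id))
        .map(|(i, _)| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    let ideal_hits = relevant.len().min(k);
    let idcg: f64 = (0..ideal_hits)
        .map(|i| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    if idcg == 0.0 {
        0.0
    } else {
        dcg / idcg
    }
}

fn main() {
    // Hypothetical retrieval result and ground truth for a single query.
    let ranked = [7u32, 3, 9, 1];
    let relevant: HashSet<u32> = [3u32, 1].iter().copied().collect();

    println!("recall@3 = {:.3}", recall_at_k(&ranked, &relevant, 3));
    println!("RR       = {:.3}", reciprocal_rank(&ranked, &relevant));
    println!("NDCG@4   = {:.3}", ndcg_at_k(&ranked, &relevant, 4));
}
```

Binary relevance keeps the sketch short; graded relevance would replace the `1.0` gain with a per-document score.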

Example API

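use aprender::text::DocumentChunker;

// Proposed API (not yet implemented). Rust has no named arguments,
// so the values are passed positionally: chunk_size = 512, overlap = 64.
let chunker = DocumentChunker::recursive(512, 64);
let chunks = chunker.chunk(&document);

// Use chunks for training data preparation.
// `document`, `extract_features`, and `model` are placeholders supplied by the caller.
for chunk in chunks {
    let features = extract_features(&chunk);
    model.train(&features);
}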
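
To illustrate what `chunk_size` and `overlap` could mean in practice, the following is a minimal sketch of fixed-size chunking with overlap, the simplest of the strategies listed above (not the recursive strategy used in the example API). `chunk_fixed` is a hypothetical helper written for this issue, and measuring sizes in characters is an assumption; trueno-rag may count tokens or bytes instead.

```rust
/// Split `text` into chunks of at most `chunk_size` characters, where each chunk
/// starts with the last `overlap` characters of the previous one.
/// Hypothetical helper for illustration; not part of aprender or trueno-rag.
fn chunk_fixed(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    // Work on chars rather than bytes so multi-byte UTF-8 text is never split
    // in the middle of a character.
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the final chunk reached the end of the text
        }
        start += step;
    }
    chunks
}

fn main() {
    let document = "abcdefghij";
    for chunk in chunk_fixed(document, 4, 1) {
        println!("{}", chunk);
    }
}
```

The overlap means a sentence that straddles a chunk boundary still appears partly in both neighbors, at the cost of some duplicated training text.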

Potential Use Cases

  1. Training data preparation - Chunk large documents for sequence models
  2. Semantic code search - Enhance CITL module with retrieval
  3. Model documentation search - Search model zoo documentation
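
For the search-oriented use cases above, hybrid retrieval has to merge a dense ranking with a BM25 ranking. One common way to do that is reciprocal rank fusion (RRF). Whether trueno-rag fuses results this way is not stated in this issue, so the sketch below is an assumption; the function name, the `u32` ids, and the constant `k = 60.0` are illustrative choices.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of document ids with reciprocal rank fusion:
/// score(d) = sum over lists of 1 / (k + rank of d in that list), rank starting at 1.
/// Hypothetical helper for illustration; not part of aprender or trueno-rag.
fn reciprocal_rank_fusion(rankings: &[Vec<u32>], k: f64) -> Vec<(u32, f64)> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for ranking in rankings {
        for (i, doc_id) in ranking.iter().enumerate() {
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + i as f64 + 1.0);
        }
    }

    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    // Highest score first; ties are broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}

fn main() {
    let dense = vec![2, 1, 3]; // hypothetical ranking from an embedding index
    let bm25 = vec![1, 4, 2]; // hypothetical ranking from BM25

    for (doc_id, score) in reciprocal_rank_fusion(&[dense, bm25], 60.0) {
        println!("doc {}: {:.5}", doc_id, score);
    }
}
```

RRF only needs ranks, not raw scores, which avoids having to normalize cosine similarities against BM25 scores.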

Priority

MEDIUM - Enhances text processing capabilities for document-based ML.
