Request to Modify Code to Enable TEXT_SPLITTER_EMBEDDING_MODEL Customization through Configuration File #27

I am looking to create a Chinese RAG demo service using RetrievalAugmentedGeneration. 

However, the model used by SentenceTransformersTokenTextSplitter in RetrievalAugmentedGeneration/common/utils.py is hardcoded to 'intfloat/e5-large-v2'. The tokenizer for this model produces a large number of [UNK] tokens when it processes Chinese text, so the splitter cannot chunk Chinese documents properly.
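One way to check the [UNK] behavior is to tokenize a sample string and measure the share of unknown tokens. This is only a rough sketch: `unk_ratio` and `tokenize_with_model` are hypothetical helper names, and the second one assumes the `transformers` package is installed and the model can be downloaded.

```python
from typing import List


def unk_ratio(tokens: List[str], unk_token: str = "[UNK]") -> float:
    """Return the fraction of tokens equal to the unknown-token marker."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t == unk_token) / len(tokens)


def tokenize_with_model(text: str, model_name: str = "intfloat/e5-large-v2") -> List[str]:
    """Tokenize text with the tokenizer of the given Hugging Face model.

    Imported lazily so unk_ratio can be used without transformers installed.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer.tokenize(text)


# Example with a hand-written token list (not real tokenizer output):
sample_tokens = ["[UNK]", "rag", "[UNK]", "demo"]
sample_ratio = unk_ratio(sample_tokens)
```

Running `unk_ratio(tokenize_with_model(some_chinese_text))` for the current model and for a candidate replacement would give a simple comparison.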

I would like to be able to set the text splitter model in config.yaml, the same way the embedding model can already be set there.
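A minimal sketch of how this could look, not the project's actual code. It assumes the parsed config.yaml is available as a dictionary and uses a hypothetical `text_splitter.model_name` key; the function names and the chunk sizes are placeholders too. If the key is missing, it falls back to the current hardcoded value.

```python
from typing import Any, Mapping, Optional

# Current hardcoded value, kept as the fallback.
DEFAULT_SPLITTER_MODEL = "intfloat/e5-large-v2"


def resolve_splitter_model(config: Optional[Mapping[str, Any]] = None) -> str:
    """Pick the splitter model from a parsed config, else use the default.

    The "text_splitter" / "model_name" keys are hypothetical names.
    """
    section = (config or {}).get("text_splitter") or {}
    return section.get("model_name") or DEFAULT_SPLITTER_MODEL


def build_text_splitter(
    config: Optional[Mapping[str, Any]] = None,
    tokens_per_chunk: int = 256,  # placeholder value
    chunk_overlap: int = 32,      # placeholder value
):
    """Create the splitter with the configured model (requires langchain)."""
    from langchain.text_splitter import SentenceTransformersTokenTextSplitter

    return SentenceTransformersTokenTextSplitter(
        model_name=resolve_splitter_model(config),
        tokens_per_chunk=tokens_per_chunk,
        chunk_overlap=chunk_overlap,
    )


# Example: a config that selects a multilingual model (illustrative choice).
example_config = {"text_splitter": {"model_name": "intfloat/multilingual-e5-large"}}
chosen_model = resolve_splitter_model(example_config)
```

With no `text_splitter` section in the config, behavior stays the same as today, so existing deployments would not be affected.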

Thank you for your assistance and support.
<img width="534" alt="image" src="https://github.com/NVIDIA/GenerativeAIExamples/assets/153881661/60d3ed3c-177d-4dc6-a1df-aeb15314f7a5">

