Description
I am looking to create a Chinese RAG demo service using RetrievalAugmentedGeneration.
However, the model used by SentenceTransformersTokenTextSplitter in RetrievalAugmentedGeneration/common/utils.py is hardcoded as 'intfloat/e5-large-v2'. This model produces a large number of [UNK] tokens when processing Chinese text.
I would like to be able to choose the model used by the text splitter, similar to how the embedding model can be specified through the config.yaml file.
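
To illustrate the request, here is a minimal sketch of how the splitter model could be read from configuration instead of being hardcoded. The `text_splitter` section, its keys, and the names `TextSplitterConfig`, `load_splitter_config`, and `get_text_splitter` are hypothetical and do not exist in the repository today. The sketch also assumes that LangChain's `SentenceTransformersTokenTextSplitter` is the class used in utils.py and that the parsed config.yaml is available as a plain dict.

```python
from dataclasses import dataclass
from typing import Optional

# The value currently hardcoded in common/utils.py, per this issue.
DEFAULT_SPLITTER_MODEL = "intfloat/e5-large-v2"


@dataclass
class TextSplitterConfig:
    """Hypothetical config section for the text splitter."""
    model_name: str = DEFAULT_SPLITTER_MODEL
    tokens_per_chunk: Optional[int] = None  # None lets the splitter use the model's limit
    chunk_overlap: int = 50


def load_splitter_config(raw: dict) -> TextSplitterConfig:
    """Build the splitter config from a parsed config.yaml dict.

    Falls back to the current default model when the (hypothetical)
    `text_splitter` section or any of its keys is missing.
    """
    section = raw.get("text_splitter") or {}
    return TextSplitterConfig(
        model_name=section.get("model_name", DEFAULT_SPLITTER_MODEL),
        tokens_per_chunk=section.get("tokens_per_chunk"),
        chunk_overlap=section.get("chunk_overlap", 50),
    )


def get_text_splitter(cfg: TextSplitterConfig):
    """Create the splitter with the configured model (downloads the model on first use)."""
    # Imported lazily so the config helpers work without LangChain installed.
    # Newer LangChain releases expose the same class from `langchain_text_splitters`.
    from langchain.text_splitter import SentenceTransformersTokenTextSplitter

    return SentenceTransformersTokenTextSplitter(
        model_name=cfg.model_name,
        tokens_per_chunk=cfg.tokens_per_chunk,
        chunk_overlap=cfg.chunk_overlap,
    )
```

With something like this in place, a Chinese deployment could set `text_splitter.model_name` to a multilingual model, for example `intfloat/multilingual-e5-large`, while existing setups would keep the current default.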