Description
What/Why
What problems are you trying to solve?
We're implementing analyzer-based neural sparse query support in the neural-search plugin, which requires BERT tokenizer functionality. (RFC link).
The BERT tokenizer is widely used by OpenSearch pretrained text embedding models. Customers can use the analyzer to debug their input. In addition, the chunking processor depends on analyzers to count tokens.
However, we've encountered an issue: the native libraries DJL depends on cannot be loaded by two different classloaders. Since ml-commons already depends on DJL and handles native library loading, we need to implement the BERT tokenizer functionality within ml-commons to avoid native library loading conflicts.
What are you proposing?
Build the bert-base-uncased and google-bert/bert-base-multilingual-uncased tokenizers as OpenSearch built-in analyzers in ml-commons. The implementation will leverage DJL's HuggingFaceTokenizer, which is already a dependency of ml-commons, and will serve as the foundation for analyzer-based neural sparse queries in the neural-search plugin.
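To make this concrete, here is a minimal sketch (the class name and structure are illustrative, not the final design) of wrapping DJL's HuggingFaceTokenizer in a Lucene Tokenizer. A real implementation would also need to handle offsets, position attributes, and special tokens such as [CLS]/[SEP]:

```java
import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

// Illustrative only: emits the WordPiece tokens produced by a DJL tokenizer as Lucene terms.
public final class BertWordPieceTokenizer extends Tokenizer {

    private final HuggingFaceTokenizer tokenizer;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private String[] tokens;
    private int index;

    public BertWordPieceTokenizer(HuggingFaceTokenizer tokenizer) {
        this.tokenizer = tokenizer;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (tokens == null) {
            // Read the whole field value and run it through the DJL tokenizer once.
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int read;
            while ((read = input.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            tokens = tokenizer.encode(text.toString()).getTokens();
        }
        if (index >= tokens.length) {
            return false;
        }
        clearAttributes();
        termAtt.setEmpty().append(tokens[index++]);
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        tokens = null;
        index = 0;
    }
}
```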
Implementations
The main open question is how we load the tokenizer config file (example) and the IDF file, which stores the query token weights for neural sparse (example). Here are several options:
Option 1: Load from the Java resources dir
pros: no remote connection, most efficient
cons: increases the size of the ml-commons jar file
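A minimal sketch of this option, assuming we bundle the tokenizer.json under a resources path of our choosing (the path below is hypothetical, and the InputStream overload of newInstance should be verified against the DJL version we ship):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class ResourceTokenizerLoader {

    // Hypothetical resources path; the actual layout is to be decided.
    private static final String CONFIG_PATH = "/analyzer/bert-base-uncased/tokenizer.json";

    public static HuggingFaceTokenizer load() throws IOException {
        try (InputStream is = ResourceTokenizerLoader.class.getResourceAsStream(CONFIG_PATH)) {
            // DJL can build a tokenizer directly from the tokenizer.json stream.
            return HuggingFaceTokenizer.newInstance(is, Map.of());
        }
    }
}
```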
Option 2: Fetch from Hugging Face
pros: easy to implement, better extensibility for custom tokenizers
cons: security risk if a malicious tokenizer config is hosted on HF; cannot work if the remote site is unavailable
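For reference, this option is essentially a one-liner with DJL: the model id is resolved against the Hugging Face hub at runtime and the downloaded files are cached locally (wrapper class below is illustrative):

```java
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class HubTokenizerLoader {

    public static HuggingFaceTokenizer load(String modelId) {
        // Downloads the tokenizer files from the Hugging Face hub on first use,
        // then reuses the local cache.
        return HuggingFaceTokenizer.newInstance(modelId);
    }
}
```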
Option 3: Fetch from OpenSearch artifacts website
Fetch from https://artifacts.opensearch.org/; the model file is identical to the sparse tokenizer files.
pros: doesn't increase the jar size; the file is managed by OpenSearch
cons: need to hard-code the mapping from analyzer name to model artifact URL somewhere
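A sketch of what that hard-coded mapping could look like (the analyzer names and artifact URLs below are placeholders, not real paths; the IDF file would be fetched the same way):

```java
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class ArtifactTokenizerLoader {

    // Hypothetical analyzer-name -> artifact-URL map; the real URLs are to be decided.
    private static final Map<String, String> ANALYZER_ARTIFACTS = Map.of(
        "bert-uncased", "https://artifacts.opensearch.org/.../tokenizer.json",
        "bert-multilingual-uncased", "https://artifacts.opensearch.org/.../tokenizer.json"
    );

    public static HuggingFaceTokenizer load(String analyzerName) throws Exception {
        String url = ANALYZER_ARTIFACTS.get(analyzerName);
        // A real implementation would cache the download instead of using a temp file.
        Path local = Files.createTempFile("tokenizer-", ".json");
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
        }
        return HuggingFaceTokenizer.newInstance(local);
    }
}
```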
Note that for all options we can load tokenizers lazily: a tokenizer is loaded only when it is actually invoked, and never more than once.
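The lazy, load-once behavior can be a thin registry around computeIfAbsent; any of the loading options above can be plugged in as the loader function (a sketch):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class LazyTokenizerRegistry {

    private final Map<String, HuggingFaceTokenizer> cache = new ConcurrentHashMap<>();
    private final Function<String, HuggingFaceTokenizer> loader;

    public LazyTokenizerRegistry(Function<String, HuggingFaceTokenizer> loader) {
        this.loader = loader;
    }

    // Each tokenizer is loaded at most once, on first use, and shared afterwards.
    public HuggingFaceTokenizer get(String analyzerName) {
        return cache.computeIfAbsent(analyzerName, loader);
    }
}
```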
Open discussion
We could also make the HF analyzer configurable by letting users provide an HF model id, but this needs further review by security teams. Custom analyzers are out of scope for now, but we do want to collect feedback on this.