[RFC] Add BERT Tokenizer as OpenSearch built-in analyzer in ml-commons #3708
Comments
Options 2 and 3 both use a remote source. Does our API have a place to configure the URL, or do we hard-code it?
Since we're building a built-in analyzer, there are no configurable parameters, so we need to hard-code the map somewhere.
It takes around 900 KB for English BERT and 4 MB for multilingual BERT.
Both options 2 and 3 need network access to a remote source, so we would be adding another dependency. Considering that 4 MB is small compared to the existing ML plugin (300+ MB), assembling it into the jar file may not be a big concern in terms of file size.
4 MB for multilingual BERT is large for option 1. For a node, loading all the OpenSearch plugins already takes up a lot of disk, but option 1 offers better security. Is this multilingual BERT tokenizer the only option? Is there any way to package the essential tokenizers in resources and fetch additional tokenizers from a remote source when needed?
Do you think it makes sense to package the EN/multilingual tokenizers in resources for the first release, with follow-up work on loading from a remote source once we release more tokenizers or users cut issues asking to reduce the artifact size? And how about packing it into a zip so we can cut the total space roughly in half?
@mingshl I've updated the PR to load from the zip artifact. The file size is reduced to roughly half.
That's much better for option 1. I am thinking more about option 3. The main concern with option 2 is that it fetches from a Hugging Face link, so it's hard to maintain security because we don't control the files. Option 3 is definitely better than option 2, and it's a common way that we load local models. Check out this config file, which we host on opensearch.org; we can do the same for the BERT tokenizer: https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-distilroberta-v1/1.0.1/torch_script/config.json I think the mapping is not a big effort; we have a similar design for the neural field feature. @bzhangam is working on the neural field feature, and he will also have some mapping stored somewhere.
This use case is not the same as our semantic field use case, where we only need to add some simple extra config to each model's config. But the neural plugin will need to use it to support the sparse model for the existing neural sparse query and the semantic field use case with a sparse model, especially to support the one-click semantic feature with a sparse model. For this change I think it's better to use option 3 to avoid increasing the file size. The lazy loading is a really good idea that I think we should implement. As for the con about hard-coding the mapping, I don't think it's a big concern: even if you ship the file in the ml-commons plugin, you still need to maintain a mapping to decide which file each analyzer/tokenizer should use.
I've changed to option 3 @mingshl @bzhangam. Since the multilingual sparse tokenizer is not released yet, I will create a follow-up PR to add the mbert-uncased analyzer. The PR to add the sparse tokenizer pipeline: opensearch-project/opensearch-py-ml#459
What/Why
What problems are you trying to solve?
We're implementing analyzer-based neural sparse query support in the neural-search plugin, which requires BERT tokenizer functionality. (RFC link).
The BERT tokenizer is widely used by OpenSearch pretrained text embedding models. Customers can use the analyzer to debug their input. In addition, the chunking processor also depends on analyzers to count tokens.
However, we've encountered an issue where native libraries (i.e., DJL's) cannot be loaded by two different classloaders. Since ml-commons already depends on DJL and handles native library loading, we need to implement the BERT tokenizer functionality within ml-commons to avoid native library loading conflicts.
What are you proposing?
Build the `bert-base-uncased` and `google-bert/bert-base-multilingual-uncased` tokenizers as OpenSearch built-in analyzers in ml-commons. The implementation will leverage DJL's HuggingFaceTokenizer, which is already a dependency in ml-commons. This will serve as the foundation for analyzer-based neural sparse queries in the neural-search plugin.
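As a rough, hedged illustration of the core building block (not the final ml-commons code), the sketch below feeds text through DJL's `HuggingFaceTokenizer` loaded from a local `tokenizer.json`; the file path and class name are placeholders:

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

import java.nio.file.Path;
import java.util.Arrays;

public class BertTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder path; options 1-3 below only differ in how this file is obtained.
        Path tokenizerJson = Path.of("/tmp/bert-base-uncased/tokenizer.json");
        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance(tokenizerJson)) {
            Encoding encoding = tokenizer.encode("OpenSearch neural sparse search");
            // Word-piece tokens and their vocabulary ids, e.g. [CLS], open, ##sea, ##rch, ...
            System.out.println(Arrays.toString(encoding.getTokens()));
            System.out.println(Arrays.toString(encoding.getIds()));
        }
    }
}
```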
Implementations
The main open question is how we load the tokenizer config file (example) and the IDF file (which stores the query token weights for neural sparse; example). Here are several options:
Option 1: Load from java resources dir
pros: no remote connection, most efficient
cons: increases the size of the ml-commons jar file
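For option 1, a minimal sketch (assuming the tokenizer config is shipped on the plugin classpath; the resource path and class name are hypothetical) could copy the packaged file to a temp location and hand it to DJL:

```java
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class ResourceTokenizerLoader {
    // Hypothetical resource layout inside the ml-commons jar.
    private static final String RESOURCE_FORMAT = "/analysis/%s/tokenizer.json";

    public static HuggingFaceTokenizer load(String analyzerName) throws Exception {
        String resource = String.format(RESOURCE_FORMAT, analyzerName);
        Path tmp = Files.createTempFile(analyzerName, ".json");
        try (InputStream in = ResourceTokenizerLoader.class.getResourceAsStream(resource)) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return HuggingFaceTokenizer.newInstance(tmp);
    }
}
```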
Option 2: Fetch from Hugging Face
pros: easy to implement, better extensibility for custom tokenizers
cons: security challenges if there is a malicious tokenizer config on HF; cannot work if the remote website is unavailable
Option 3: Fetch from OpenSearch artifacts website
Fetch from https://artifacts.opensearch.org/ ; the model files are identical to the sparse tokenizer files.
pros: doesn't increase the jar size; the files are managed by OpenSearch
cons: need to hard-code the map of analyzer names to model artifact URLs somewhere.
Note that for all options, we can build this in a lazy-loading fashion: we never load the same tokenizer more than once, and we only load a tokenizer when it is actually invoked (see the sketch below).
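A hedged sketch of how option 3 and the lazy-loading note could fit together, with a hard-coded analyzer-name-to-artifact-URL map and a load-once cache (the URLs and class names here are placeholders, not the actual hosted paths):

```java
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class LazyTokenizerRegistry {
    // Hard-coded analyzer-name -> artifact-URL map (placeholder URLs).
    private static final Map<String, String> ARTIFACT_URLS = Map.of(
        "bert-uncased", "https://artifacts.opensearch.org/.../bert-base-uncased/tokenizer.json",
        "mbert-uncased", "https://artifacts.opensearch.org/.../bert-base-multilingual-uncased/tokenizer.json"
    );

    private static final Map<String, HuggingFaceTokenizer> CACHE = new ConcurrentHashMap<>();

    // Lazy loading: the tokenizer is downloaded and built at most once, on first use.
    public static HuggingFaceTokenizer get(String analyzerName) {
        return CACHE.computeIfAbsent(analyzerName, LazyTokenizerRegistry::load);
    }

    private static HuggingFaceTokenizer load(String analyzerName) {
        String url = ARTIFACT_URLS.get(analyzerName);
        if (url == null) {
            throw new IllegalArgumentException("Unknown analyzer: " + analyzerName);
        }
        try {
            Path tmp = Files.createTempFile(analyzerName, ".json");
            try (InputStream in = URI.create(url).toURL().openStream()) {
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
            return HuggingFaceTokenizer.newInstance(tmp);
        } catch (Exception e) {
            throw new IllegalStateException("Failed to load tokenizer for " + analyzerName, e);
        }
    }
}
```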
Open discussion
We could also make the HF analyzer configurable by letting users provide an HF model ID, but this needs further review by security teams. The custom analyzer is not in scope for now, but we do want to collect feedback about it.