
[RFC] Add BERT Tokenizer as OpenSearch built-in analyzer in ml-commons #3708


Open
zhichao-aws opened this issue Apr 3, 2025 · 10 comments · May be fixed by #3719

@zhichao-aws
Member

zhichao-aws commented Apr 3, 2025

What/Why

What problems are you trying to solve?

We're implementing analyzer-based neural sparse query support in the neural-search plugin, which requires BERT tokenizer functionality. (RFC link).

The BERT tokenizer is widely used by OpenSearch pretrained text embedding models. Customers can use the analyzer to debug their input, and the chunking processor also depends on analyzers to count tokens.

However, we've encountered an issue where the same native library (i.e., the one shipped with DJL) cannot be loaded by two different classloaders. Since ml-commons already depends on DJL and handles native library loading, we need to implement the BERT tokenizer functionality within ml-commons to avoid native library loading conflicts.

What are you proposing?

Build the bert-base-uncased and google-bert/bert-base-multilingual-uncased tokenizers as OpenSearch built-in analyzers in ml-commons.

The implementation will leverage DJL's HuggingFaceTokenizer, which is already a dependency in ml-commons. This will serve as the foundation for analyzer-based neural sparse queries in the neural-search plugin.
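
For illustration, here is a minimal sketch of how a Lucene Tokenizer could wrap DJL's HuggingFaceTokenizer. The class name and details are illustrative only, not the actual ml-commons implementation; offset tracking, truncation, and most special-token handling are omitted.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

/** Illustrative sketch only: a Lucene Tokenizer backed by DJL's HuggingFaceTokenizer. */
public final class BertWordPieceTokenizer extends Tokenizer {

    private final HuggingFaceTokenizer bertTokenizer;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    private String[] tokens = new String[0];
    private int cursor;

    public BertWordPieceTokenizer(HuggingFaceTokenizer bertTokenizer) {
        this.bertTokenizer = bertTokenizer;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // Read the whole field value and run it through the WordPiece encoder once.
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int read;
        while ((read = input.read(buffer)) != -1) {
            text.append(buffer, 0, read);
        }
        Encoding encoding = bertTokenizer.encode(text.toString());
        tokens = encoding.getTokens();
        cursor = 0;
    }

    @Override
    public boolean incrementToken() {
        clearAttributes();
        // Emit sub-word tokens one by one, skipping the [CLS]/[SEP] markers added by the encoder.
        while (cursor < tokens.length) {
            String token = tokens[cursor++];
            if ("[CLS]".equals(token) || "[SEP]".equals(token)) {
                continue;
            }
            termAtt.append(token);
            return true;
        }
        return false;
    }
}
```

A corresponding Analyzer would then return such a tokenizer from its createComponents() method, so the neural-search plugin and the standard analysis APIs could use it like any other built-in analyzer.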

Implementations

The main open question is how we load the tokenizer config file (example) and the IDF file (which stores the query token weights for neural sparse; example). Here are several options:

Option 1: Load from the Java resources directory

pros: no remote connection needed, most efficient
cons: increases the size of the ml-commons jar file

Option 2: Fetch from Hugging Face

pros: easy to implement, better extensibility for custom tokenizers
cons: security risk if a malicious tokenizer config is hosted on HF; cannot work if the remote site is unavailable

Option 3: Fetch from the OpenSearch artifacts website

Fetch from https://artifacts.opensearch.org/ ; the model file is identical to the sparse tokenizer files.

pros: doesn't increase the jar size, and the file is managed by OpenSearch
cons: need to hard-code the mapping from analyzer name to model artifact URL somewhere.

Note that for all options, we can build this in a lazy-loading fashion: we never load the same tokenizer more than once, and we only load a tokenizer when it is actually invoked.
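
To make the lazy-load idea concrete, combined with the hard-coded mapping from option 3, a rough sketch follows. The analyzer names, artifact URL strings, and the downloadAndUnpack helper are placeholders, not the real ml-commons code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

/** Illustrative sketch only: lazy, load-once management of built-in tokenizers. */
public final class TokenizerRegistry {

    // Hard-coded mapping from built-in analyzer name to the artifact that backs it (placeholder URLs).
    private static final Map<String, String> ANALYZER_TO_ARTIFACT_URL = Map.of(
        "bert-uncased", "https://artifacts.opensearch.org/placeholder/bert-uncased.zip",
        "mbert-uncased", "https://artifacts.opensearch.org/placeholder/mbert-uncased.zip");

    // Each tokenizer is created at most once, and only when first requested.
    private static final Map<String, HuggingFaceTokenizer> LOADED = new ConcurrentHashMap<>();

    private TokenizerRegistry() {}

    public static HuggingFaceTokenizer get(String analyzerName) {
        return LOADED.computeIfAbsent(analyzerName, name -> {
            String url = ANALYZER_TO_ARTIFACT_URL.get(name);
            if (url == null) {
                throw new IllegalArgumentException("Unknown built-in analyzer: " + name);
            }
            try {
                // Hypothetical helper: download the artifact, verify it, and unpack tokenizer.json locally.
                Path tokenizerJson = downloadAndUnpack(url);
                return HuggingFaceTokenizer.newInstance(tokenizerJson);
            } catch (IOException e) {
                throw new UncheckedIOException("Failed to load tokenizer for analyzer " + name, e);
            }
        });
    }

    private static Path downloadAndUnpack(String url) throws IOException {
        // Fetching, checksum verification, and caching are omitted from this sketch.
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}
```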

Open discussion

We could also make the HF analyzer configurable by letting users provide an HF model id, but this needs further review by security teams. Custom analyzers are out of scope for now, but we do want to collect feedback on this.

@xinyual
Collaborator

xinyual commented Apr 3, 2025

Options 2 and 3 both use a remote source. Does our API have a place to configure the URL, or do we hard-code it?

@zhichao-aws
Member Author

> Options 2 and 3 both use a remote source. Does our API have a place to configure the URL, or do we hard-code it?

Since we're building built-in analyzers, there are no configurable parameters, and we need to hard-code the map somewhere.

@zhichao-aws
Member Author

It takes around 900 KB for the English BERT tokenizer and 4 MB for the multilingual BERT tokenizer.

@Hailong-am
Contributor

Options 2 and 3 both need network access to a remote source, so we would be adding another dependency. Considering that 4 MB is not large compared to the existing ML plugin, which is 300+ MB, assembling it into the jar file may not be a big concern in terms of file size.

@mingshl
Collaborator

mingshl commented Apr 12, 2025

4 MB for multilingual BERT is large for option 1. For a node, loading all the OpenSearch plugins already takes up a lot of disk, but option 1 is better for security. Is this multilingual BERT tokenizer the only option? Is there any way to package essential tokenizers in resources and maybe fetch more tokenizers when needed from a remote source?

@zhichao-aws
Member Author

zhichao-aws commented Apr 13, 2025

> 4 MB for multilingual BERT is large for option 1. For a node, loading all the OpenSearch plugins already takes up a lot of disk, but option 1 is better for security. Is this multilingual BERT tokenizer the only option? Is there any way to package essential tokenizers in resources and maybe fetch more tokenizers when needed from a remote source?

Do you think it makes sense to package the English/multilingual tokenizers in resources for the first release, with follow-up work on loading from remote once we release more tokenizers or users file issues asking to reduce the artifact size?

And how about packing it into a zip, so we can cut the total size roughly in half?

@zhichao-aws
Member Author

@mingshl I've updated the PR to load from a zip artifact; the file size is reduced to about 0.5x.
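
(For reference, unpacking a zipped tokenizer shipped in the plugin resources could look roughly like the sketch below; the class name and resource layout are hypothetical, not the code from the PR.)

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Illustrative sketch only: extract a zipped tokenizer from plugin resources into a temp directory. */
public final class ZippedTokenizerResource {

    private ZippedTokenizerResource() {}

    public static Path extractToTempDir(String resourcePath) throws IOException {
        Path targetDir = Files.createTempDirectory("bert-tokenizer");
        InputStream in = ZippedTokenizerResource.class.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IOException("Resource not found: " + resourcePath);
        }
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = targetDir.resolve(entry.getName()).normalize();
                if (!out.startsWith(targetDir)) {
                    // Guard against "zip slip" entries that try to escape the target directory.
                    throw new IOException("Bad zip entry: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        // DJL's HuggingFaceTokenizer can then be created from the extracted tokenizer.json path.
        return targetDir;
    }
}
```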

@mingshl
Collaborator

mingshl commented Apr 14, 2025

> @mingshl I've updated the PR to load from a zip artifact; the file size is reduced to about 0.5x.

That's much better for option 1.

I am thinking more about option 3. The main concern with option 2 is that it fetches from a Hugging Face link, so it's hard to maintain security because we don't control the files. Option 3 is definitely better than option 2.

Option 3 is a common way we load local models. Check out this config file: we host it on opensearch.org, and we can do the same for the BERT tokenizer: https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-distilroberta-v1/1.0.1/torch_script/config.json

I think the mapping is not a big effort; we have a similar design for the neural field feature. @bzhangam is working on that feature and will also have a mapping stored somewhere.

@bzhangam

> @mingshl I've updated the PR to load from a zip artifact; the file size is reduced to about 0.5x.
>
> That's much better for option 1.
>
> I am thinking more about option 3. The main concern with option 2 is that it fetches from a Hugging Face link, so it's hard to maintain security because we don't control the files. Option 3 is definitely better than option 2.
>
> Option 3 is a common way we load local models. Check out this config file: we host it on opensearch.org, and we can do the same for the BERT tokenizer: https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-distilroberta-v1/1.0.1/torch_script/config.json
>
> I think the mapping is not a big effort; we have a similar design for the neural field feature. @bzhangam is working on that feature and will also have a mapping stored somewhere.

This use case is not the same as our semantic field use case, where we only need to add some simple extra config to each model's config. But the neural plugin will need it to support sparse models for the existing neural sparse query and for the semantic field use case, especially to support the one-click semantic feature with a sparse model.

But for this change I think it's better to use option 3 to avoid increasing the file size. Lazy loading is also a really good idea that I think we should implement. As for the con about hard-coding the mapping, I don't think it's a big concern: even if you ship the file in the ml-commons plugin, you still need to maintain a mapping to decide which file each analyzer/tokenizer should use.

@zhichao-aws
Member Author

zhichao-aws commented Apr 15, 2025

I've changed to option 3 @mingshl @bzhangam

Since the multilingual sparse tokenizer is not released yet, I will create a follow-up PR to add the mbert-uncased analyzer.

The PR to add the sparse tokenizer pipeline: opensearch-project/opensearch-py-ml#459
