
[RFC] Add BERT Tokenizer as OpenSearch built-in analyzer in ml-commons #3708


Open
zhichao-aws opened this issue Apr 3, 2025 · 10 comments · May be fixed by #3719

@zhichao-aws
Member

zhichao-aws commented Apr 3, 2025

What/Why

What problems are you trying to solve?

We're implementing analyzer-based neural sparse query support in the neural-search plugin, which requires BERT tokenizer functionality. (RFC link).

The BERT tokenizer is widely used by OpenSearch pretrained text embedding models. Customers can use the analyzer to debug their input, and the chunking processor also depends on analyzers to count tokens.

However, we've encountered an issue where the same native library (i.e., the one shipped with DJL) cannot be loaded by two different classloaders. Since ml-commons already depends on DJL and handles native library loading, we need to implement the BERT tokenizer functionality within ml-commons to avoid native library loading conflicts.

What are you proposing?

Build the bert-base-uncased and google-bert/bert-base-multilingual-uncased tokenizers as OpenSearch built-in analyzers in ml-commons.

The implementation will leverage DJL's HuggingFaceTokenizer, which is already a dependency in ml-commons. This will serve as the foundation for analyzer-based neural sparse queries in the neural-search plugin.
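
For illustration, here is a minimal sketch of how a Lucene Tokenizer could wrap DJL's HuggingFaceTokenizer. The class name and details are illustrative only, not the actual ml-commons implementation; offset tracking, truncation, and most special-token handling are omitted.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

/** Illustrative sketch only: a Lucene Tokenizer backed by DJL's HuggingFaceTokenizer. */
public final class BertWordPieceTokenizer extends Tokenizer {

    private final HuggingFaceTokenizer bertTokenizer;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    private String[] tokens = new String[0];
    private int cursor;

    public BertWordPieceTokenizer(HuggingFaceTokenizer bertTokenizer) {
        this.bertTokenizer = bertTokenizer;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // Read the whole field value and run it through the WordPiece encoder once.
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int read;
        while ((read = input.read(buffer)) != -1) {
            text.append(buffer, 0, read);
        }
        Encoding encoding = bertTokenizer.encode(text.toString());
        tokens = encoding.getTokens();
        cursor = 0;
    }

    @Override
    public boolean incrementToken() {
        clearAttributes();
        // Emit sub-word tokens one by one, skipping the [CLS]/[SEP] markers added by the encoder.
        while (cursor < tokens.length) {
            String token = tokens[cursor++];
            if ("[CLS]".equals(token) || "[SEP]".equals(token)) {
                continue;
            }
            termAtt.append(token);
            return true;
        }
        return false;
    }
}
```

A corresponding Analyzer would then return such a tokenizer from its createComponents() method, so the neural-search plugin and the standard analysis APIs could use it like any other built-in analyzer.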

Implementations

The main open question is how we load the tokenizer config file (example) and the IDF file (which stores the query token weights for neural sparse; example). Here are several options:

Option 1: Load from the Java resources directory

pros: no remote connection needed, most efficient
cons: increases the size of the ml-commons jar file

Option 2: Fetch from Hugging Face

pros: easy to implement, better extensibility for custom tokenizers
cons: security risk if a malicious tokenizer config is hosted on HF; cannot work if the remote site is unavailable

Option 3: Fetch from the OpenSearch artifacts website

Fetch from https://artifacts.opensearch.org/ ; the model file is identical to the sparse tokenizer files.

pros: doesn't increase the jar size, and the file is managed by OpenSearch
cons: need to hard-code the mapping from analyzer name to model artifact URL somewhere.

Note that for all options, we can build this in a lazy-loading fashion: we never load the same tokenizer more than once, and we only load a tokenizer when it is actually invoked.
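
To make the lazy-load idea concrete, combined with the hard-coded mapping from option 3, a rough sketch follows. The analyzer names, artifact URL strings, and the downloadAndUnpack helper are placeholders, not the real ml-commons code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

/** Illustrative sketch only: lazy, load-once management of built-in tokenizers. */
public final class TokenizerRegistry {

    // Hard-coded mapping from built-in analyzer name to the artifact that backs it (placeholder URLs).
    private static final Map<String, String> ANALYZER_TO_ARTIFACT_URL = Map.of(
        "bert-uncased", "https://artifacts.opensearch.org/placeholder/bert-uncased.zip",
        "mbert-uncased", "https://artifacts.opensearch.org/placeholder/mbert-uncased.zip");

    // Each tokenizer is created at most once, and only when first requested.
    private static final Map<String, HuggingFaceTokenizer> LOADED = new ConcurrentHashMap<>();

    private TokenizerRegistry() {}

    public static HuggingFaceTokenizer get(String analyzerName) {
        return LOADED.computeIfAbsent(analyzerName, name -> {
            String url = ANALYZER_TO_ARTIFACT_URL.get(name);
            if (url == null) {
                throw new IllegalArgumentException("Unknown built-in analyzer: " + name);
            }
            try {
                // Hypothetical helper: download the artifact, verify it, and unpack tokenizer.json locally.
                Path tokenizerJson = downloadAndUnpack(url);
                return HuggingFaceTokenizer.newInstance(tokenizerJson);
            } catch (IOException e) {
                throw new UncheckedIOException("Failed to load tokenizer for analyzer " + name, e);
            }
        });
    }

    private static Path downloadAndUnpack(String url) throws IOException {
        // Fetching, checksum verification, and caching are omitted from this sketch.
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}
```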

Open discussion

We could also make the HF analyzer configurable by letting users provide an HF model id, but this needs further review by security teams. Custom analyzers are out of scope for now, but we do want to collect feedback on this.

@xinyual
Collaborator

xinyual commented Apr 3, 2025

Options 2 and 3 both use a remote source. Does our API have a place to configure the URL, or do we hard-code it?

@zhichao-aws
Member Author

> Options 2 and 3 both use a remote source. Does our API have a place to configure the URL, or do we hard-code it?

Since we're building built-in analyzers, there are no configurable parameters, and we need to hard-code the map somewhere.

@zhichao-aws
Member Author

It takes around 900 KB for the English BERT tokenizer and 4 MB for the multilingual BERT tokenizer.

@Hailong-am
Contributor

Options 2 and 3 both need network access to a remote source, so we would be adding another dependency. Considering that 4 MB is not large compared to the existing ML plugin, which is 300+ MB, assembling it into the jar file may not be a big concern in terms of file size.

@mingshl
Collaborator

mingshl commented Apr 12, 2025

4 MB for multilingual BERT is large for option 1. For a node, loading all the OpenSearch plugins already takes up a lot of disk, but option 1 is better for security. Is this multilingual BERT tokenizer the only option? Is there any way to package essential tokenizers in resources and maybe fetch more tokenizers when needed from a remote source?

@zhichao-aws
Member Author

zhichao-aws commented Apr 13, 2025

> 4 MB for multilingual BERT is large for option 1. For a node, loading all the OpenSearch plugins already takes up a lot of disk, but option 1 is better for security. Is this multilingual BERT tokenizer the only option? Is there any way to package essential tokenizers in resources and maybe fetch more tokenizers when needed from a remote source?

Do you think it makes sense to package the English/multilingual tokenizers in resources for the first release, with follow-up work on loading from remote once we release more tokenizers or users file issues asking to reduce the artifact size?

And how about packing it into a zip, so we can cut the total size roughly in half?

@zhichao-aws
Member Author

@mingshl I've updated the PR to load from a zip artifact; the file size is reduced to about 0.5x.
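
(For reference, unpacking a zipped tokenizer shipped in the plugin resources could look roughly like the sketch below; the class name and resource layout are hypothetical, not the code from the PR.)

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Illustrative sketch only: extract a zipped tokenizer from plugin resources into a temp directory. */
public final class ZippedTokenizerResource {

    private ZippedTokenizerResource() {}

    public static Path extractToTempDir(String resourcePath) throws IOException {
        Path targetDir = Files.createTempDirectory("bert-tokenizer");
        InputStream in = ZippedTokenizerResource.class.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IOException("Resource not found: " + resourcePath);
        }
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = targetDir.resolve(entry.getName()).normalize();
                if (!out.startsWith(targetDir)) {
                    // Guard against "zip slip" entries that try to escape the target directory.
                    throw new IOException("Bad zip entry: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        // DJL's HuggingFaceTokenizer can then be created from the extracted tokenizer.json path.
        return targetDir;
    }
}
```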

@mingshl
Collaborator

mingshl commented Apr 14, 2025

> @mingshl I've updated the PR to load from a zip artifact; the file size is reduced to about 0.5x.

That's much better for option 1.

I am thinking more about option 3. The main concern with option 2 is that it fetches from a Hugging Face link, so it's hard to maintain security because we don't control the files. Option 3 is definitely better than option 2.

Option 3 is a common way we load local models. Check out this config file: we host it on opensearch.org, and we can do the same for the BERT tokenizer: https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-distilroberta-v1/1.0.1/torch_script/config.json

I think the mapping is not a big effort; we have a similar design for the neural field feature. @bzhangam is working on that feature and will also have a mapping stored somewhere.

@bzhangam

> @mingshl I've updated the PR to load from a zip artifact; the file size is reduced to about 0.5x.
>
> That's much better for option 1.
>
> I am thinking more about option 3. The main concern with option 2 is that it fetches from a Hugging Face link, so it's hard to maintain security because we don't control the files. Option 3 is definitely better than option 2.
>
> Option 3 is a common way we load local models. Check out this config file: we host it on opensearch.org, and we can do the same for the BERT tokenizer: https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-distilroberta-v1/1.0.1/torch_script/config.json
>
> I think the mapping is not a big effort; we have a similar design for the neural field feature. @bzhangam is working on that feature and will also have a mapping stored somewhere.

This use case is not the same as our semantic field use case, where we only need to add some simple extra config to each model's config. But the neural plugin will need it to support sparse models for the existing neural sparse query and for the semantic field use case, especially to support the one-click semantic feature with a sparse model.

But for this change I think it's better to use option 3 to avoid increasing the file size. Lazy loading is also a really good idea that I think we should implement. As for the con about hard-coding the mapping, I don't think it's a big concern: even if you ship the file in the ml-commons plugin, you still need to maintain a mapping to decide which file each analyzer/tokenizer should use.

@zhichao-aws
Member Author

zhichao-aws commented Apr 15, 2025

I've changed to option 3 @mingshl @bzhangam

Since the multilingual sparse tokenizer is not released yet, I will create a follow-up PR to add the mbert-uncased analyzer.

The PR to add the sparse tokenizer pipeline: opensearch-project/opensearch-py-ml#459
