
[RFC] Add BERT Tokenizer as OpenSearch built-in analyzer in ml-commons #3708

Closed
@zhichao-aws

Description


What/Why

What problems are you trying to solve?

We're implementing analyzer-based neural sparse query support in the neural-search plugin, which requires BERT tokenizer functionality. (RFC link).

The BERT tokenizer is widely used by OpenSearch pretrained text embedding models. Customers can use the analyzer to debug how their input is tokenized. In addition, the chunking processor also depends on analyzers to count tokens.

However, we've encountered an issue where the native libraries (i.e. those bundled with DJL) cannot be loaded by two different classloaders. Since ml-commons already depends on DJL and handles native library loading, we need to implement the BERT tokenizer functionality within ml-commons to avoid native library loading conflicts.

What are you proposing?

Build the bert-base-uncased and google-bert/bert-base-multilingual-uncased tokenizers as OpenSearch built-in analyzers in ml-commons.

The implementation will leverage DJL's HuggingFaceTokenizer, which is already a dependency in ml-commons. This will serve as the foundation for analyzer-based neural sparse queries in the neural-search plugin.
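As a rough illustration only, a Lucene Tokenizer wrapping DJL's HuggingFaceTokenizer might look like the sketch below. The class name and the handling of special tokens are assumptions, not the final ml-commons design:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class BertWordPieceTokenizer extends Tokenizer {
    private final HuggingFaceTokenizer tokenizer;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private String[] tokens;
    private int index;

    public BertWordPieceTokenizer(HuggingFaceTokenizer tokenizer) {
        this.tokenizer = tokenizer;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (tokens == null) {
            // Read the whole field value and run it through the DJL tokenizer once.
            // Special tokens such as [CLS]/[SEP] may need to be filtered out,
            // depending on how the tokenizer is configured.
            tokens = tokenizer.encode(readAll()).getTokens();
        }
        if (index >= tokens.length) {
            return false;
        }
        clearAttributes();
        termAtt.setEmpty().append(tokens[index++]);
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        tokens = null;
        index = 0;
    }

    private String readAll() throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = input.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```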

Implementations

The main ambiguity is how we load the tokenizer config file (example) and the IDF file, which stores the query token weights for neural sparse (example). Here are several options:

Option 1: Load from java resources dir

pros: no remote connection needed; most efficient
cons: increases the size of the ml-commons jar file
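A minimal sketch of Option 1, assuming the tokenizer.json is bundled under a hypothetical resource path inside the ml-commons jar and copied to a temporary file so DJL can load it from disk:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public final class ResourceTokenizerLoader {
    // Hypothetical resource path; the actual layout inside the ml-commons jar is TBD.
    private static final String RESOURCE = "/analyzer/bert-base-uncased/tokenizer.json";

    public static HuggingFaceTokenizer load() throws Exception {
        try (InputStream is = ResourceTokenizerLoader.class.getResourceAsStream(RESOURCE)) {
            // Copy the bundled tokenizer.json out of the jar so DJL can load it
            // from a local path.
            Path tmp = Files.createTempFile("bert-tokenizer", ".json");
            Files.copy(is, tmp, StandardCopyOption.REPLACE_EXISTING);
            return HuggingFaceTokenizer.newInstance(tmp);
        }
    }
}
```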

Option 2: Fetch from Hugging Face

pros: easy to implement; better extensibility for custom tokenizers
cons: security challenges if there is a malicious tokenizer config on HF; cannot work if the remote website is unavailable
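A minimal sketch of Option 2: newInstance(String) resolves the identifier against the Hugging Face hub and downloads the tokenizer files on first use.

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class HfFetchExample {
    public static void main(String[] args) {
        // Downloads the tokenizer from the Hugging Face hub on first use.
        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("bert-base-uncased")) {
            Encoding encoding = tokenizer.encode("OpenSearch neural sparse search");
            // WordPiece tokens for the input text.
            System.out.println(String.join(" ", encoding.getTokens()));
        }
    }
}
```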

Option 3: Fetch from OpenSearch artifacts website

Fetch from https://artifacts.opensearch.org/; the model file is identical to the sparse tokenizer files.

pros: doesn't increase the jar size; the file is managed by OpenSearch
cons: needs to hard-code the mapping from analyzer name to model artifact URL somewhere (see the sketch below).
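A minimal sketch of Option 3 with a hypothetical analyzer-name-to-URL map; the URLs below are placeholders, not real artifact locations:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public final class ArtifactTokenizerLoader {
    // Hypothetical mapping; the real URLs would point at the existing sparse
    // tokenizer artifacts hosted under https://artifacts.opensearch.org/.
    private static final Map<String, String> ARTIFACT_URLS = Map.of(
        "bert-uncased", "https://artifacts.opensearch.org/placeholder/bert-base-uncased/tokenizer.json",
        "mbert-uncased", "https://artifacts.opensearch.org/placeholder/bert-base-multilingual-uncased/tokenizer.json"
    );

    public static HuggingFaceTokenizer load(String analyzerName) throws Exception {
        URL url = new URL(ARTIFACT_URLS.get(analyzerName));
        Path tmp = Files.createTempFile(analyzerName, ".json");
        try (InputStream is = url.openStream()) {
            Files.copy(is, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return HuggingFaceTokenizer.newInstance(tmp);
    }
}
```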

Note that for all options, we can build this in a lazy-loading fashion: each tokenizer is loaded at most once, and only when it is actually invoked, as sketched below.
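A minimal sketch of such lazy loading, using computeIfAbsent so each analyzer's tokenizer is created at most once; the load body is a stand-in for whichever option above is chosen:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public final class TokenizerCache {
    private static final Map<String, HuggingFaceTokenizer> CACHE = new ConcurrentHashMap<>();

    // Each tokenizer is loaded at most once, and only when an analyzer actually uses it.
    public static HuggingFaceTokenizer get(String analyzerName) {
        return CACHE.computeIfAbsent(analyzerName, TokenizerCache::load);
    }

    private static HuggingFaceTokenizer load(String analyzerName) {
        // Stand-in: replace with the chosen loading strategy (resources dir,
        // Hugging Face hub, or OpenSearch artifacts site).
        return HuggingFaceTokenizer.newInstance(analyzerName);
    }
}
```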

Open discussion

We could also make the HF analyzer configurable by letting users provide an HF model id, but this needs further review by the security team. Custom analyzers are out of scope for now, but we do want to collect feedback on this.
