Skip to content

Add BERT Tokenizer as OpenSearch built-in analyzer #3719

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

zhichao-aws
Copy link
Member

@zhichao-aws zhichao-aws commented Apr 9, 2025

Description

This PR add bert-base-uncased tokenizer and bert-base-multilingual-uncased tokenizer as OpenSearch built-in analyzer/tokenizer. Users can use them via analyze API without doing any special settings:

POST /_analyze
{
    "tokenizer" : "mbert-uncased",
    "text" : "hello world 你好"
}

We also have a follow up PR in neural-search to make neural-sparse search work with the analyzer.

Related Issues

Resolves #3708

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 9, 2025 07:41 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 9, 2025 07:41 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 9, 2025 07:41 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 9, 2025 07:41 — with GitHub Actions Waiting
@zhichao-aws
Copy link
Member Author

We can merge this PR after 3.0.0-beta1 get released

Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 9, 2025 09:24 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 9, 2025 09:24 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 9, 2025 09:24 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 9, 2025 09:24 — with GitHub Actions Waiting
Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws zhichao-aws had a problem deploying to ml-commons-cicd-env-require-approval April 14, 2025 07:39 — with GitHub Actions Error
@zhichao-aws zhichao-aws had a problem deploying to ml-commons-cicd-env-require-approval April 14, 2025 07:39 — with GitHub Actions Error
@zhichao-aws zhichao-aws had a problem deploying to ml-commons-cicd-env-require-approval April 14, 2025 07:39 — with GitHub Actions Failure
@zhichao-aws zhichao-aws had a problem deploying to ml-commons-cicd-env-require-approval April 14, 2025 07:39 — with GitHub Actions Failure
@mingshl
Copy link
Collaborator

mingshl commented Apr 14, 2025

would you consider option 3 from the issue? #3708 (comment)

@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 15, 2025 02:28 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 15, 2025 02:28 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 15, 2025 02:28 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 15, 2025 02:28 — with GitHub Actions Waiting
Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws
Copy link
Member Author

would you consider option 3 from the issue? #3708 (comment)

I've changed to option 3.

@zhichao-aws zhichao-aws temporarily deployed to ml-commons-cicd-env-require-approval April 15, 2025 08:24 — with GitHub Actions Inactive
@zhichao-aws zhichao-aws temporarily deployed to ml-commons-cicd-env-require-approval April 15, 2025 08:24 — with GitHub Actions Inactive
@zhichao-aws zhichao-aws temporarily deployed to ml-commons-cicd-env-require-approval April 15, 2025 08:24 — with GitHub Actions Inactive
@zhichao-aws zhichao-aws temporarily deployed to ml-commons-cicd-env-require-approval April 15, 2025 08:24 — with GitHub Actions Inactive
* Custom Lucene Analyzer that uses the HFModelTokenizer for text analysis.
* Provides a way to process text using Hugging Face models within OpenSearch.
*/
public class HFModelAnalyzer extends Analyzer {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even though this is an internal class, but I am thinking why named it as HFModelAnalyer, it doesn't have to be from huggingface analyzer right? is it can be any custom model?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a wrapper class of DJL HuggingFaceTokenizer. So we need to refect it in the class name

* Supports token weighting and handles overflow scenarios.
*/
@Log4j2
public class HFModelTokenizer extends Tokenizer {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly, is the the tokenizer can be any customMLTokenizer ?

try {
Files.createDirectories(DJLUtils.getMlEngine().getAnalysisRootPath());
Path filePath = DJLUtils.getMlEngine().getAnalysisRootPath().resolve(name + ZIP_PREFIX);
downloadTokenizerFile(url, filePath, hashCode);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question here, the file path is coded here, so it will download the default tokenizer for bert-uncased, can user change to use other tokenizer instead of this default tokenizer? it looks like it's not allowed at the moment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we only support to load the tokenizers we've released in artifacts.opensearch.org website. Users can not use other tokenizers from custom urls now.

It's an open discussion to support custom files in the future, feel free to add comments in #3708

downloadTokenizerFile(url, filePath, hashCode);
ZipUtils.unzip(filePath.toFile(), DJLUtils.getMlEngine().getAnalysisRootPath().resolve(name));
} catch (IOException | PrivilegedActionException e) {
log.error("Failed to download analyzer zip file for {}. {}", name, e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when catched the exception, this class doesn't have the artifact for analyzer, what will happen? does it still worth moving forward when there is no artifact? should it throw a different exception and ask users for retries?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we don't throw exception here it will throw null pointer exception later when we try to use it but it's not clear. So I think it's better to throw the exception here with clear error message. For retry I think we may want to have some auto retry mechanism in downloadTokenizerFile.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not throw exception, we'll get an error for the analyzer request:

{
    "error": {
        "root_cause": [
            {
                "type": "null_pointer_exception",
                "reason": "Cannot invoke \"ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.encode(String, boolean, boolean)\" because the return value of \"java.util.function.Supplier.get()\" is null"
            }
        ],
        "type": "null_pointer_exception",
        "reason": "Cannot invoke \"ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.encode(String, boolean, boolean)\" because the return value of \"java.util.function.Supplier.get()\" is null"
    },
    "status": 500
}

If throw exception, the caller doesn't catch the exception and the OpenSearch node will fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Break the OpenSearch node is not acceptable. So we can only log errors instead of throw exceptions during initialization.

I've added retry logics, it will be triggered when the analyzer is invoked, and it tries to download file when the tokenizer is not initialized

this.token_weights = DJLUtils
.fetchTokenWeights(DJLUtils.getMlEngine().getAnalysisRootPath().resolve(name).resolve(TOKEN_WEIGHTS_FILE_NAME));
} catch (Exception e) {
log.error("Failed to initialize analyzer: {}. {}", name, e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, when it couldn't initialize the analyzer, we should throw meaningful exception, maybe
throw new RuntimeException("Failed to initialize bert-uncased tokenizer. " + e);

String hash = calculateFileHash(filePath.toFile());
if (!hash.equals(hashCode)) {
log.error("Analyzer zip file content hash can't match original hash value.");
throw (new IllegalArgumentException("Analyzer " + name + " zip file content changed."));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use RuntimeException since the url is hardcoded and it's not something that user can fix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack


return action.call();
} finally {
Thread.currentThread().setContextClassLoader(contextClassLoader);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to reset the system properties?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logics are consistent with other place using djl. We can keep them even finished.

Copy link

codecov bot commented Apr 15, 2025

Codecov Report

Attention: Patch coverage is 87.39496% with 15 lines in your changes missing coverage. Please review.

Project coverage is 79.14%. Comparing base (2ef82b6) to head (b7b698f).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...ch/ml/engine/analysis/HFModelTokenizerFactory.java 75.75% 7 Missing and 1 partial ⚠️
...pensearch/ml/engine/analysis/HFModelTokenizer.java 89.13% 2 Missing and 3 partials ⚠️
...gine/algorithms/tokenize/SparseTokenizerModel.java 0.00% 1 Missing ⚠️
...va/org/opensearch/ml/engine/analysis/DJLUtils.java 93.75% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3719      +/-   ##
============================================
+ Coverage     79.11%   79.14%   +0.02%     
- Complexity     7116     7147      +31     
============================================
  Files           623      628       +5     
  Lines         31590    31706     +116     
  Branches       3574     3583       +9     
============================================
+ Hits          24993    25093     +100     
- Misses         5057     5068      +11     
- Partials       1540     1545       +5     
Flag Coverage Δ
ml-commons 79.14% <87.39%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@zhichao-aws zhichao-aws had a problem deploying to ml-commons-cicd-env-require-approval April 15, 2025 17:31 — with GitHub Actions Failure
@zhichao-aws zhichao-aws had a problem deploying to ml-commons-cicd-env-require-approval April 15, 2025 17:31 — with GitHub Actions Error
Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 16, 2025 08:22 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 16, 2025 08:22 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 16, 2025 08:22 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 16, 2025 08:22 — with GitHub Actions Waiting
Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 16, 2025 08:27 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 16, 2025 08:27 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 16, 2025 08:27 — with GitHub Actions Waiting
@zhichao-aws zhichao-aws requested a deployment to ml-commons-cicd-env-require-approval April 16, 2025 08:27 — with GitHub Actions Waiting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC] Add BERT Tokenizer as OpenSearch built-in analyzer in ml-commons
3 participants