Add BERT Tokenizer as OpenSearch built-in analyzer #3719

Merged: 7 commits, Apr 21, 2025

zhichao-aws
Member

@zhichao-aws zhichao-aws commented Apr 9, 2025

Description

This PR adds the bert-base-uncased and bert-base-multilingual-uncased tokenizers as OpenSearch built-in analyzers/tokenizers. Users can call them through the analyze API without any special settings:

POST /_analyze
{
    "tokenizer" : "mbert-uncased",
    "text" : "hello world 你好"
}
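
For readers who want to script the request above, here is a minimal Python sketch using only the standard library. It is not part of this PR. The base URL (an unsecured local node on port 9200), the helper names, and the assumption that the response follows the usual analyze API shape (a top-level "tokens" array whose entries carry a "token" field) are all illustrative and should be checked against your cluster.

```python
import json
import urllib.request

# Assumption: a local, unsecured OpenSearch node with ml-commons installed.
OPENSEARCH_URL = "http://localhost:9200"


def build_analyze_request(text, tokenizer="mbert-uncased", base_url=OPENSEARCH_URL):
    """Build (but do not send) a POST /_analyze request for the given tokenizer."""
    body = json.dumps({"tokenizer": tokenizer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_body):
    """Return the token strings from an _analyze response body (a JSON string).

    Assumes the usual response shape: {"tokens": [{"token": ...}, ...]}.
    """
    parsed = json.loads(response_body)
    return [entry["token"] for entry in parsed.get("tokens", [])]


def analyze(text, tokenizer="mbert-uncased", base_url=OPENSEARCH_URL):
    """Send the request and return the token strings. Requires a running cluster."""
    request = build_analyze_request(text, tokenizer, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_tokens(response.read().decode("utf-8"))
```

Calling analyze("hello world 你好") against a cluster that includes this change should return the tokens produced by the mbert-uncased tokenizer. The exact tokens depend on the tokenizer vocabulary, so none are listed here.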

We also have a follow-up PR in neural-search to make neural-sparse search work with the analyzer.

Related Issues

Resolves #3708

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws zhichao-aws had a problem deploying to ml-commons-cicd-env-require-approval April 9, 2025 07:41 — with GitHub Actions Failure
@zhichao-aws
Member Author

We can merge this PR after 3.0.0-beta1 is released.

Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws zhichao-aws had a problem deploying to ml-commons-cicd-env-require-approval April 9, 2025 09:24 — with GitHub Actions Failure
Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws zhichao-aws had a problem deploying to ml-commons-cicd-env-require-approval April 14, 2025 07:39 — with GitHub Actions Error
@zhichao-aws zhichao-aws had a problem deploying to ml-commons-cicd-env-require-approval April 14, 2025 07:39 — with GitHub Actions Failure
@mingshl
Collaborator

mingshl commented Apr 14, 2025

Would you consider option 3 from the issue? #3708 (comment)

Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws zhichao-aws temporarily deployed to ml-commons-cicd-env-require-approval April 18, 2025 06:45 — with GitHub Actions Inactive
Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws zhichao-aws temporarily deployed to ml-commons-cicd-env-require-approval April 18, 2025 10:46 — with GitHub Actions Inactive
@zhichao-aws zhichao-aws temporarily deployed to ml-commons-cicd-env-require-approval April 18, 2025 17:45 — with GitHub Actions Inactive
@xinyual
Collaborator

xinyual commented Apr 21, 2025

LGTM!

@zane-neo zane-neo merged commit 088c1a5 into opensearch-project:main Apr 21, 2025
13 checks passed
@mingshl
Collaborator

mingshl commented Apr 22, 2025

@zane-neo @xinyual @zhichao-aws We are already in code freeze for new feature PRs, and this PR was merged after the code freeze deadline. Could you work with the release team to get an exception? Otherwise, please revert this PR.

@zhichao-aws
Member Author

Checked with @peterzhuamazon. We can still push feature code to main, as long as it is not backported to the 3.0 branch.

@mingshl
Collaborator

mingshl commented Apr 23, 2025

> Checked with @peterzhuamazon. We can still push feature code to main, as long as it is not backported to the 3.0 branch.

Thanks for double-checking.

opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 29, 2025
* bert analyzer

Signed-off-by: zhichao-aws <[email protected]>

* add license header

Signed-off-by: zhichao-aws <[email protected]>

* add rest test case

Signed-off-by: zhichao-aws <[email protected]>

* load from zip

Signed-off-by: zhichao-aws <[email protected]>

* address comments

Signed-off-by: zhichao-aws <[email protected]>

* retry for init

Signed-off-by: zhichao-aws <[email protected]>

---------

Signed-off-by: zhichao-aws <[email protected]>
(cherry picked from commit 088c1a5)
akolarkunnu pushed a commit to akolarkunnu/ml-commons that referenced this pull request Jun 6, 2025
…ct#3719)

Signed-off-by: Abdul Muneer Kolarkunnu <[email protected]>
Development

Successfully merging this pull request may close these issues.

[RFC] Add BERT Tokenizer as OpenSearch built-in analyzer in ml-commons
6 participants