Add BERT Tokenizer as OpenSearch built-in analyzer #3719
Conversation
Signed-off-by: zhichao-aws <[email protected]>
We can merge this PR after 3.0.0-beta1 gets released.
Would you consider option 3 from the issue? #3708 (comment)
I've changed to option 3.
* Custom Lucene Analyzer that uses the HFModelTokenizer for text analysis.
* Provides a way to process text using Hugging Face models within OpenSearch.
*/
public class HFModelAnalyzer extends Analyzer {
Even though this is an internal class, why is it named HFModelAnalyzer? It doesn't have to be a Hugging Face analyzer, right? Can it be any custom model?
It's a wrapper class of DJL HuggingFaceTokenizer, so we need to reflect that in the class name.
* Supports token weighting and handles overflow scenarios.
*/
@Log4j2
public class HFModelTokenizer extends Tokenizer {
Similarly, can the tokenizer be any custom ML tokenizer?
try {
Files.createDirectories(DJLUtils.getMlEngine().getAnalysisRootPath());
Path filePath = DJLUtils.getMlEngine().getAnalysisRootPath().resolve(name + ZIP_PREFIX);
downloadTokenizerFile(url, filePath, hashCode);
Question here: the file path is hardcoded, so it will download the default tokenizer for bert-uncased. Can the user choose another tokenizer instead of this default one? It looks like that's not allowed at the moment.
Currently we only support loading the tokenizers we've released on the artifacts.opensearch.org website. Users cannot use other tokenizers from custom URLs yet.
Supporting custom files in the future is an open discussion; feel free to add comments in #3708.
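To illustrate the restriction described above, a simple allowlist check is one way such a policy can be enforced. This is a hedged sketch: the class name, the set of supported names, and the URL pattern below are assumptions for illustration, not the actual ml-commons implementation.

```java
import java.util.Set;

// Illustrative sketch: only tokenizers published as OpenSearch artifacts are
// resolvable; any other name is rejected up front.
public class TokenizerRegistry {
    // Hypothetical names of the released built-in tokenizers.
    private static final Set<String> SUPPORTED = Set.of("bert-uncased", "mbert-uncased");

    public static boolean isSupported(String name) {
        return SUPPORTED.contains(name);
    }

    public static String resolveUrl(String name) {
        if (!isSupported(name)) {
            // Custom URLs are not accepted; fail early with a clear message.
            throw new IllegalArgumentException("Unsupported tokenizer: " + name);
        }
        // Illustrative URL pattern only, not the real artifact layout.
        return "https://artifacts.opensearch.org/models/tokenizers/" + name + ".zip";
    }
}
```

Failing early like this keeps the error close to the user's input rather than surfacing later as a download failure.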
downloadTokenizerFile(url, filePath, hashCode);
ZipUtils.unzip(filePath.toFile(), DJLUtils.getMlEngine().getAnalysisRootPath().resolve(name));
} catch (IOException | PrivilegedActionException e) {
log.error("Failed to download analyzer zip file for {}. {}", name, e);
When the exception is caught, this class doesn't have the artifact for the analyzer. What happens then? Is it still worth moving forward when there is no artifact? Should it throw a different exception and ask the user to retry?
I think if we don't throw an exception here, it will throw a NullPointerException later when we try to use it, which is not clear. So it's better to throw the exception here with a clear error message. For retries, I think we may want some auto-retry mechanism in downloadTokenizerFile.
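The auto-retry mechanism suggested above could be a bounded retry wrapper around the download call. A minimal sketch, assuming a Callable-based API and a fixed attempt count (both are illustrative choices, not the PR's actual code):

```java
import java.util.concurrent.Callable;

// Illustrative bounded-retry helper: re-run the action until it succeeds or
// the attempt budget is exhausted, then surface the last failure.
public class RetryingDownloader {
    public static <T> T withRetry(Callable<T> action, int maxAttempts) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw new RuntimeException("Failed after " + maxAttempts + " attempts", last);
    }
}
```

A real implementation would likely add a backoff delay between attempts so transient network errors have time to clear.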
If we don't throw an exception, we'll get an error for the analyze request:
{
"error": {
"root_cause": [
{
"type": "null_pointer_exception",
"reason": "Cannot invoke \"ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.encode(String, boolean, boolean)\" because the return value of \"java.util.function.Supplier.get()\" is null"
}
],
"type": "null_pointer_exception",
"reason": "Cannot invoke \"ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.encode(String, boolean, boolean)\" because the return value of \"java.util.function.Supplier.get()\" is null"
},
"status": 500
}
If we do throw the exception, the caller doesn't catch it and the OpenSearch node will fail.
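One way to avoid the opaque NullPointerException above without crashing the node is a null guard at invocation time that produces an actionable error instead. A sketch, with assumed names:

```java
import java.util.function.Supplier;

// Illustrative guard: if the tokenizer supplier has nothing to give (e.g. the
// artifact download failed at startup), fail the request with a clear message
// instead of letting a NullPointerException escape.
public class TokenizerGuard {
    public static <T> T requireInitialized(Supplier<T> supplier, String analyzerName) {
        T tokenizer = supplier.get();
        if (tokenizer == null) {
            throw new IllegalStateException(
                "Analyzer [" + analyzerName + "] is not initialized; "
                + "the tokenizer artifact may have failed to download.");
        }
        return tokenizer;
    }
}
```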
Breaking the OpenSearch node is not acceptable, so we can only log errors instead of throwing exceptions during initialization.
I've added retry logic: it is triggered when the analyzer is invoked, and it tries to download the file when the tokenizer is not initialized.
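The lazy retry described in that comment can be sketched as a holder that only logs (or surfaces a clear error) on failure and re-attempts initialization on the next invocation. The Supplier-based loader and the class name are assumptions for illustration:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustrative lazy-retry holder: startup failures never crash the node;
// each invocation re-attempts initialization until it succeeds, after which
// the result is cached.
public class LazyTokenizer<T> {
    private final Supplier<T> loader;                 // e.g. download + build the tokenizer
    private final AtomicReference<T> ref = new AtomicReference<>();

    public LazyTokenizer(Supplier<T> loader) {
        this.loader = loader;
    }

    public T get() {
        T current = ref.get();
        if (current != null) {
            return current;                           // already initialized, no retry needed
        }
        try {
            ref.compareAndSet(null, loader.get());    // retry initialization on demand
            return ref.get();
        } catch (RuntimeException e) {
            // Clear, caller-visible error instead of a later NullPointerException.
            throw new IllegalStateException("Tokenizer not initialized; will retry on next call", e);
        }
    }
}
```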
this.token_weights = DJLUtils
.fetchTokenWeights(DJLUtils.getMlEngine().getAnalysisRootPath().resolve(name).resolve(TOKEN_WEIGHTS_FILE_NAME));
} catch (Exception e) {
log.error("Failed to initialize analyzer: {}. {}", name, e);
Same here: when it couldn't initialize the analyzer, we should throw a meaningful exception, maybe:
throw new RuntimeException("Failed to initialize bert-uncased tokenizer. " + e);
String hash = calculateFileHash(filePath.toFile());
if (!hash.equals(hashCode)) {
log.error("Analyzer zip file content hash can't match original hash value.");
throw (new IllegalArgumentException("Analyzer " + name + " zip file content changed."));
Use RuntimeException, since the URL is hardcoded and it's not something the user can fix?
ack
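The hash check above can be sketched with the JDK's MessageDigest. SHA-256 and the helper names here are assumptions; the actual algorithm behind calculateFileHash may differ:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustrative integrity check for a downloaded artifact.
public class HashVerifier {
    public static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always present in the JDK
        }
    }

    public static void verify(byte[] data, String expectedHash, String name) {
        if (!sha256Hex(data).equals(expectedHash)) {
            // The artifact URL is hardcoded, so a mismatch is not user-fixable:
            // a RuntimeException (rather than IllegalArgumentException) fits.
            throw new RuntimeException("Analyzer " + name + " zip file content changed.");
        }
    }
}
```

Streaming the file through digest.update in chunks (instead of loading it into a byte array) is the usual variant for large zip artifacts.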
return action.call();
} finally {
Thread.currentThread().setContextClassLoader(contextClassLoader);
Do we also need to reset the system properties?
The logic is consistent with other places using DJL. We can keep them even after it's finished.
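The save/restore pattern in the snippet above can be written as a small reusable helper: the context class loader is swapped only for the duration of the action and always restored in the finally block, even when the action fails. A sketch with assumed names:

```java
import java.util.concurrent.Callable;

// Illustrative scoped class-loader swap, mirroring the try/finally shown in
// the diff: the caller's context class loader is always restored.
public class ClassLoaderScope {
    public static <T> T withContextClassLoader(ClassLoader loader, Callable<T> action) {
        ClassLoader previous = Thread.currentThread().getContextClassLoader();
        Thread.currentThread().setContextClassLoader(loader);
        try {
            return action.call();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Restore even if the action throws, so no state leaks out.
            Thread.currentThread().setContextClassLoader(previous);
        }
    }
}
```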
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3719 +/- ##
============================================
+ Coverage 79.11% 79.14% +0.02%
- Complexity 7116 7147 +31
============================================
Files 623 628 +5
Lines 31590 31706 +116
Branches 3574 3583 +9
============================================
+ Hits 24993 25093 +100
- Misses 5057 5068 +11
- Partials 1540 1545 +5
Description
This PR adds the bert-base-uncased tokenizer and the bert-base-multilingual-uncased tokenizer as OpenSearch built-in analyzers/tokenizers. Users can use them via the analyze API without any special settings:
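For example, an analyze API request might look like the following. The analyzer name below is an assumption for illustration; check the PR for the exact registered names:

```json
GET /_analyze
{
  "analyzer": "bert-uncased",
  "text": "OpenSearch supports neural sparse search."
}
```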
We also have a follow-up PR in neural-search to make neural-sparse search work with the analyzer.
Related Issues
Resolves #3708
Check List
--signoff.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.