BertTokenizer and BertTokenizerFast have different behavior when return_overflowing_tokens is requested #28900
Comments
Hey! Thanks for opening this issue. Would you like to dive into this and open a PR for a fix? It might be a known bug: overflowing tokens are not supported on all slow tokenizers. The fast behaviour is probably the right one.
I don't know what the correct behaviour is. You can get the overflowing tokens from both tokenizers; it's just that the returned data structure needs to be more consistent. I prefer the fast tokenizer's behaviour, but the BatchEncoding returns None for overflowing_tokens, which is inconsistent with the API advertised in the reference docs.
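A minimal sketch of the structural difference being described (key names as in the 4.37-era API; whether the slow path actually populates `overflowing_tokens` is the open question in this issue, and the input text here is purely illustrative):

```python
from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained("bert-base-uncased")
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "an illustrative sentence that is long enough to overflow"

# Slow: one truncated row; leftovers are documented to land in a separate
# flat "overflowing_tokens" list (reported as None here, hence this issue).
s = slow(text, max_length=6, truncation=True, return_overflowing_tokens=True)
print(s["input_ids"])                    # flat list of ids, a single row
print(s.get("overflowing_tokens"))

# Fast: leftovers are re-encoded as additional rows of input_ids, tied back
# to their source sample via "overflow_to_sample_mapping".
f = fast(text, max_length=6, truncation=True, return_overflowing_tokens=True)
print(f["input_ids"])                    # list of rows
print(f.get("overflow_to_sample_mapping"))
```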
@ArthurZucker @amyeroberts I'm interested in taking up this issue. Just wanted to confirm something else as well: shouldn't the behavior of AutoTokenizer match the specific tokenizer? E.g. I tried something like the snippet below, and the outputs are much different from using the BertTokenizer shown by nikola above.
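The original snippet was not preserved here; a hypothetical reconstruction of such a check (the input text and arguments are assumptions, not the commenter's exact code):

```python
from transformers import AutoTokenizer

# AutoTokenizer returns the fast (Rust-backed) tokenizer by default
# (use_fast=True), so its overflow format follows BertTokenizerFast,
# not the slow BertTokenizer.
a_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
a_inputs = a_tok(
    text="hey this is jino, im just reading the api dont mind me",  # assumed input
    add_special_tokens=True,
    max_length=6,
    truncation=True,
    padding="max_length",
    return_overflowing_tokens=True,
)
print(a_inputs["input_ids"])                       # list of chunk rows (fast-style)
print(a_inputs.get("overflow_to_sample_mapping"))  # present for fast tokenizers
print(a_inputs.get("overflowing_tokens"))          # None: slow-style key absent
```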
Yes, fast and slow tokenizers are supposed to give a similar output (not the same format, but all the overflow etc. should match).
Hi @ArthurZucker / @amyeroberts,

```python
from transformers import BertTokenizer, BertTokenizerFast, BatchEncoding, AutoTokenizer

n_tok = BertTokenizer.from_pretrained("bert-base-uncased")
f_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")

text = "hey this is jino, im just reading the api dont mind me"

n_inputs: BatchEncoding = n_tok(text=text, add_special_tokens=True, max_length=6,
                                truncation=True, padding="max_length", return_overflowing_tokens=True)
o = n_inputs.get("overflowing_tokens")
print(f"Overflowing {o}")
print(n_inputs)
```

The following is the output:
In an optimal world, we want the slow to match the fast! I am not certain in this specific case which one is "expected" 😅
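For reference, the fast-tokenizer side of the snippet above would look like this (continuing from that code, so `f_tok` and `text` are already defined; a sketch, since the exact chunking can vary by version):

```python
f_inputs: BatchEncoding = f_tok(
    text=text,
    add_special_tokens=True,
    max_length=6,
    truncation=True,
    padding="max_length",
    return_overflowing_tokens=True,
)
# No "overflowing_tokens" key here: the overflow surfaces as extra rows of
# input_ids, indexed back to the source text by overflow_to_sample_mapping.
print(f_inputs["input_ids"])
print(f_inputs.get("overflow_to_sample_mapping"))
print(f_inputs.get("overflowing_tokens"))  # None for the fast tokenizer
```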
Hi @ArthurZucker / @amyeroberts,
System Info

transformers version: 4.37.2

Who can help?

@ArthurZucker

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior

For `n_inputs['input_ids']` we get `[101, 7592, 2026, 2171, 2003, 102]`, and for `f_inputs['input_ids']` we get `[[101, 7592, 2026, 2171, 2003, 102], [101, 24794, 1998, 1045, 2139, 102], [101, 8569, 2290, 19081, 2085, 102]]`. Outputs should be the same.
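One way to check that only the packaging differs (a self-contained sketch, assuming the slow tokenizer returns the flat `overflowing_tokens` list its docs describe; the input text is illustrative, not the original repro's):

```python
from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained("bert-base-uncased")
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "an illustrative sentence that is long enough to overflow"  # assumption

kwargs = dict(max_length=6, truncation=True, return_overflowing_tokens=True)
s = slow(text, **kwargs)
f = fast(text, **kwargs)

special = set(slow.all_special_ids)

# Slow: truncated ids plus the (possibly missing) flat overflow list.
slow_ids = [i for i in s["input_ids"] if i not in special]
slow_ids += [i for i in (s.get("overflowing_tokens") or []) if i not in special]

# Fast: flatten the chunk rows back into one token stream.
fast_ids = [i for row in f["input_ids"] for i in row if i not in special]

print(slow_ids == fast_ids)  # True would mean only the data structure differs
```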