Bug description:
The `normalize` method in `BaseTokenizer` incorrectly calls `self._tokenizer.normalize()`, which doesn't exist. The `Tokenizer` object only has a `normalizer` attribute, not a `normalize` method.
```python
from tokenizers import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(["hello world"], vocab_size=100)

try:
    normalized = tokenizer.normalize("hello world")
    print("Success:", normalized)
except AttributeError as e:
    print("BUG REPRODUCED:", e)
```

Raises `AttributeError: 'tokenizers.Tokenizer' object has no attribute 'normalize'`.
In `base_tokenizer.py`, line 190:

```python
def normalize(self, sequence: str) -> str:
    return self._tokenizer.normalize(sequence)  # BUG: 'normalize' method doesn't exist
```

Should be:

```python
def normalize(self, sequence: str) -> str:
    # FIX: go through the normalizer attribute; its string API is normalize_str
    return self._tokenizer.normalizer.normalize_str(sequence)
```

tokenizers version: 0.22.1
Operating systems tested on: Linux
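
As a workaround until the wrapper is fixed, the `normalizer` attribute can be called directly. A minimal sketch (the BPE model and NFKC + Lowercase normalizer here are illustrative choices, not taken from the original report; the `None` check matters because a tokenizer may have no normalizer configured):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase, NFKC, Sequence

# Build a tokenizer with an explicit normalization pipeline (assumed setup).
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Sequence([NFKC(), Lowercase()])

# Bypass the broken BaseTokenizer.normalize wrapper and use the
# normalizer attribute's string-level API directly.
if tokenizer.normalizer is not None:
    print(tokenizer.normalizer.normalize_str("Héllo WORLD"))
```

`normalize_str` takes and returns a plain `str`, whereas `Normalizer.normalize` operates on a `NormalizedString` in place, which is why the wrapper's one-line fix should route through `normalize_str`.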