Releases · huggingface/tokenizers
Node v0.8.0
BREAKING CHANGES
- Many improvements on the Trainer (#519). The files must now be provided first when calling `tokenizer.train(files, trainer)`.
Features
- Adding the `TemplateProcessing`
- Add `WordLevel` and `Unigram` models (#490)
- Add `nmtNormalizer` and `precompiledNormalizer` normalizers (#490)
- Add `templateProcessing` post-processor (#490)
- Add `digitsPreTokenizer` pre-tokenizer (#490)
- Add support for mapping to sequences (#506)
- Add `splitPreTokenizer` pre-tokenizer (#542)
- Add `behavior` option to the `punctuationPreTokenizer` (#657)
- Add the ability to load tokenizers from the Hugging Face Hub using `fromPretrained` (#780)
Fixes
Python v0.10.3
Python v0.10.2
Python v0.10.1
Fixed
- [#616]: Fix SentencePiece tokenizers conversion
- [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
- [#618]: Fix `Normalizer.normalize` with `PyNormalizedStringRefMut`
- [#620]: Fix serialization/deserialization for overlapping models
- [#621]: Fix `ByteLevel` instantiation from a previously saved state (using `__getstate__()`)
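As a minimal sketch of what #621 exercises (assuming the `tokenizers` package is installed): pickling a component goes through `__getstate__()`/`__setstate__()`, so a pickle round-trip of a `ByteLevel` pre-tokenizer hits the restore path this release fixed.

```python
import pickle

from tokenizers.pre_tokenizers import ByteLevel

# Serialize and restore a ByteLevel pre-tokenizer; pickle uses
# __getstate__ / __setstate__ under the hood, the path fixed in #621.
bl = ByteLevel(add_prefix_space=False)
restored = pickle.loads(pickle.dumps(bl))
print(type(restored).__name__)
```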
Python v0.10.0
Added
- [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
- [#519]: Add a `WordLevelTrainer` used to train a `WordLevel` model
- [#533]: Add support for conda builds
- [#542]: Add Split pre-tokenizer to easily split using a pattern
- [#544]: Ability to train from memory. This also improves the integration with `datasets`
- [#590]: Add getters/setters for components on BaseTokenizer
- [#574]: Add `fuse_unk` option to SentencePieceBPETokenizer
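The two headline additions above (#519 and #544) combine naturally: a `WordLevelTrainer` can train a `WordLevel` model straight from an in-memory iterator. A minimal sketch, assuming the `tokenizers` package is installed; the corpus strings are purely illustrative.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Build a WordLevel model and train it straight from an in-memory
# iterator: no files on disk are needed.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordLevelTrainer(special_tokens=["[UNK]"])
corpus = ["hello world", "hello tokenizers"]  # any iterator of strings
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("hello world").tokens)
```

Because any iterator of strings is accepted, this is also what makes the `datasets` integration smoother: a dataset column can be streamed in directly.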
Changed
- [#509]: Automatically stubbing the `.pyi` files
- [#519]: Each `Model` can return its associated `Trainer` with `get_trainer()`
- [#530]: The various attributes on each component can be get/set (i.e. `tokenizer.model.dropout = 0.1`)
- [#538]: The API Reference has been improved and is now up-to-date.
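A short sketch of the two API-shape changes above (#519 and #530), assuming the `tokenizers` package is installed:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# A model hands back its matching trainer type (#519), and component
# attributes are now plain readable/writable properties (#530).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

trainer = tokenizer.model.get_trainer()  # a BpeTrainer for a BPE model
print(type(trainer).__name__)

tokenizer.model.dropout = 0.1            # settable attribute
print(tokenizer.model.dropout)
```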
Fixed
Python v0.10.0rc1
Added
- [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
- [#519]: Add a `WordLevelTrainer` used to train a `WordLevel` model
- [#533]: Add support for conda builds
- [#542]: Add Split pre-tokenizer to easily split using a pattern
- [#544]: Ability to train from memory. This also improves the integration with `datasets`
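The Split pre-tokenizer added in #542 takes a pattern plus a behavior controlling what happens to the delimiter. A minimal sketch, assuming the `tokenizers` package is installed; the hyphenated example string is purely illustrative.

```python
from tokenizers.pre_tokenizers import Split

# Split on a literal pattern; behavior "removed" drops the delimiter
# (other behaviors such as "isolated" keep it as its own piece).
splitter = Split("-", "removed")
pieces = splitter.pre_tokenize_str("well-known-word")
print(pieces)
```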
Changed
- [#509]: Automatically stubbing the `.pyi` files
- [#519]: Each `Model` can return its associated `Trainer` with `get_trainer()`
- [#530]: The various attributes on each component can be get/set (i.e. `tokenizer.model.dropout = 0.1`)
- [#538]: The API Reference has been improved and is now up-to-date.
Fixed
Python v0.9.4
Python v0.9.3
Python v0.9.2
Fixed
- [#464] Fix a problem with `RobertaProcessing` being deserialized as `BertProcessing`
Python v0.9.1
Fixed
- [#459] Fix a problem with deserialization