Releases: huggingface/tokenizers

Node v0.8.0

02 Sep 18:12

BREAKING CHANGES

  • Many improvements to the Trainer (#519).
    Files must now be provided first when calling tokenizer.train(files, trainer).

Features

  • Add the TemplateProcessing
  • Add WordLevel and Unigram models (#490)
  • Add nmtNormalizer and precompiledNormalizer normalizers (#490)
  • Add templateProcessing post-processor (#490)
  • Add digitsPreTokenizer pre-tokenizer (#490)
  • Add support for mapping to sequences (#506)
  • Add splitPreTokenizer pre-tokenizer (#542)
  • Add behavior option to the punctuationPreTokenizer (#657)
  • Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)
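
The templateProcessing post-processor above builds a final sequence from a template string in which special tokens appear verbatim and `$A`/`$B` stand for the input sequences. As a rough plain-Python illustration of the idea (a toy sketch, not the library's implementation — the real processor works on token ids and type ids):

```python
# Toy sketch of the template idea behind templateProcessing.
# "[CLS]"/"[SEP]" are special tokens kept verbatim; "$A" and "$B"
# are placeholders for the first and second input sequence.

def apply_template(template, seq_a, seq_b=None):
    out = []
    for piece in template.split():
        if piece == "$A":
            out.extend(seq_a)
        elif piece == "$B":
            out.extend(seq_b or [])
        else:
            out.append(piece)  # special token, copied as-is
    return out

tokens = apply_template("[CLS] $A [SEP] $B [SEP]",
                        ["hello", "world"], ["hi"])
# tokens == ["[CLS]", "hello", "world", "[SEP]", "hi", "[SEP]"]
```

The real templates also carry type-id annotations (e.g. `$B:1`), which this sketch omits.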

Fixes

  • Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
  • Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)

Python v0.10.3

24 May 21:31
755e5f5

Fixed

  • [#686]: Fix SPM conversion process for whitespace deduplication
  • [#707]: Fix stripping strings containing Unicode characters
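
The stripping fix concerns trimming whitespace (including non-ASCII whitespace such as NBSP) while keeping the offsets of the kept text consistent with the original string. A toy offset-aware strip in plain Python (illustration only, not the library's normalizer):

```python
def strip_with_offsets(text):
    """Strip leading/trailing whitespace and report the span of the
    kept text in the original string. str.strip handles Unicode
    whitespace such as U+00A0, so offsets stay character-accurate."""
    start = len(text) - len(text.lstrip())
    end = len(text.rstrip())
    return text[start:end], (start, end)

print(strip_with_offsets("\u00a0 héllo  "))  # ('héllo', (2, 7))
```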

Added

  • [#693]: Add a CTC Decoder for Wav2Vec2 models
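
CTC decoding, as used for Wav2Vec2-style acoustic models, collapses consecutive repeated predictions and then drops the blank token. A minimal greedy sketch of the idea in plain Python (not the library's Rust decoder):

```python
from itertools import groupby

def ctc_greedy_decode(token_ids, blank_id=0):
    """Collapse consecutive repeats, then drop the CTC blank token."""
    collapsed = [key for key, _ in groupby(token_ids)]
    return [t for t in collapsed if t != blank_id]

# repeats collapse, blanks (0) separate genuine duplicates:
ids = [8, 8, 5, 0, 12, 12, 0, 12, 15]
print(ctc_greedy_decode(ids))  # [8, 5, 12, 12, 15]
```

Note how the blank between the two runs of `12` preserves a real double letter that naive deduplication would lose.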

Removed

  • [#714]: Removed support for Python 3.5

Python v0.10.2

05 Apr 20:48

Fixed

  • [#652]: Fix offsets for Precompiled corner case
  • [#656]: Fix BPE continuing_subword_prefix
  • [#674]: Fix Metaspace serialization problems
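
The continuing_subword_prefix option marks tokens that continue a word (WordPiece-style `##`). A rough sketch of how such a prefix is interpreted when merging tokens back into words (an illustration of the convention, not the library's code):

```python
def merge_subwords(tokens, prefix="##"):
    """Join tokens back into words: a token starting with the
    continuing-subword prefix extends the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            words[-1] += tok[len(prefix):]
        else:
            words.append(tok)
    return words

print(merge_subwords(["token", "##izer", "rocks"]))
# ['tokenizer', 'rocks']
```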

Python v0.10.1

04 Feb 15:38
bc8bbf6

Fixed

  • [#616]: Fix SentencePiece tokenizers conversion
  • [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
  • [#618]: Fix Normalizer.normalize with PyNormalizedStringRefMut
  • [#620]: Fix serialization/deserialization for overlapping models
  • [#621]: Fix ByteLevel instantiation from a previously saved state (using __getstate__())

Python v0.10.0

12 Jan 21:36

Added

  • [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
  • [#519]: Add a WordLevelTrainer used to train a WordLevel model
  • [#533]: Add support for conda builds
  • [#542]: Add Split pre-tokenizer to easily split using a pattern
  • [#544]: Add the ability to train from memory, which also improves the integration with datasets
  • [#590]: Add getters/setters for components on BaseTokenizer
  • [#574]: Add fuse_unk option to SentencePieceBPETokenizer
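
The WordLevelTrainer ([#519]) and in-memory training ([#544]) can be pictured as building a frequency-ranked vocabulary from an iterator of texts. A simplified plain-Python sketch (the real trainer is implemented in Rust; `train_word_level` and its parameters are hypothetical names for illustration):

```python
from collections import Counter

def train_word_level(texts, vocab_size=10, special_tokens=("[UNK]",)):
    """Toy word-level trainer: whitespace-split an iterator of texts,
    count tokens, and keep the most frequent ones after the special
    tokens, up to vocab_size entries."""
    counts = Counter(tok for text in texts for tok in text.split())
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for tok, _ in counts.most_common(vocab_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab

corpus = ["low low low", "lower newest", "newest newest"]
print(train_word_level(corpus, vocab_size=4))
# {'[UNK]': 0, 'low': 1, 'newest': 2, 'lower': 3}
```

Because the corpus is just an iterator of strings, the same shape works for lists held in memory or rows streamed from a dataset.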

Changed

  • [#509]: The .pyi stub files are now generated automatically
  • [#519]: Each Model can return its associated Trainer with get_trainer()
  • [#530]: The various attributes of each component can now be read and set (e.g.
    tokenizer.model.dropout = 0.1)
  • [#538]: The API Reference has been improved and is now up-to-date.

Fixed

  • [#519]: During training, the Model is now trained in-place. This fixes several bugs that
    previously required reloading the Model after training.
  • [#539]: Fix BaseTokenizer enable_truncation docstring

Python v0.10.0rc1

08 Dec 18:32
Pre-release

Added

  • [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
  • [#519]: Add a WordLevelTrainer used to train a WordLevel model
  • [#533]: Add support for conda builds
  • [#542]: Add Split pre-tokenizer to easily split using a pattern
  • [#544]: Add the ability to train from memory, which also improves the integration with datasets

Changed

  • [#509]: The .pyi stub files are now generated automatically
  • [#519]: Each Model can return its associated Trainer with get_trainer()
  • [#530]: The various attributes of each component can now be read and set (e.g.
    tokenizer.model.dropout = 0.1)
  • [#538]: The API Reference has been improved and is now up-to-date.

Fixed

  • [#519]: During training, the Model is now trained in-place. This fixes several bugs that
    previously required reloading the Model after training.
  • [#539]: Fix BaseTokenizer enable_truncation docstring

Python v0.9.4

10 Nov 04:23
b122737

Fixed

  • [#492]: Fix from_file on BertWordPieceTokenizer
  • [#498]: Fix the link to download sentencepiece_model_pb2.py
  • [#500]: Fix a typo in the docs quicktour

Changed

  • [#506]: Improve Encoding mappings for pairs of sequences
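
The improved mappings let each token in a pair encoding be traced back to its originating sequence (compare the library's Encoding.sequence_ids, which uses None for special tokens). A toy sketch of the idea over string tokens (illustration only):

```python
def sequence_ids(tokens, specials=("[CLS]", "[SEP]")):
    """Assign None to special tokens, 0 to tokens of the first
    sequence, and 1 to tokens after the first separator --
    mimicking the mapping for a pair encoding."""
    ids, current = [], 0
    for tok in tokens:
        if tok in specials:
            ids.append(None)
            if tok == "[SEP]":
                current = 1
        else:
            ids.append(current)
    return ids

toks = ["[CLS]", "hello", "world", "[SEP]", "hi", "[SEP]"]
print(sequence_ids(toks))  # [None, 0, 0, None, 1, None]
```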

Python v0.9.3

26 Oct 20:41

Fixed

  • [#470]: Fix hanging error when training with custom component
  • [#476]: TemplateProcessing serialization is now deterministic
  • [#481]: Fix SentencePieceBPETokenizer.from_files

Added

  • [#477]: UnicodeScripts PreTokenizer to avoid merges between various scripts
  • [#480]: Unigram now accepts an initial_alphabet and handles special_tokens correctly
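
The UnicodeScripts pre-tokenizer splits text wherever the writing script changes, so merges never cross script boundaries. A very rough plain-Python approximation that guesses a character's script from its Unicode name (the real pre-tokenizer uses proper Unicode script data, not names):

```python
import unicodedata

def script_of(ch):
    """Crude script guess: first word of the Unicode character name,
    e.g. 'LATIN', 'HIRAGANA', 'CJK'. Illustration only."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def split_on_scripts(text):
    """Group consecutive characters that share the same script."""
    chunks = []
    for ch in text:
        script = script_of(ch)
        if chunks and chunks[-1][1] == script:
            chunks[-1][0] += ch
        else:
            chunks.append([ch, script])
    return [chunk for chunk, _ in chunks]

print(split_on_scripts("abcあいう"))  # ['abc', 'あいう']
```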

Python v0.9.2

15 Oct 14:16

Fixed

  • [#464]: Fix a problem with RobertaProcessing being deserialized as BertProcessing

Python v0.9.1

13 Oct 19:04

Fixed

  • [#459]: Fix a bug where long tokenizer.json files would be incorrectly deserialized