Releases: huggingface/tokenizers

Node v0.8.0

02 Sep 18:12

BREAKING CHANGES

  • Many improvements to the Trainer (#519).
    Files must now be provided first when calling tokenizer.train(files, trainer).

Features

  • Add the TemplateProcessing
  • Add WordLevel and Unigram models (#490)
  • Add nmtNormalizer and precompiledNormalizer normalizers (#490)
  • Add templateProcessing post-processor (#490)
  • Add digitsPreTokenizer pre-tokenizer (#490)
  • Add support for mapping to sequences (#506)
  • Add splitPreTokenizer pre-tokenizer (#542)
  • Add behavior option to the punctuationPreTokenizer (#657)
  • Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)
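
The templateProcessing post-processor above builds a final sequence from a template string in which special tokens appear verbatim and `$A`/`$B` stand for the input sequences. As a rough plain-Python illustration of the idea (a toy sketch, not the library's implementation — the real processor works on token ids and type ids):

```python
# Toy sketch of the template idea behind templateProcessing.
# "[CLS]"/"[SEP]" are special tokens kept verbatim; "$A" and "$B"
# are placeholders for the first and second input sequence.

def apply_template(template, seq_a, seq_b=None):
    out = []
    for piece in template.split():
        if piece == "$A":
            out.extend(seq_a)
        elif piece == "$B":
            out.extend(seq_b or [])
        else:
            out.append(piece)  # special token, copied as-is
    return out

tokens = apply_template("[CLS] $A [SEP] $B [SEP]",
                        ["hello", "world"], ["hi"])
# tokens == ["[CLS]", "hello", "world", "[SEP]", "hi", "[SEP]"]
```

The real templates also carry type-id annotations (e.g. `$B:1`), which this sketch omits.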

Fixes

  • Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
  • Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)

Python v0.10.3

24 May 21:31
755e5f5

Fixed

  • [#686]: Fix SPM conversion process for whitespace deduplication
  • [#707]: Fix stripping strings containing Unicode characters
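
The stripping fix concerns trimming whitespace (including non-ASCII whitespace such as NBSP) while keeping the offsets of the kept text consistent with the original string. A toy offset-aware strip in plain Python (illustration only, not the library's normalizer):

```python
def strip_with_offsets(text):
    """Strip leading/trailing whitespace and report the span of the
    kept text in the original string. str.strip handles Unicode
    whitespace such as U+00A0, so offsets stay character-accurate."""
    start = len(text) - len(text.lstrip())
    end = len(text.rstrip())
    return text[start:end], (start, end)

print(strip_with_offsets("\u00a0 héllo  "))  # ('héllo', (2, 7))
```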

Added

  • [#693]: Add a CTC Decoder for Wav2Vec2 models
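
CTC decoding, as used for Wav2Vec2-style acoustic models, collapses consecutive repeated predictions and then drops the blank token. A minimal greedy sketch of the idea in plain Python (not the library's Rust decoder):

```python
from itertools import groupby

def ctc_greedy_decode(token_ids, blank_id=0):
    """Collapse consecutive repeats, then drop the CTC blank token."""
    collapsed = [key for key, _ in groupby(token_ids)]
    return [t for t in collapsed if t != blank_id]

# repeats collapse, blanks (0) separate genuine duplicates:
ids = [8, 8, 5, 0, 12, 12, 0, 12, 15]
print(ctc_greedy_decode(ids))  # [8, 5, 12, 12, 15]
```

Note how the blank between the two runs of `12` preserves a real double letter that naive deduplication would lose.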

Removed

  • [#714]: Removed support for Python 3.5

Python v0.10.2

05 Apr 20:48

Fixed

  • [#652]: Fix offsets for Precompiled corner case
  • [#656]: Fix BPE continuing_subword_prefix
  • [#674]: Fix Metaspace serialization problems
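
The continuing_subword_prefix option marks tokens that continue a word (WordPiece-style `##`). A rough sketch of how such a prefix is interpreted when merging tokens back into words (an illustration of the convention, not the library's code):

```python
def merge_subwords(tokens, prefix="##"):
    """Join tokens back into words: a token starting with the
    continuing-subword prefix extends the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            words[-1] += tok[len(prefix):]
        else:
            words.append(tok)
    return words

print(merge_subwords(["token", "##izer", "rocks"]))
# ['tokenizer', 'rocks']
```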

Python v0.10.1

04 Feb 15:38
bc8bbf6

Fixed

  • [#616]: Fix SentencePiece tokenizers conversion
  • [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
  • [#618]: Fix Normalizer.normalize with PyNormalizedStringRefMut
  • [#620]: Fix serialization/deserialization for overlapping models
  • [#621]: Fix ByteLevel instantiation from a previously saved state (using __getstate__())

Python v0.10.0

12 Jan 21:36

Added

  • [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
  • [#519]: Add a WordLevelTrainer used to train a WordLevel model
  • [#533]: Add support for conda builds
  • [#542]: Add Split pre-tokenizer to easily split using a pattern
  • [#544]: Add the ability to train from memory, which also improves the integration with datasets
  • [#590]: Add getters/setters for components on BaseTokenizer
  • [#574]: Add fuse_unk option to SentencePieceBPETokenizer
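
The WordLevelTrainer ([#519]) and in-memory training ([#544]) can be pictured as building a frequency-ranked vocabulary from an iterator of texts. A simplified plain-Python sketch (the real trainer is implemented in Rust; `train_word_level` and its parameters are hypothetical names for illustration):

```python
from collections import Counter

def train_word_level(texts, vocab_size=10, special_tokens=("[UNK]",)):
    """Toy word-level trainer: whitespace-split an iterator of texts,
    count tokens, and keep the most frequent ones after the special
    tokens, up to vocab_size entries."""
    counts = Counter(tok for text in texts for tok in text.split())
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for tok, _ in counts.most_common(vocab_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab

corpus = ["low low low", "lower newest", "newest newest"]
print(train_word_level(corpus, vocab_size=4))
# {'[UNK]': 0, 'low': 1, 'newest': 2, 'lower': 3}
```

Because the corpus is just an iterator of strings, the same shape works for lists held in memory or rows streamed from a dataset.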

Changed

  • [#509]: The .pyi stub files are now generated automatically
  • [#519]: Each Model can return its associated Trainer with get_trainer()
  • [#530]: The various attributes of each component can now be read and set (e.g.
    tokenizer.model.dropout = 0.1)
  • [#538]: The API Reference has been improved and is now up-to-date.

Fixed

  • [#519]: During training, the Model is now trained in-place. This fixes several bugs that
    previously required reloading the Model after training.
  • [#539]: Fix BaseTokenizer enable_truncation docstring

Python v0.10.0rc1

08 Dec 18:32
Pre-release

Added

  • [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
  • [#519]: Add a WordLevelTrainer used to train a WordLevel model
  • [#533]: Add support for conda builds
  • [#542]: Add Split pre-tokenizer to easily split using a pattern
  • [#544]: Add the ability to train from memory, which also improves the integration with datasets

Changed

  • [#509]: The .pyi stub files are now generated automatically
  • [#519]: Each Model can return its associated Trainer with get_trainer()
  • [#530]: The various attributes of each component can now be read and set (e.g.
    tokenizer.model.dropout = 0.1)
  • [#538]: The API Reference has been improved and is now up-to-date.

Fixed

  • [#519]: During training, the Model is now trained in-place. This fixes several bugs that
    previously required reloading the Model after training.
  • [#539]: Fix BaseTokenizer enable_truncation docstring

Python v0.9.4

10 Nov 04:23
b122737

Fixed

  • [#492]: Fix from_file on BertWordPieceTokenizer
  • [#498]: Fix the link to download sentencepiece_model_pb2.py
  • [#500]: Fix a typo in the docs quicktour

Changed

  • [#506]: Improve Encoding mappings for pairs of sequences
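
The improved mappings let each token in a pair encoding be traced back to its originating sequence (compare the library's Encoding.sequence_ids, which uses None for special tokens). A toy sketch of the idea over string tokens (illustration only):

```python
def sequence_ids(tokens, specials=("[CLS]", "[SEP]")):
    """Assign None to special tokens, 0 to tokens of the first
    sequence, and 1 to tokens after the first separator --
    mimicking the mapping for a pair encoding."""
    ids, current = [], 0
    for tok in tokens:
        if tok in specials:
            ids.append(None)
            if tok == "[SEP]":
                current = 1
        else:
            ids.append(current)
    return ids

toks = ["[CLS]", "hello", "world", "[SEP]", "hi", "[SEP]"]
print(sequence_ids(toks))  # [None, 0, 0, None, 1, None]
```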

Python v0.9.3

26 Oct 20:41

Fixed

  • [#470]: Fix hanging error when training with custom component
  • [#476]: TemplateProcessing serialization is now deterministic
  • [#481]: Fix SentencePieceBPETokenizer.from_files

Added

  • [#477]: UnicodeScripts PreTokenizer to avoid merges between various scripts
  • [#480]: Unigram now accepts an initial_alphabet and handles special_tokens correctly
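
The UnicodeScripts pre-tokenizer splits text wherever the writing script changes, so merges never cross script boundaries. A very rough plain-Python approximation that guesses a character's script from its Unicode name (the real pre-tokenizer uses proper Unicode script data, not names):

```python
import unicodedata

def script_of(ch):
    """Crude script guess: first word of the Unicode character name,
    e.g. 'LATIN', 'HIRAGANA', 'CJK'. Illustration only."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def split_on_scripts(text):
    """Group consecutive characters that share the same script."""
    chunks = []
    for ch in text:
        script = script_of(ch)
        if chunks and chunks[-1][1] == script:
            chunks[-1][0] += ch
        else:
            chunks.append([ch, script])
    return [chunk for chunk, _ in chunks]

print(split_on_scripts("abcあいう"))  # ['abc', 'あいう']
```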

Python v0.9.2

15 Oct 14:16

Fixed

  • [#464]: Fix a problem with RobertaProcessing being deserialized as BertProcessing

Python v0.9.1

13 Oct 19:04

Fixed

  • [#459]: Fix a bug where long tokenizer.json files would be incorrectly deserialized