Use common base class private functions for TikToken #45

Merged on Apr 15, 2025 (1 commit)

@larryliu0820 merged commit 6a6e24f into pytorch-labs:main on Apr 15, 2025
4 checks passed
facebook-github-bot pushed a commit that referenced this pull request Apr 18, 2025
Summary:
🧱 Stack:
- [ ] #45
- [ ] #48
- [x] #49
- [ ] #50

### Testing
Pass CI
```
cmake -DTOKENIZERS_BUILD_TEST=ON -DCMAKE_BUILD_TYPE=Debug . -Bbuild && cmake --build build -j9 --config Debug
(cd build && ctest)
```


Differential Revision: D73238728

Pulled By: jackzhxng
jackzhxng added a commit that referenced this pull request Apr 21, 2025
Summary:
Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.
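
As a rough illustration of what a negative lookbehind does (the regex construct this change adds pcre2 for), here is a minimal Python sketch using the standard `re` module. The pattern and sample text are hypothetical and are not taken from this PR or from any HuggingFace tokenizer config; the library itself is C++ and uses pcre2, not Python.

```python
import re

# Hypothetical pattern, for illustration only (not from the PR).
# (?<![#\d]) is a negative lookbehind: the match may only start where the
# previous character is neither '#' nor a digit.
pattern = re.compile(r"(?<![#\d])\d+")

text = "see #45 and 48"
matches = pattern.findall(text)

# "45" is rejected because it follows '#'; "5" alone is rejected because it
# follows the digit '4'; "48" follows a space, so it is accepted.
print(matches)
```

A regex engine without lookbehind support cannot compile a pattern like this at all, which is presumably why a PCRE-compatible backend is needed for tokenizer patterns that rely on it.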

Performance stays about the same, judging by test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (the last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this PR).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50

Differential Revision: D73295314

Pulled By: jackzhxng
Labels: CLA Signed
3 participants