Add pcre2 as re2 fallback #50

Merged 1 commit into main on Apr 21, 2025

Conversation

jackzhxng
Contributor

@jackzhxng jackzhxng commented Apr 15, 2025

Adds pcre2 as a fallback regex engine for patterns that re2 cannot compile, namely the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same between test runs before (last commit on main) and after (this PR).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M` (this PR). The increase most likely comes from adding the pcre2 lib.
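To illustrate the fallback idea, here is a minimal, self-contained sketch. It is not the code from this PR: `IRegex`, `PrimaryRegex`, `FallbackRegex`, and `create_regex` are hypothetical names, and both engines are stubs so the example builds without re2 or pcre2 installed. The primary stub only imitates re2's refusal to compile lookaround assertions.

```cpp
#include <initializer_list>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical minimal interface; the library's real class names may differ.
class IRegex {
 public:
  virtual ~IRegex() = default;
  virtual std::string engine() const = 0;
};

// Stand-in for an re2-backed engine. re2 does not support lookaround
// assertions; this stub imitates that with a plain substring check.
// (Real RE2 reports a failed compile through ok() rather than by throwing.)
class PrimaryRegex : public IRegex {
 public:
  explicit PrimaryRegex(const std::string& pattern) {
    for (const char* token : {"(?<!", "(?<=", "(?!", "(?="}) {
      if (pattern.find(token) != std::string::npos) {
        throw std::invalid_argument("unsupported lookaround in: " + pattern);
      }
    }
  }
  std::string engine() const override { return "primary"; }
};

// Stand-in for a pcre2-backed engine, assumed to accept any pattern here.
class FallbackRegex : public IRegex {
 public:
  explicit FallbackRegex(const std::string& /*pattern*/) {}
  std::string engine() const override { return "fallback"; }
};

// Try the primary engine first; use the fallback only if compilation fails.
std::unique_ptr<IRegex> create_regex(const std::string& pattern) {
  try {
    return std::unique_ptr<IRegex>(new PrimaryRegex(pattern));
  } catch (const std::invalid_argument&) {
    return std::unique_ptr<IRegex>(new FallbackRegex(pattern));
  }
}
```

With this shape, patterns the primary engine accepts never touch the fallback, which is consistent with performance staying about the same for existing tokenizers.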

🧱 Stack:

facebook-github-bot pushed a commit that referenced this pull request Apr 18, 2025
Summary:
🧱 Stack:
- [ ] #45
- [ ] #48
- [x] #49
- [ ] #50

### Testing
Pass CI
```
cmake -DTOKENIZERS_BUILD_TEST=ON -DCMAKE_BUILD_TYPE=Debug . -Bbuild && cmake --build build -j9 --config Debug
(cd build && ctest)
```


Differential Revision: D73238728

Pulled By: jackzhxng
@jackzhxng jackzhxng changed the base branch from jz/regex-2 to main April 19, 2025 02:01
@facebook-github-bot
Contributor

@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73295314

jackzhxng added a commit that referenced this pull request Apr 21, 2025
Summary:
Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this pr).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50

Differential Revision: D73295314

Pulled By: jackzhxng

@jackzhxng jackzhxng merged commit 9378e21 into main Apr 21, 2025
5 of 7 checks passed
Labels: CLA Signed, fb-exported
3 participants