Update pattern key for split pretokenizer #38

Merged: 2 commits merged into pytorch-labs:main on Apr 9, 2025

Conversation

@jackzhxng (Contributor) commented Mar 27, 2025

The Split pretokenizer handling was missing the "Regex" key nested under "pattern", e.g.

"pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Removed",
        "invert": true
      },
      ...
]
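
To make the shape of that config concrete, here is a minimal Python sketch of reading the nested "Regex" key from a Split entry. It is not the code in this PR: `extract_split_pattern` is a hypothetical helper, the shortened pattern is made up for illustration, and the "String" branch assumes the Hugging Face format also allows a literal-string pattern.

```python
import json
from typing import Tuple


def extract_split_pattern(pretokenizer: dict) -> Tuple[str, bool]:
    """Return (pattern_text, is_regex) for a "Split" pretokenizer entry.

    Hypothetical helper for illustration; not part of the tokenizers library.
    """
    if pretokenizer.get("type") != "Split":
        raise ValueError("not a Split pretokenizer")
    pattern = pretokenizer["pattern"]
    # The pattern is an object, not a bare string: the regex sits under "Regex".
    if "Regex" in pattern:
        return pattern["Regex"], True
    # Assumption: a literal (non-regex) pattern is stored under "String".
    if "String" in pattern:
        return pattern["String"], False
    raise KeyError('expected a "Regex" or "String" key under "pattern"')


# Shortened, made-up pattern; the real one from tokenizer.json is much longer.
config = json.loads(r'''
{
  "type": "Split",
  "pattern": {"Regex": "\\p{N}{1,3}|\\s+"},
  "behavior": "Removed",
  "invert": true
}
''')

pattern_text, is_regex = extract_split_pattern(config)
print(pattern_text, is_regex)
```

Reading `config["pattern"]` directly as a string would fail here, which is the kind of mismatch the "Regex" key lookup avoids.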

Test (RE2 rejects the negative lookahead `(?!` as an invalid Perl operator; looking into addressing that):

>> cmake-out/examples/tokenize_tool/tokenize_tool hf_tokenizer ~/hf/models--microsoft--Phi-4-mini-instruct/snapshots/c0fb9e74abda11b496b7907a9c6c9009a7a0488f/tokenizer.json "Hello world!"

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1743095300.202419 3915421 re2.cc:237] Error parsing '([^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'l...': invalid perl operator: (?!
Vocab Size: 200029
BOS: 199999
EOS: 199999

PROMPT:
Hello world!

Encoding...
E0000 00:00:1743095300.500576 3915421 re2.cc:921] Invalid RE2: invalid perl operator: (?!
[ ]

Decoding...
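
The RE2 errors in the log come from the `\s+(?!\S)` alternative, since RE2 does not support lookaround assertions. As a rough sketch only, the Python below shows what that alternative does in an engine that does support lookahead (Python's `re`), plus a naive check for lookaround syntax. `uses_lookaround` is a hypothetical helper, it is a plain substring heuristic that can misfire on escaped parentheses, and it is not how this repository handles the problem.

```python
import re

# Lookaround openers that RE2 rejects as invalid Perl operators.
LOOKAROUND_TOKENS = ("(?=", "(?!", "(?<=", "(?<!")


def uses_lookaround(pattern: str) -> bool:
    """Naive substring check (hypothetical helper); may misfire on escaped parens."""
    return any(token in pattern for token in LOOKAROUND_TOKENS)


# Only the whitespace tail of the pattern is used here, because Python's
# built-in `re` module does not understand \p{...} classes.
whitespace_tail = r"\s+(?!\S)|\s+"

# With three spaces before "b", the lookahead makes the first match stop one
# space early, leaving the final space as its own match.
pieces = re.findall(whitespace_tail, "a   b")
print(pieces)
print(uses_lookaround(whitespace_tail))
```

In the full pattern, that leftover single space is what lets a following word be matched together with its leading space.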

@facebook-github-bot added the CLA Signed label Mar 27, 2025
@larryliu0820 merged commit 4167468 into pytorch-labs:main Apr 9, 2025
4 checks passed