Unified tokenizer type onboarding #1540

srikary12 · 2025-05-11T07:36:19Z

Resolves #1536

Moves TokenizerType to a centralized place.

Updates the code that uses it, such as torchchat/cli/builder.py, to rely on the centralized type.
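
As a rough illustration of what a centralized tokenizer type can look like, here is a minimal sketch. The member names follow the enum quoted later in this thread; the helper parse_tokenizer_type and its case-insensitive lookup are hypothetical and are not taken from the PR.

```python
from enum import Enum, auto


class TokenizerType(Enum):
    """Single shared definition that other modules import instead of redefining."""

    Tiktoken = auto()
    SentencePiece = auto()


def parse_tokenizer_type(name: str) -> TokenizerType:
    """Hypothetical helper: map a case-insensitive name to an enum member."""
    for member in TokenizerType:
        if member.name.lower() == name.lower():
            return member
    raise ValueError(f"Unknown tokenizer type: {name!r}")


if __name__ == "__main__":
    print(parse_tokenizer_type("sentencepiece"))
```

With one definition, onboarding a new tokenizer would mean adding one member here rather than editing several per-file copies.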

To do

Add documentation for onboarding future tokenizers

pytorch-bot · 2025-05-11T07:36:22Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1540

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 New Failures, 3 Cancelled Jobs

As of commit 22fdd56 with merge base 71552fd:

NEW FAILURES - The following jobs have failed:

pull / runner-aoti (macos-14-xlarge) (gh)
Process completed with exit code 1.
pull / runner-et (macos-14-xlarge) (gh)
Process completed with exit code 1.
pull / test-build-runner-et-android / linux-job (gh)
RuntimeError: Command docker exec -t 9214b36d7c4eb5649d9bac39b5af7da9cfd9e286220c68cc9642b09afd9098f8 /exec failed with exit code 1
pull / test-tinystories-executorch (macos-14-xlarge) (gh)
Process completed with exit code 1.
pull / test-torchao-experimental-cpp (macos-14-xlarge) (gh)
Process completed with exit code 1.
pull / test-torchao-experimental-et (macos-14-xlarge) (gh)
Process completed with exit code 1.
Run parallel prefill / test-sdpa-backends-export / linux-job (gh)
RuntimeError: Command docker exec -t 6906f7bb7689649eb3ab05f6153c0f351c9f5b4d528ad192b8ecf2ff2ec818a6 /exec failed with exit code 1
Run the aoti runner with CUDA using stories / test-runner-aot-cuda / linux-job (gh)
RuntimeError: Command docker exec -t abfc46d5c1382824d2b37b692d54a247e7916d17a475160e4fb9885ecc66f351 /exec failed with exit code 1

CANCELLED JOBS - The following jobs were cancelled. Please retry:

pull / runner-aoti (16-core-ubuntu) (gh)
##[error]The operation was canceled.
pull / runner-et (16-core-ubuntu) (gh)
##[error]The operation was canceled.
pull / test-tinystories-executorch (16-core-ubuntu) (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-05-11T07:36:25Z

Hi @srikary12!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot · 2025-05-11T08:08:45Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

zhenyan-zhang-meta

Thanks for the contribution!

Some comments in general:

We haven't touched

torchchat/dist_run.py

Lines 67 to 69 in 0299a37

    
    class TokenizerType(Enum):
        Tiktoken = auto()
        SentencePiece = auto()

yet, can we use your newly created enum class there as well?
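
One reason a leftover per-file copy matters can be sketched as follows. This is an illustration only: SharedTokenizerType and LocalTokenizerType are hypothetical stand-ins for the centralized enum and a local copy, and uses_tiktoken is an invented helper, not code from the PR.

```python
from enum import Enum, auto


class SharedTokenizerType(Enum):
    """Stand-in for the centralized enum."""

    Tiktoken = auto()
    SentencePiece = auto()


class LocalTokenizerType(Enum):
    """Stand-in for a duplicate enum defined in a single file."""

    Tiktoken = auto()
    SentencePiece = auto()


def uses_tiktoken(tokenizer_type: SharedTokenizerType) -> bool:
    """Hypothetical check written against the shared enum."""
    return tokenizer_type is SharedTokenizerType.Tiktoken


# The underlying values match, but members of different Enum classes
# never compare equal, so checks written against one enum reject the other.
assert LocalTokenizerType.Tiktoken.value == SharedTokenizerType.Tiktoken.value
assert LocalTokenizerType.Tiktoken != SharedTokenizerType.Tiktoken
assert not uses_tiktoken(LocalTokenizerType.Tiktoken)
assert uses_tiktoken(SharedTokenizerType.Tiktoken)
```

Importing the shared enum everywhere avoids this kind of silent mismatch.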

I ran python torchchat.py generate with llama2, llama3, and granite-code, and it works. The CI failures may come from an issue in the previous run or from merge conflicts. I've resolved the conflicts, but if CI still fails we may need to take a look. cc @Jack-Khuu

tokenizer/tokenizer_type.py

torchchat/cli/builder.py

zhenyan-zhang-meta · 2025-05-13T18:01:14Z

cc @Jack-Khuu for a review as well.

srikary12 · 2025-05-14T14:57:10Z

@zhenyan-zhang-meta I've modified dist_run.py and resolved conversations.

srikary12 · 2025-05-14T17:50:50Z

I'll fix test cases. Looks like I missed some comparisons.

srikary12 · 2025-05-27T02:44:46Z

@zhenyan-zhang-meta I've checked all the cases. Let me know if I am missing something.

zhenyan-zhang-meta · 2025-05-28T06:54:58Z

@srikary12 I think overall it looks fine, thanks for the contribution! Just turned on the CI for some extra checks.

Meanwhile, it would be great if you could update the docs.

srikary12 · 2025-06-02T16:02:03Z

@zhenyan-zhang-meta we might need to check CI/CD. Not sure if I am missing something. I'll try to add documentation asap.

srikary12 · 2025-07-01T16:48:48Z

Tokenizers:tiktoken.cpp:71

I'll update it as required. Should I add documentation as well? A few CI jobs are failing while installing requirements.

Unified tokenizer type onboarding

a94afd9

facebook-github-bot added the CLA Signed label May 11, 2025

zhenyan-zhang-meta suggested changes May 13, 2025

zhenyan-zhang-meta requested a review from Jack-Khuu May 13, 2025 18:00

zhenyan-zhang-meta and others added 3 commits May 13, 2025 11:02

Merge branch 'main' into Tokenizer-New-Type-Onboarding

3ad161c

Refactor tokenizer type handling to use Enum directly and remove redundant methods

3d2c8d6

Merge branch 'Tokenizer-New-Type-Onboarding' of https://github.com/srikary12/torchchat into Tokenizer-New-Type-Onboarding

9cba67a

srikary12 requested a review from zhenyan-zhang-meta May 14, 2025 14:57

Merge branch 'main' into Tokenizer-New-Type-Onboarding

7c59b92

srikary12 and others added 2 commits May 25, 2025 11:02

Merge branch 'main' into Tokenizer-New-Type-Onboarding

d77de35

Modify export to use unified tokenizer type

22fdd56

zhenyan-zhang-meta marked this pull request as ready for review May 28, 2025 06:50

zhenyan-zhang-meta approved these changes Jun 30, 2025
