Skip to content

Unified tokenizer type onboarding #1540

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

srikary12
Copy link

Resolves #1536

Moves TokenizerType to a centralized place.

Adds support at different places.

To do

Add documentation for future tokenizer onboard

Copy link

pytorch-bot bot commented May 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1540

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 New Failures, 3 Cancelled Jobs

As of commit 22fdd56 with merge base 71552fd (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link

Hi @srikary12!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot
Copy link

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 11, 2025
Copy link
Contributor

@zhenyan-zhang-meta zhenyan-zhang-meta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

Some comments in general:

  • We haven't touched

    torchchat/dist_run.py

    Lines 67 to 69 in 0299a37

    class TokenizerType(Enum):
    Tiktoken = auto()
    SentencePiece = auto()
    yet, can we use your newly created enum class there as well?
  • I've helped to run tests python torchchat.py generate llama2|llama3|granite-code and it works. The CI might have some issue in the previous run or it might be from merge conflicts. I've resolved the conflicts but in case if there's still CI issue we might need to take a look. cc @Jack-Khuu

@zhenyan-zhang-meta
Copy link
Contributor

cc @Jack-Khuu for a review as well.

@srikary12
Copy link
Author

@zhenyan-zhang-meta I've modified dist_run.py and resolved conversations.

@srikary12
Copy link
Author

I'll fix test cases. Looks like I missed some comparisons.

@srikary12
Copy link
Author

@zhenyan-zhang-meta I've checked all the cases. Let me know if I am missing something.

@zhenyan-zhang-meta zhenyan-zhang-meta marked this pull request as ready for review May 28, 2025 06:50
@zhenyan-zhang-meta
Copy link
Contributor

zhenyan-zhang-meta commented May 28, 2025

@srikary12 I think overall it looks fine, thanks for the contribution! Just turned on the CI for some extra checks.

Meanwhile if you could update the doc it would be great.

@srikary12
Copy link
Author

@zhenyan-zhang-meta we might need to check CI/CD. Not sure if I am missing something. I'll try to add documentation asap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Tokenizer New Type Onboarding
3 participants