
HuggingFaceModelTokenizer #2723


Merged (34 commits, Jun 6, 2025)

Conversation

krammnic
Contributor

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

Basically, this is a first pass (I'm thinking about how to add masking), but the Jinja render works surprisingly well.


pytorch-bot bot commented May 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2723

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Merge Blocking SEVs

There is 1 active merge-blocking SEV. Please view it below:

If you must merge, use @pytorchbot merge -f.

✅ No Failures

As of commit 2740df7 with merge base 4ff30ca:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 12, 2025
Contributor

@ebsmothers ebsmothers left a comment

Thanks @krammnic for taking this one on! This will be huge for lowering the barrier to onboard new models. Let's definitely make sure to add unit tests for this one. (You can likely create some dummy tokenizer_config.json files and check them directly into the repo, since they should be pretty small.)

special_tokens_mapping = {}
for token in self.special_tokens:
    special_tokens_mapping[token] = self.base_tokenizer.encode(token)
rendered_template = self.template.render(
Contributor

Wow this actually wound up being quite easy lol

Contributor Author

Unfortunately, tool calling will still be quite tricky


@krammnic Other than the lack of tool calls in the tt Message class, are there any other reasons why tool calling will be tricky?

Contributor Author

Probably not.

if content := token_info.get("content"):
    special_tokens.add(content)

# We sort lexicographically in order to get real tokens after all <|dummy_x|>
Contributor

Sorry I don't fully understand this comment. I assume this is referring to reserved special tokens? If so, why is string sort the thing to use here?

Contributor Author

We can probably drop it; it might just simplify debugging in case we face problems with new configs.

Comment on lines +203 to +207
self.base_tokenizer = HuggingFaceBaseTokenizer(
    tokenizer_json_path=tokenizer_json_path,
    tokenizer_config_json_path=tokenizer_config_json_path,
    generation_config_path=generation_config_path,
)
Contributor

I know @joecummings had some thoughts on whether we should use a generic base_tokenizer instead of constraining to use HuggingFaceBaseTokenizer. I suspect the latter is better for making sure everything works together, but I know at least Qwen2Tokenizer still relies on the merges + vocab files instead of the tokenizer.json file (I alluded to this at the very bottom of #2706). So we should figure out if this will work for that case.

{"role": m.role, "content": m.content[0]["content"]} for m in messages
],
add_generation_prompt=add_eos,
**special_tokens_mapping, # We assume that the naming is consitent
Contributor

Yeah I think this should be a reasonable assumption (as long as we are also getting the special_tokens from the same place as the template)

Contributor

@ebsmothers ebsmothers left a comment

Thanks for your patience! Left a handful of comments. Personally I would just add a unit test now, it'll make it easier to reason about things and help validate that this is giving the expected results.

        self.truncation_type = truncation_type

    def _raise_helper(self, msg):
        raise Exception(msg)
Contributor

Any reason to use this instead of the more specific Jinja Template Error used by HF here?
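
For reference, a minimal sketch of the suggested alternative, assuming we keep the helper as a method and adopt jinja2's TemplateError (the exception type HF's templating raises):

from jinja2.exceptions import TemplateError

def _raise_helper(self, msg):
    # Raised from inside the chat template via the `raise_exception` global;
    # TemplateError is more specific than a bare Exception.
    raise TemplateError(msg)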

    def _raise_helper(self, msg):
        raise Exception(msg)

    def _get_token_from_config(self, config: Dict[str, Any], key: str) -> str:
Contributor

I'm confused by this method. Based on the docstring and implementation it seems like it is being used to get special tokens. But then you are only using it for chat_template, which iiuc is always a string.

    messages=current_messages,
    add_generation_prompt=add_eos if i == len(messages) - 1 else False,
    **special_tokens_mapping,  # We assume that the naming is consistent
    **self.top_level_variables,
Contributor

Noob q: are top-level variables always sufficient? (I.e. is there ever a case where HF templates based on some nested field?)

Contributor Author

Yes. In some configs, for instance, the special token might be bos_id, but the chat_template uses bos, which is redefined as a top-level variable. More generally, templates may reference extra variables, so passing the top-level variables through prevents errors in those cases.
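
A small, self-contained illustration of the case described above (the template string and variable names here are hypothetical):

from jinja2 import Environment

# Hypothetical chat_template that references a top-level variable `bos_token`
# in addition to the per-message fields.
template = Environment().from_string(
    "{{ bos_token }}{% for m in messages %}<|{{ m['role'] }}|>{{ m['content'] }}{% endfor %}"
)

rendered = template.render(
    messages=[{"role": "user", "content": "hi"}],
    bos_token="<|begin_of_text|>",  # passed through as a top-level variable
)
print(rendered)  # <|begin_of_text|><|user|>hi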


rendered = self.template.render(
    messages=current_messages,
    add_generation_prompt=add_eos if i == len(messages) - 1 else False,
Contributor

What are the implications of this? E.g. during finetuning do we actually want to add a generation prompt at the end of all the messages? (I would assume no)

Contributor Author

Good point!

Contributor Author

No, actually it's a bad point: we need to keep this argument, otherwise we will not be able to render.

Contributor

Ah I think you're right, it is a bad point. I think I may have misread this line... my point was that we should only add the generation prompt during inference, but I see that add_eos is basically a proxy for inference.

Contributor

I am a bit confused. Does render expect as inputs an arbitrary set of arguments in the format "token_str": "token_id"? I need to look at the API, but dropping the question here for now.

Comment on lines 257 to 260
# This part is extremely hacky, but we need to handle case where we have variable access with jinja
special_tokens_mapping = {}
for token in self.special_tokens:
    special_tokens_mapping[token] = self.base_tokenizer.encode(token)
Contributor

I don't think I fully understand this comment. I also don't understand why we need to rebuild the special tokens mapping on every invocation of tokenize_messages. (Is that what the comment is referring to?)

Contributor Author

This comment became stale during the changes, good catch.

Comment on lines 284 to 289
if message.masked:
    tokenized_messages.extend([True] * len(delta))
else:
    tokenized_messages.extend(delta)

mask.extend([message.masked] * len(delta))
Contributor

This doesn't seem right to me

Contributor Author

Oops
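
For clarity, a sketch of what the corrected bookkeeping presumably looks like (my reading of the intent, not necessarily the exact fix that landed):

# Always append the new token ids; the mask records, per token, whether the
# originating message should be excluded from the loss.
tokenized_messages.extend(delta)
mask.extend([message.masked] * len(delta))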

tokenized_messages = truncate(
    tokens=tokenized_messages,
    max_seq_len=max_seq_len,
    eos_id=None,
Contributor

Should we also be adding eos here?

Contributor Author

We should
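
A minimal sketch of the suggested change, assuming the base tokenizer exposes an eos_id attribute (an assumption on my part):

tokenized_messages = truncate(
    tokens=tokenized_messages,
    max_seq_len=max_seq_len,
    # Terminate the (possibly truncated) sequence with EOS when add_eos is set.
    eos_id=self.base_tokenizer.eos_id if add_eos else None,
)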

Comment on lines +267 to +270
current_messages = [
    {"role": m.role, "content": m.content[0]["content"]}
    for m in messages[: i + 1]
]
Contributor

This seems a bit strange to me.. in my mind we should either be able to (a) render/tokenize all messages in one shot, or (b) loop over messages one at a time, render, tokenize, and concat. Why do we need to do this "cumulative tokenization"? (Is it because of the difficulties you mentioned with masking? If so I wonder whether there is an alternative)
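
For context, a simplified reading of the cumulative approach under discussion (not the verbatim PR code): on each iteration the template is rendered over the prefix messages[: i + 1], the result is tokenized, and the tokens beyond the previous prefix length are attributed to message i.

prev_len = 0
tokens, mask = [], []
for i, message in enumerate(messages):
    current_messages = [
        {"role": m.role, "content": m.content[0]["content"]}
        for m in messages[: i + 1]
    ]
    rendered = self.template.render(
        messages=current_messages,
        add_generation_prompt=add_eos if i == len(messages) - 1 else False,
        **special_tokens_mapping,
        **self.top_level_variables,
    )
    current_tokens = self.base_tokenizer.encode(rendered, add_eos=False)
    delta = current_tokens[prev_len:]  # tokens contributed by message i
    prev_len = len(current_tokens)
    tokens.extend(delta)
    mask.extend([message.masked] * len(delta))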

@krammnic
Contributor Author

Let me push a unit test and we will iterate on this one more time.

@krammnic
Contributor Author

@ebsmothers Let's iterate

@krammnic
Contributor Author

krammnic commented Jun 3, 2025

@ebsmothers @joecummings fixed CI

Contributor

@ebsmothers ebsmothers left a comment

Thanks for your patience @krammnic. I think there is still some small difference in how the Jinja template gets applied vs our implementations. See e.g. this script. Main differences I observe:

  1. Extra <|begin_of_text|> at the beginning
  2. HuggingFaceModelTokenizer includes the generation prompt, our version does not
  3. The behavior of add_end_tokens seems not to match

(Btw it's possible that some of these are due to choices specific to our Llama3 tokenizer, I mainly want to make sure we are able to demonstrate equivalent results across the HF version and our native torchtune ones.)
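
Roughly, the kind of parity check being described might look like this (a sketch; the model path, import path, and HuggingFaceModelTokenizer argument names are assumptions on my part):

from transformers import AutoTokenizer

from torchtune.data import Message
from torchtune.modules.transforms.tokenizers import HuggingFaceModelTokenizer

MODEL_DIR = "/path/to/hf/model"  # placeholder

conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]

# Reference: Hugging Face renders the chat template and tokenizes it.
hf_tok = AutoTokenizer.from_pretrained(MODEL_DIR)
hf_ids = hf_tok.apply_chat_template(conversation, tokenize=True, add_generation_prompt=False)

# torchtune: same conversation through the new tokenizer.
tt_tok = HuggingFaceModelTokenizer(
    tokenizer_json_path=f"{MODEL_DIR}/tokenizer.json",
    tokenizer_config_json_path=f"{MODEL_DIR}/tokenizer_config.json",
    generation_config_path=f"{MODEL_DIR}/generation_config.json",
)
tt_ids, _ = tt_tok.tokenize_messages(
    [Message(role=m["role"], content=m["content"]) for m in conversation]
)

print("match:", hf_ids == tt_ids)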


TOKENIZER_CONFIG_PATH = ASSETS / "tokenizer_config_gemma.json"
GENERATION_CONFIG_PATH = ASSETS / "generation_config_gemma.json"
TOKENIZER_PATH = ASSETS / "tokenizer_gemma_croped.json"
Contributor

nit (should also update the filename)

Suggested change:
- TOKENIZER_PATH = ASSETS / "tokenizer_gemma_croped.json"
+ TOKENIZER_PATH = ASSETS / "tokenizer_gemma_cropped.json"

Contributor

Also looks like this one is quite large (over 8 MB)? Is there any way we can make a smaller version (e.g. most of our other tokenizer assets are < 1 MB).

Contributor Author

Yeah, let me build one.

    107,
    108,
]
assert mask[:-4] == [False] * 23
Contributor

Noob q: why do we only test up to -4?

Contributor Author

@krammnic krammnic Jun 4, 2025

It might be a bug: our tokenizer version adds an extra "model" role at the end. Why did I do it like this? This behavior seems to be desired by the chat template, but transformers does not add it. Will investigate more and update the unit test.

Comment on lines 42 to 43
    tokenizer_config_json_path: str | None = None,
    generation_config_path: str | None = None,
Contributor

Unfortunately we still run Python 3.9 in our CI so the jobs will fail with this 😕 . You can just use Union for now until we upgrade

Contributor Author

Yeah, I've noticed this; let me update the type hints then.
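
For reference, the Python 3.9-compatible spelling would be along these lines (the surrounding signature is abbreviated):

from typing import Optional

# `str | None` requires Python 3.10+; on 3.9, use Optional (i.e. Union[str, None]).
def __init__(
    self,
    tokenizer_json_path: str,
    tokenizer_config_json_path: Optional[str] = None,
    generation_config_path: Optional[str] = None,
):
    ...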

# It is used sometimes in HF chat_templates
_env.globals["raise_exception"] = self._raise_helper

self.template = _env.from_string(config["chat_template"])
Contributor

There are cases where this field is not present, right? E.g. for google/gemma-2-2b (it's only present in the instruct-tuned version). I think for cases like these we should raise an explicit error here.
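
A sketch of the explicit error being requested, reusing the names from the quoted hunk:

chat_template = config.get("chat_template")
if chat_template is None:
    raise ValueError(
        "tokenizer_config.json does not define a 'chat_template'. "
        "HuggingFaceModelTokenizer requires one; base (non-instruct) models "
        "such as google/gemma-2-2b may not ship a chat template."
    )
self.template = _env.from_string(chat_template)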

Collaborator

I know this is merged, but LLaMA-4 looks like it has a separate file - chat_template.jinja - rather than containing it in the config. This might be another pattern that needs supporting.
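
One possible way to support that layout (a sketch; the fallback filename follows the comment above, and the surrounding variable names are taken from the quoted hunk):

import os

# Prefer the template embedded in tokenizer_config.json; otherwise fall back
# to a standalone chat_template.jinja next to it (the Llama-4-style layout).
chat_template = config.get("chat_template")
if chat_template is None:
    jinja_path = os.path.join(
        os.path.dirname(tokenizer_config_json_path), "chat_template.jinja"
    )
    if os.path.exists(jinja_path):
        with open(jinja_path) as f:
            chat_template = f.read()
if chat_template is None:
    raise ValueError("No chat_template found in tokenizer_config.json or chat_template.jinja")
self.template = _env.from_string(chat_template)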

    **self.top_level_variables,
)

current_tokens = self.base_tokenizer.encode(rendered, add_eos=False)
Contributor

@ebsmothers ebsmothers Jun 5, 2025

Regarding the duplicate bos tokens, I suspect we just need to add an additional case here to deal with the fact that we are often passing a string like "<bos>..." to the base_tokenizer's encode method
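
One possible shape for that extra case (a rough sketch; the bos_id attribute on the base tokenizer is an assumption):

current_tokens = self.base_tokenizer.encode(rendered, add_eos=False)

# If the rendered template already contained the literal BOS text and encode()
# also prepends BOS, we end up with it twice at the start; drop the duplicate.
bos_id = getattr(self.base_tokenizer, "bos_id", None)
if bos_id is not None and current_tokens[:2] == [bos_id, bos_id]:
    current_tokens = current_tokens[1:]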

@ebsmothers
Contributor

In addition to @felipemello1's comments, let's sanity check that we get the same results with this as we get with HF's apply_chat_template. Similar to what I shared yesterday, if we can show a script like this gives the same results for a couple of models this should be good to go. (At least for this script, I get duplicate BOS at the beginning, which should be resolvable via my comment. I also get a newline in the HF one which ours does not have, not sure what that's about..)

Separately, please make sure to use a smaller version of tokenizer.json in your unit test when you get a chance.

@krammnic
Contributor Author

krammnic commented Jun 5, 2025

@ebsmothers I don't like the fact that CI failed only for 3.9 and with:

ValueError: Unable to load checkpoint from /tmp/pytest-of-ec2-user/pytest-0/test_training_state_on_resume_14/recipe_state/recipe_state.pt.

It might be some separate issue...

@krammnic krammnic changed the title [WIP] HuggingFaceModelTokenizer HuggingFaceModelTokenizer Jun 5, 2025
@codecov-commenter

codecov-commenter commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 96.87500% with 3 lines in your changes missing coverage. Please review.

Project coverage is 60.22%. Comparing base (9cb77af) to head (c4b9c35).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...une/modules/transforms/tokenizers/_hf_tokenizer.py 96.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2723      +/-   ##
==========================================
+ Coverage   60.08%   60.22%   +0.14%     
==========================================
  Files         435      435              
  Lines       26742    26828      +86     
==========================================
+ Hits        16067    16157      +90     
+ Misses      10675    10671       -4     


@SalmanMohammadi
Collaborator

@krammnic Did you get a chance to reduce the file size and also minify the tokenizer test assets? The diff should just be one line for those files.

@krammnic
Contributor Author

krammnic commented Jun 5, 2025

@krammnic Did you get a chance to reduce the file size and also minify the tokenizer test assets? The diff should just be one line for those files.

I've done a quick pass and reduced it by about 2x. I think that is sufficient (under 5 MB).

@krammnic
Contributor Author

krammnic commented Jun 5, 2025

I don't think that we've minified any other assets actually

Mark Obozov added 2 commits June 6, 2025 01:59
ebsmothers
ebsmothers previously approved these changes Jun 6, 2025
Contributor

@ebsmothers ebsmothers left a comment

@krammnic thank you for sticking with this one! This will be a huge help towards getting new models in faster.

@ebsmothers ebsmothers closed this Jun 6, 2025
@ebsmothers ebsmothers reopened this Jun 6, 2025
@pytorch-bot pytorch-bot bot dismissed ebsmothers’s stale review June 6, 2025 01:00

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

Contributor

@ebsmothers ebsmothers left a comment

lol i accidentally closed the pr. time to log off for the day

@ebsmothers ebsmothers merged commit 6d2c122 into pytorch:main Jun 6, 2025
22 of 25 checks passed