
HuggingFaceModelTokenizer #2723


Merged (34 commits, Jun 6, 2025)

Conversation

krammnic
Contributor

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

Basically, this is a first pass (I'm thinking about how to add masking), but the Jinja render works surprisingly well.


pytorch-bot bot commented May 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2723

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Merge Blocking SEVs

There is 1 active merge-blocking SEV. Please view it below:

If you must merge, use @pytorchbot merge -f.

✅ No Failures

As of commit 2740df7 with merge base 4ff30ca:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 12, 2025
Contributor

@ebsmothers ebsmothers left a comment

Thanks @krammnic for taking this one on! This will be huge for lowering the barrier to onboard new models. Let's definitely make sure to add unit tests for this one. (You can likely create some dummy tokenizer_config.json files and check them directly into the repo, since they should be pretty small.)

special_tokens_mapping = {}
for token in self.special_tokens:
    special_tokens_mapping[token] = self.base_tokenizer.encode(token)
rendered_template = self.template.render(
Contributor

Wow this actually wound up being quite easy lol

Contributor Author

Unfortunately, tool calling will still be quite tricky


@krammnic Other than the lack of tool calls in the tt Message class, are there any other reasons why tool calling will be tricky?

Contributor Author

Probably not.

if content := token_info.get("content"):
    special_tokens.add(content)

# We sort lexicographically in order to get real tokens after all <|dummy_x|>
Contributor

Sorry I don't fully understand this comment. I assume this is referring to reserved special tokens? If so, why is string sort the thing to use here?

Contributor Author

We can probably drop it; it might just simplify debugging in case we face problems with new configs.

Comment on lines +203 to +207
self.base_tokenizer = HuggingFaceBaseTokenizer(
    tokenizer_json_path=tokenizer_json_path,
    tokenizer_config_json_path=tokenizer_config_json_path,
    generation_config_path=generation_config_path,
)
Contributor

I know @joecummings had some thoughts on whether we should use a generic base_tokenizer instead of constraining to use HuggingFaceBaseTokenizer. I suspect the latter is better for making sure everything works together, but I know at least Qwen2Tokenizer still relies on the merges + vocab files instead of the tokenizer.json file (I alluded to this at the very bottom of #2706). So we should figure out if this will work for that case.

{"role": m.role, "content": m.content[0]["content"]} for m in messages
],
add_generation_prompt=add_eos,
**special_tokens_mapping, # We assume that the naming is consitent
Contributor

Yeah I think this should be a reasonable assumption (as long as we are also getting the special_tokens from the same place as the template)

Contributor

@ebsmothers ebsmothers left a comment

Thanks for your patience! Left a handful of comments. Personally I would just add a unit test now, it'll make it easier to reason about things and help validate that this is giving the expected results.

        self.truncation_type = truncation_type

    def _raise_helper(self, msg):
        raise Exception(msg)
Contributor

Any reason to use this instead of the more specific Jinja Template Error used by HF here?
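
For reference, a minimal sketch of the suggested alternative, assuming we keep the helper as a method and adopt jinja2's TemplateError (the exception type HF's templating raises):

from jinja2.exceptions import TemplateError

def _raise_helper(self, msg):
    # Raised from inside the chat template via the `raise_exception` global;
    # TemplateError is more specific than a bare Exception.
    raise TemplateError(msg)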

    def _raise_helper(self, msg):
        raise Exception(msg)

    def _get_token_from_config(self, config: Dict[str, Any], key: str) -> str:
Contributor

I'm confused by this method. Based on the docstring and implementation it seems like it is being used to get special tokens. But then you are only using it for chat_template, which iiuc is always a string.

    messages=current_messages,
    add_generation_prompt=add_eos if i == len(messages) - 1 else False,
    **special_tokens_mapping,  # We assume that the naming is consistent
    **self.top_level_variables,
Contributor

Noob q: are top-level variables always sufficient? (I.e. is there ever a case where HF templates based on some nested field?)

Contributor Author

Yes. In some configs, for instance, the special token might be bos_id, but the chat_template uses bos, which is redefined as a top-level variable. More generally, templates may reference extra variables, so passing the top-level variables through prevents errors in those cases.
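
A small, self-contained illustration of the case described above (the template string and variable names here are hypothetical):

from jinja2 import Environment

# Hypothetical chat_template that references a top-level variable `bos_token`
# in addition to the per-message fields.
template = Environment().from_string(
    "{{ bos_token }}{% for m in messages %}<|{{ m['role'] }}|>{{ m['content'] }}{% endfor %}"
)

rendered = template.render(
    messages=[{"role": "user", "content": "hi"}],
    bos_token="<|begin_of_text|>",  # passed through as a top-level variable
)
print(rendered)  # <|begin_of_text|><|user|>hi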


rendered = self.template.render(
    messages=current_messages,
    add_generation_prompt=add_eos if i == len(messages) - 1 else False,
Contributor

What are the implications of this? E.g. during finetuning do we actually want to add a generation prompt at the end of all the messages? (I would assume no)

Contributor Author

Good point!

Contributor Author

No, actually it's a bad point: we need to keep this argument, otherwise we will not be able to render.

Contributor

Ah I think you're right, it is a bad point. I think I may have misread this line... my point was that we should only add the generation prompt during inference, but I see that add_eos is basically a proxy for inference.

Contributor

I am a bit confused. Does render expect as inputs an arbitrary set of arguments in the format "token_str": "token_id"? I need to look at the API, but dropping the question here for now.

Comment on lines 257 to 260
# This part is extremely hacky, but we need to handle case where we have variable access with jinja
special_tokens_mapping = {}
for token in self.special_tokens:
    special_tokens_mapping[token] = self.base_tokenizer.encode(token)
Contributor

I don't think I fully understand this comment. I also don't understand why we need to rebuild the special tokens mapping on every invocation of tokenize_messages. (Is that what the comment is referring to?)

Contributor Author

This comment became stale during the changes, good catch.

Comment on lines 284 to 289
if message.masked:
    tokenized_messages.extend([True] * len(delta))
else:
    tokenized_messages.extend(delta)

mask.extend([message.masked] * len(delta))
Contributor

This doesn't seem right to me

Contributor Author

Oops
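
For clarity, a sketch of what the corrected bookkeeping presumably looks like (my reading of the intent, not necessarily the exact fix that landed):

# Always append the new token ids; the mask records, per token, whether the
# originating message should be excluded from the loss.
tokenized_messages.extend(delta)
mask.extend([message.masked] * len(delta))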

tokenized_messages = truncate(
    tokens=tokenized_messages,
    max_seq_len=max_seq_len,
    eos_id=None,
Contributor

Should we also be adding eos here?

Contributor Author

We should
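
A minimal sketch of the suggested change, assuming the base tokenizer exposes an eos_id attribute (an assumption on my part):

tokenized_messages = truncate(
    tokens=tokenized_messages,
    max_seq_len=max_seq_len,
    # Terminate the (possibly truncated) sequence with EOS when add_eos is set.
    eos_id=self.base_tokenizer.eos_id if add_eos else None,
)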

Comment on lines +267 to +270
current_messages = [
    {"role": m.role, "content": m.content[0]["content"]}
    for m in messages[: i + 1]
]
Contributor

This seems a bit strange to me.. in my mind we should either be able to (a) render/tokenize all messages in one shot, or (b) loop over messages one at a time, render, tokenize, and concat. Why do we need to do this "cumulative tokenization"? (Is it because of the difficulties you mentioned with masking? If so I wonder whether there is an alternative)
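
For context, a simplified reading of the cumulative approach under discussion (not the verbatim PR code): on each iteration the template is rendered over the prefix messages[: i + 1], the result is tokenized, and the tokens beyond the previous prefix length are attributed to message i.

prev_len = 0
tokens, mask = [], []
for i, message in enumerate(messages):
    current_messages = [
        {"role": m.role, "content": m.content[0]["content"]}
        for m in messages[: i + 1]
    ]
    rendered = self.template.render(
        messages=current_messages,
        add_generation_prompt=add_eos if i == len(messages) - 1 else False,
        **special_tokens_mapping,
        **self.top_level_variables,
    )
    current_tokens = self.base_tokenizer.encode(rendered, add_eos=False)
    delta = current_tokens[prev_len:]  # tokens contributed by message i
    prev_len = len(current_tokens)
    tokens.extend(delta)
    mask.extend([message.masked] * len(delta))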

@krammnic
Contributor Author

Let me push a unit test and we will iterate on this one more time.

@krammnic
Contributor Author

@ebsmothers Let's iterate

@krammnic
Contributor Author

krammnic commented Jun 3, 2025

@ebsmothers @joecummings fixed CI

Contributor

@ebsmothers ebsmothers left a comment

Thanks for your patience @krammnic. I think there is still some small difference in how the Jinja template gets applied vs our implementations. See e.g. this script. Main differences I observe:

  1. Extra <|begin_of_text|> at the beginning
  2. HuggingFaceModelTokenizer includes the generation prompt, our version does not
  3. The behavior of add_end_tokens seems not to match

(Btw it's possible that some of these are due to choices specific to our Llama3 tokenizer, I mainly want to make sure we are able to demonstrate equivalent results across the HF version and our native torchtune ones.)
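
Roughly, the kind of parity check being described might look like this (a sketch; the model path, import path, and HuggingFaceModelTokenizer argument names are assumptions on my part):

from transformers import AutoTokenizer

from torchtune.data import Message
from torchtune.modules.transforms.tokenizers import HuggingFaceModelTokenizer

MODEL_DIR = "/path/to/hf/model"  # placeholder

conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]

# Reference: Hugging Face renders the chat template and tokenizes it.
hf_tok = AutoTokenizer.from_pretrained(MODEL_DIR)
hf_ids = hf_tok.apply_chat_template(conversation, tokenize=True, add_generation_prompt=False)

# torchtune: same conversation through the new tokenizer.
tt_tok = HuggingFaceModelTokenizer(
    tokenizer_json_path=f"{MODEL_DIR}/tokenizer.json",
    tokenizer_config_json_path=f"{MODEL_DIR}/tokenizer_config.json",
    generation_config_path=f"{MODEL_DIR}/generation_config.json",
)
tt_ids, _ = tt_tok.tokenize_messages(
    [Message(role=m["role"], content=m["content"]) for m in conversation]
)

print("match:", hf_ids == tt_ids)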


TOKENIZER_CONFIG_PATH = ASSETS / "tokenizer_config_gemma.json"
GENERATION_CONFIG_PATH = ASSETS / "generation_config_gemma.json"
TOKENIZER_PATH = ASSETS / "tokenizer_gemma_croped.json"
Contributor

nit (should also update the filename)

Suggested change:
- TOKENIZER_PATH = ASSETS / "tokenizer_gemma_croped.json"
+ TOKENIZER_PATH = ASSETS / "tokenizer_gemma_cropped.json"

Contributor

Also looks like this one is quite large (over 8 MB)? Is there any way we can make a smaller version (e.g. most of our other tokenizer assets are < 1 MB).

Contributor Author

Yeah, let me build one.

    107,
    108,
]
assert mask[:-4] == [False] * 23
Contributor

Noob q: why do we only test up to -4?

Contributor Author

@krammnic krammnic Jun 4, 2025

It might be a bug: our tokenizer version adds an extra "model" role at the end. Why did I do it like this? This behavior seems to be desired by the chat template, but transformers does not add it. Will investigate more and update the unit test.

Comment on lines 42 to 43
    tokenizer_config_json_path: str | None = None,
    generation_config_path: str | None = None,
Contributor

Unfortunately we still run Python 3.9 in our CI so the jobs will fail with this 😕 . You can just use Union for now until we upgrade

Contributor Author

Yeah, I've noticed this; let me update the type hints then.
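
For reference, the Python 3.9-compatible spelling would be along these lines (the surrounding signature is abbreviated):

from typing import Optional

# `str | None` requires Python 3.10+; on 3.9, use Optional (i.e. Union[str, None]).
def __init__(
    self,
    tokenizer_json_path: str,
    tokenizer_config_json_path: Optional[str] = None,
    generation_config_path: Optional[str] = None,
):
    ...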

# It is used sometimes in HF chat_templates
_env.globals["raise_exception"] = self._raise_helper

self.template = _env.from_string(config["chat_template"])
Contributor

There are cases where this field is not present, right? E.g. for google/gemma-2-2b (it's only present in the instruct-tuned version). I think for cases like these we should raise an explicit error here.
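
A sketch of the explicit error being requested, reusing the names from the quoted hunk:

chat_template = config.get("chat_template")
if chat_template is None:
    raise ValueError(
        "tokenizer_config.json does not define a 'chat_template'. "
        "HuggingFaceModelTokenizer requires one; base (non-instruct) models "
        "such as google/gemma-2-2b may not ship a chat template."
    )
self.template = _env.from_string(chat_template)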

Collaborator

I know this is merged, but LLaMA-4 looks like it has a separate file - chat_template.jinja - rather than containing it in the config. This might be another pattern that needs supporting.
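
One possible way to support that layout (a sketch; the fallback filename follows the comment above, and the surrounding variable names are taken from the quoted hunk):

import os

# Prefer the template embedded in tokenizer_config.json; otherwise fall back
# to a standalone chat_template.jinja next to it (the Llama-4-style layout).
chat_template = config.get("chat_template")
if chat_template is None:
    jinja_path = os.path.join(
        os.path.dirname(tokenizer_config_json_path), "chat_template.jinja"
    )
    if os.path.exists(jinja_path):
        with open(jinja_path) as f:
            chat_template = f.read()
if chat_template is None:
    raise ValueError("No chat_template found in tokenizer_config.json or chat_template.jinja")
self.template = _env.from_string(chat_template)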

    **self.top_level_variables,
)

current_tokens = self.base_tokenizer.encode(rendered, add_eos=False)
Contributor

@ebsmothers ebsmothers Jun 5, 2025

Regarding the duplicate bos tokens, I suspect we just need to add an additional case here to deal with the fact that we are often passing a string like "<bos>..." to the base_tokenizer's encode method
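
One possible shape for that extra case (a rough sketch; the bos_id attribute on the base tokenizer is an assumption):

current_tokens = self.base_tokenizer.encode(rendered, add_eos=False)

# If the rendered template already contained the literal BOS text and encode()
# also prepends BOS, we end up with it twice at the start; drop the duplicate.
bos_id = getattr(self.base_tokenizer, "bos_id", None)
if bos_id is not None and current_tokens[:2] == [bos_id, bos_id]:
    current_tokens = current_tokens[1:]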

@ebsmothers
Contributor

In addition to @felipemello1's comments, let's sanity check that we get the same results with this as we get with HF's apply_chat_template. Similar to what I shared yesterday, if we can show a script like this gives the same results for a couple of models this should be good to go. (At least for this script, I get duplicate BOS at the beginning, which should be resolvable via my comment. I also get a newline in the HF one which ours does not have, not sure what that's about..)

Separately, please make sure to use a smaller version of tokenizer.json in your unit test when you get a chance.

@krammnic
Contributor Author

krammnic commented Jun 5, 2025

@ebsmothers I don't like the fact that CI failed only for 3.9 and with:

ValueError: Unable to load checkpoint from /tmp/pytest-of-ec2-user/pytest-0/test_training_state_on_resume_14/recipe_state/recipe_state.pt.

It might be some separate issue...

@krammnic krammnic changed the title [WIP] HuggingFaceModelTokenizer HuggingFaceModelTokenizer Jun 5, 2025
@codecov-commenter

codecov-commenter commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 96.87500% with 3 lines in your changes missing coverage. Please review.

Project coverage is 60.22%. Comparing base (9cb77af) to head (c4b9c35).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...une/modules/transforms/tokenizers/_hf_tokenizer.py 96.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2723      +/-   ##
==========================================
+ Coverage   60.08%   60.22%   +0.14%     
==========================================
  Files         435      435              
  Lines       26742    26828      +86     
==========================================
+ Hits        16067    16157      +90     
+ Misses      10675    10671       -4     


@SalmanMohammadi
Collaborator

@krammnic Did you get a chance to reduce the file size and also minify the tokenizer test assets? The diff should just be one line for those files.

@krammnic
Contributor Author

krammnic commented Jun 5, 2025

@krammnic Did you get a chance to reduce the file size and also minify the tokenizer test assets? The diff should just be one line for those files.

I've done a quick pass and reduced it by about 2x. I think that is sufficient (under 5 MB).

@krammnic
Contributor Author

krammnic commented Jun 5, 2025

I don't think that we've minified any other assets actually

Mark Obozov added 2 commits June 6, 2025 01:59
ebsmothers
ebsmothers previously approved these changes Jun 6, 2025
Contributor

@ebsmothers ebsmothers left a comment

@krammnic thank you for sticking with this one! This will be a huge help towards getting new models in faster.

@ebsmothers ebsmothers closed this Jun 6, 2025
@ebsmothers ebsmothers reopened this Jun 6, 2025
@pytorch-bot pytorch-bot bot dismissed ebsmothers’s stale review June 6, 2025 01:00

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

Contributor

@ebsmothers ebsmothers left a comment

lol i accidentally closed the pr. time to log off for the day

@ebsmothers ebsmothers merged commit 6d2c122 into pytorch:main Jun 6, 2025
22 of 25 checks passed