[GGUF] Refactor and decouple gguf checkpoint loading logic #34385


Merged: 36 commits merged into huggingface:main from gguf-tensor-map on Jan 6, 2025

Conversation

@Isotr0py (Collaborator) commented Oct 24, 2024

What does this PR do?

Since more and more GGUF architectures are supported, the gguf loading logic (especially the load_gguf_checkpoint function) has become complicated and couples tensor renaming with the tensor "reshape"/"permute" reversal.

Therefore, I think there is a need to refactor the gguf checkpoint loading logic for better maintainability.

This PR refactors and cleans up the gguf model loading logic as follows:

  • Introduce a get_gguf_hf_weights_map function (already used in vLLM) that automatically infers gguf_weights_map (the GGUF_TENSOR_MAPPING we currently hardcode in transformers) from the model config with the gguf package. A rough sketch of the idea follows this list.
  • Since GGUF_TENSOR_MAPPING can be inferred with the function above, we can remove most of the existing hardcoded mappings, and in most cases we no longer need to ask contributors to add new entries to GGUF_TENSOR_MAPPING.
  • Separate the GGUF tensor renaming from the "reshape"/"permute" reversal logic.
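(A rough sketch only, not the final implementation: it assumes the gguf package's MODEL_ARCH_NAMES table and get_tensor_name_map helper, and the function and argument names here are illustrative.)

# Sketch: infer the gguf -> HF weights map from the model itself using the
# name tables that ship with the gguf package, instead of hardcoded mappings.
from gguf import MODEL_ARCH_NAMES, get_tensor_name_map


def get_gguf_hf_weights_map_sketch(hf_model, model_type: str, num_layers: int) -> dict:
    # Find the gguf architecture enum whose name matches the HF model_type.
    arch = next(a for a, name in MODEL_ARCH_NAMES.items() if name == model_type)
    name_map = get_tensor_name_map(arch, num_layers)

    gguf_to_hf_name_map = {}
    for hf_name in hf_model.state_dict().keys():
        # gguf's TensorNameMap maps HF names to gguf names; invert it here.
        gguf_name = name_map.get_name(hf_name, try_suffixes=(".weight", ".bias"))
        if gguf_name is not None:
            gguf_to_hf_name_map[gguf_name] = hf_name
    return gguf_to_hf_name_map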

Since vLLM and transformers have different model weight loading logic, it will take some time for me to figure out a better method to finish this work, so I marked this PR as a draft.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1 (Member):

@SunMarc @MekkCyber for quantization - but let me know if someone else is handling the GGUF integration and I'll ping them instead in future!


@SunMarc (Member) commented Dec 5, 2024

Feel free to ping @Isotr0py and me for anything related to gguf @Rocketknight1 !
Let me know when this is ready to be reviewed @Isotr0py

@MekkCyber (Contributor) commented Dec 5, 2024

@Rocketknight1 I'm diving into the code of the gguf integration to fix some issues, so feel free to ping me as well for related issues and PRs.

@Isotr0py Isotr0py changed the title [WIP] Refactor and decouple gguf checkpoint loading logic [GGUF] Refactor and decouple gguf checkpoint loading logic Dec 10, 2024
@Isotr0py Isotr0py marked this pull request as ready for review December 10, 2024 11:19
@Isotr0py (Collaborator, Author) commented Dec 10, 2024

BTW, since the GGUF tests are very slow, I used a script to do a quick check of the GGUF tensor renaming (which this PR modifies):

import torch
from huggingface_hub import hf_hub_download

from transformers import AutoConfig, AutoModelForCausalLM
from transformers.modeling_gguf_pytorch_utils import load_gguf_checkpoint


model_id = "duyntnet/Qwen1.5-MoE-A2.7B-Chat-imatrix-GGUF"
gguf_model_id = "Qwen1.5-MoE-A2.7B-Chat-IQ1_M.gguf"

config = AutoConfig.from_pretrained(model_id, gguf_file=gguf_model_id)
with torch.device("meta"):
    # Build the model on the meta device to collect the expected parameter names
    # without allocating real weights.
    model = AutoModelForCausalLM.from_config(config)
    expected_keys = set(model.state_dict().keys())

    # Load the GGUF checkpoint and collect the renamed tensor keys it produces.
    gguf_path = hf_hub_download(model_id, filename=gguf_model_id)
    exact_keys = set(load_gguf_checkpoint(gguf_path, True, model)["tensors"].keys())

assert (
    expected_keys == exact_keys
), f"Following parameters are not initialized from GGUF file: {expected_keys - exact_keys}"

@SunMarc (Member) left a comment


LGTM! Thanks for this nice cleanup. It will make maintenance and adding support for new models much easier. Did you check whether the tests pass locally? We can also trigger the slow tests if you prefer.

@MekkCyber (Contributor):

Looks awesome! More readable than it used to be! Thanks for the cleanup 🔥

Comment on lines +300 to +305
if named_children := hf_model.named_children():
    for name, child in named_children:
        sub_map = get_gguf_hf_weights_map(child, model_type, num_layers, qual_name=f"{qual_name}{name}.")
        # Ignore the keys that are already in the main map to avoid overwriting
        sub_map = {k: v for k, v in sub_map.items() if k not in gguf_to_hf_name_map}
        gguf_to_hf_name_map.update(sub_map)
Contributor

Do we need the whole recursion depth or only one layer of depth?

Collaborator (Author)

I think depth=1 might not be enough, because some vision models based on AutoModelForVision2Seq might use AutoModelForCausalLM as the language backbone. If they use BloomForCausalLM as the backbone, this would require depth=2. (Although we won't face this case currently.)

Since there is no significant slowdown in model loading between full depth and depth=1, I think we can keep full depth for code simplicity :)

BTW, about the model loading performance, the root bottleneck is the GGUFReader itself, because its implementation uses numpy memmap for metadata extraction incorrectly, which causes a terrible slowdown. 😅
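(A quick way to see this locally is a rough timing sketch like the one below; gguf_path is a placeholder for any local GGUF file, e.g. the one downloaded by the script above.)

import time

from gguf import GGUFReader

gguf_path = "path/to/any-model.gguf"  # placeholder: any local GGUF file

start = time.perf_counter()
reader = GGUFReader(gguf_path)  # metadata parsing is where most of the time goes
elapsed = time.perf_counter() - start
print(f"GGUFReader init took {elapsed:.2f}s for {len(reader.tensors)} tensors")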

Contributor

Ah okay thanks for the clarifications @Isotr0py 🔥

Contributor

For the GGUFReader, if you think there is a problem in the way it's handled, I think it would be nice to open an issue in the gguf repo 😊

Collaborator (Author)

Yea, I have opened a PR in llama.cpp to improve the performance of GGUFReader: ggml-org/llama.cpp#10159 😄

@Isotr0py (Collaborator, Author):

I have run the slow tests locally for all the architectures we support (except falcon-40b), and most of them pass.

Only Nemotron is failing because its tokenizer fails to initialize; it also fails on the main branch for the same reason. So I think it's not related to this PR, and we should fix it in a follow-up PR.

(Or we should re-organize the GGUF tests, because they're too messy 😅)

@MekkCyber (Contributor):

Yes, the Nemotron integration fails because of the tokenizer; I'm working on it.

@SunMarc SunMarc requested a review from ArthurZucker December 23, 2024 15:59
@ArthurZucker (Collaborator) left a comment

LGTM, sorry for the delay, the Christmas and New Year vacation hit me! 🤗

where N signifies the block number of a layer, and BB signifies the
attention/mlp layer components.
See "Standardized tensor names" in
https://github.com/ggerganov/ggml/blob/master/docs/gguf.md for details.
Collaborator

Suggested change
https://github.com/ggerganov/ggml/blob/master/docs/gguf.md for details.
https://github.com/ggerganov/ggml/blob/master/docs/gguf.md for details. We rely on the `get_tensor_name_map` to get the correct mapping!
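(For readers unfamiliar with that convention, the standardized gguf names pair with HF module names roughly as follows; these are illustrative llama-style examples only, and the exact pairs depend on the architecture.)

# Illustrative gguf -> HF name pairs for a llama-style model (examples only):
GGUF_TO_HF_EXAMPLES = {
    "token_embd.weight": "model.embed_tokens.weight",
    "blk.0.attn_q.weight": "model.layers.0.self_attn.q_proj.weight",
    "blk.0.ffn_down.weight": "model.layers.0.mlp.down_proj.weight",
}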

@ArthurZucker ArthurZucker merged commit 3951da1 into huggingface:main Jan 6, 2025
25 checks passed
AlanPonnachan pushed a commit to AlanPonnachan/transformers that referenced this pull request Jan 7, 2025
[GGUF] Refactor and decouple gguf checkpoint loading logic (huggingface#34385)

* draft load_gguf refactor
* update
* remove llama mapping
* remove qwen2 mapping
* remove unused function
* deprecate stablelm mapping
* deprecate phi3 mapping
* deprecate t5 mapping
* deprecate bloom mapping
* fix bloom
* deprecate starcoder2 mapping
* deprecate gpt2 mapping
* deprecate mistral mapping
* deprecate nemotron mapping
* deprecate mamba mapping
* deprecate mamba mapping
* code format
* code format
* fix mamba
* fix qwen2moe
* remove qwen2moe mapping
* clean up
* remove falcon 7b map
* remove all ggml tensors mapping
* add comments
* update messages
* fix tensors in parsed parameters
* add gguf check

---------

Signed-off-by: Isotr0py <[email protected]>
@Isotr0py Isotr0py deleted the gguf-tensor-map branch January 21, 2025 14:03