feats(transformers): add deepseek_v3 model #1415
base: master
Conversation
Summary of Changes
Hello @iugoood, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request expands the library's model support by adding the deepseek_v3 model.
Code Review
This pull request adds support for the deepseek_v3 model, which appears to be based on the DeepSeek-V2 architecture. The changes are well-structured, including updates to auto-configuration classes, the model implementation, and a new test suite. My review focuses on the new model implementation in modeling_deepseek_v3.py. I've identified a few key areas for improvement, primarily concerning performance. The Mixture-of-Experts (MoE) layer and the interleaved rotary position embeddings are currently implemented in a way that could lead to significant performance bottlenecks. Additionally, the example code in the model's docstring uses an invalid model identifier, which should be corrected to ensure it's useful for users.
def moe(self, hidden_states: mindspore.Tensor, topk_indices: mindspore.Tensor, topk_weights: mindspore.Tensor):
    r"""
    CALL FOR CONTRIBUTION! I don't have time to optimise this right now, but expert weights need to be fused
    to not have to do a loop here (deepseek has 256 experts soooo yeah).
    """
    final_hidden_states = mindspore.mint.zeros_like(hidden_states, dtype=topk_weights.dtype)
    expert_mask = mindspore.mint.nn.functional.one_hot(topk_indices, num_classes=len(self.experts))
    expert_mask = expert_mask.permute(2, 0, 1)

    for expert_idx in range(len(self.experts)):
        expert = self.experts[expert_idx]
        mask = expert_mask[expert_idx]
        token_indices, weight_indices = mindspore.mint.where(mask)

        if token_indices.numel() > 0:
            expert_weights = topk_weights[token_indices, weight_indices]
            expert_input = hidden_states[token_indices]
            expert_output = expert(expert_input)
            weighted_output = expert_output * expert_weights.unsqueeze(-1)
            final_hidden_states.index_add_(0, token_indices, weighted_output)

    # in the original deepseek, the outputs of the experts are gathered once we leave this module
    # thus the moe module is itself an IsolatedParallel module
    # and all experts are "local", meaning we shard but we don't gather
    return final_hidden_states.type(hidden_states.dtype)
The current implementation of the moe method iterates over each expert in a Python for loop. As acknowledged by the "CALL FOR CONTRIBUTION" comment, this is inefficient and will be a significant performance bottleneck, especially with a large number of experts. For better performance, this loop should be replaced with vectorized operations. A common optimization pattern for MoE layers is to use gather and scatter operations to group tokens by their assigned expert and then perform computations in a batched manner.
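For illustration, below is a minimal sketch of the gather-based grouping pattern the comment describes. It assumes `mindspore.mint` mirrors the corresponding torch operators (`argsort`, `index_add_`) and that the model runs in PyNative mode; the name `moe_grouped` and the exact operator set are assumptions, not part of this PR.

```python
# Hedged sketch, not part of this PR: group routing slots by expert with one
# argsort-based gather, so each expert runs a single dense matmul on a
# contiguous chunk instead of a one_hot/where scan over all tokens.
def moe_grouped(self, hidden_states, topk_indices, topk_weights):
    top_k = topk_indices.shape[-1]
    flat_expert = topk_indices.reshape(-1)                # (num_tokens * top_k,)
    order = mindspore.mint.argsort(flat_expert)           # slots grouped by expert id
    token_of_slot = order // top_k                        # token each sorted slot belongs to
    sorted_expert = flat_expert[order]
    sorted_weight = topk_weights.reshape(-1)[order]

    gathered = hidden_states[token_of_slot]               # one batched gather for all experts
    out = mindspore.mint.zeros_like(hidden_states, dtype=topk_weights.dtype)

    start = 0
    for expert_idx in range(len(self.experts)):
        count = int((sorted_expert == expert_idx).sum())  # host-side count (PyNative assumption)
        if count == 0:
            continue
        chunk = gathered[start:start + count]
        expert_out = self.experts[expert_idx](chunk)
        weighted = expert_out * sorted_weight[start:start + count].unsqueeze(-1)
        out.index_add_(0, token_of_slot[start:start + count], weighted)
        start += count
    return out.type(hidden_states.dtype)
```

A fully loop-free variant would additionally require stacking the expert weights so the per-expert matmuls collapse into one grouped matmul, which is what the "CALL FOR CONTRIBUTION" comment in the code alludes to.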
def apply_rotary_pos_emb_interleave(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    r"""
    TODO let's just use the original freqcis computation to not have the view
    transpose + reshape! This is not optimized!
    Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`mindspore.Tensor`): The query tensor.
        k (`mindspore.Tensor`): The key tensor.
        cos (`mindspore.Tensor`): The cosine part of the rotary embedding.
        sin (`mindspore.Tensor`): The sine part of the rotary embedding.
        position_ids (`mindspore.Tensor`):
            The position indices of the tokens corresponding to the query and key tensors. For example, this can be
            used to pass offsetted position ids when working with a KV-cache.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
    Returns:
        `tuple(mindspore.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)

    b, h, s, d = q.shape
    q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

    b, h, s, d = k.shape
    k = k.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
The apply_rotary_pos_emb_interleave function, as noted by the TODO comment, is not optimized. It uses several view, transpose, and reshape operations to reorder the query and key tensors before applying rotary embeddings. These data layout transformations are computationally expensive and should be avoided. Consider refactoring this function to compute the rotary embeddings more directly on the interleaved layout to improve performance.
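As a sketch of the more direct formulation, the same rotation can be read straight from the interleaved layout with two strided slices, assuming cos and sin use the usual duplicated-half rotary layout (cos = cat([cos_half, cos_half], dim=-1)). Whether this is actually faster than the view/transpose/reshape path depends on how the backend handles strided slicing; the function name below is illustrative only.

```python
# Hedged sketch, not part of this PR: produces the same output as
# apply_rotary_pos_emb_interleave but reads the interleaved pairs directly,
# assuming cos/sin duplicate their first half along the last dimension.
def apply_rotary_pos_emb_interleave_direct(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    half = q.shape[-1] // 2
    cos_h, sin_h = cos[..., :half], sin[..., :half]

    def rotate(x):
        x_even, x_odd = x[..., 0::2], x[..., 1::2]       # interleaved pairs (x_2i, x_2i+1)
        return mindspore.mint.cat(
            (x_even * cos_h - x_odd * sin_h,             # first half of the de-interleaved output
             x_odd * cos_h + x_even * sin_h),            # second half
            dim=-1,
        )

    return rotate(q), rotate(k)
```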
>>> model = DeepseekV3ForCausalLM.from_pretrained("meta-deepseek_v3/DeepseekV3-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-deepseek_v3/DeepseekV3-2-7b-hf")
The example in the DeepseekV3ForCausalLM docstring uses the model identifier "meta-deepseek_v3/DeepseekV3-2-7b-hf", which does not exist on the Hugging Face Hub and seems to be a result of a copy-paste error. This makes the example unusable and confusing for users. Please update it to use a valid public model identifier for a DeepSeek model, for example, one from the deepseek-ai organization.
Suggested change:
- >>> model = DeepseekV3ForCausalLM.from_pretrained("meta-deepseek_v3/DeepseekV3-2-7b-hf")
- >>> tokenizer = AutoTokenizer.from_pretrained("meta-deepseek_v3/DeepseekV3-2-7b-hf")
+ >>> model = DeepseekV3ForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V2-Lite-Base")
+ >>> tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Lite-Base")
The branch was force-pushed from 35188da to 9de9b8d.
Add
1. add deepseek_v3 model
2. add UT
P.S.: Quantized weights cannot be validated.
Usage
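No usage snippet is included in the description, so here is a minimal sketch based on the docstring example in the diff. The import path is a placeholder for this repository's transformers package, the checkpoint mirrors the reviewer's suggestion, and return_tensors="ms" is assumed to be the MindSpore tensor key; adjust all three to the actual API.

```python
# Hedged usage sketch; import path, checkpoint, and return_tensors key are assumptions.
from transformers import AutoTokenizer, DeepseekV3ForCausalLM  # placeholder import path

model_id = "deepseek-ai/DeepSeek-V2-Lite-Base"  # checkpoint from the review suggestion
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = DeepseekV3ForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hey, are you conscious? Can you talk to me?", return_tensors="ms")
outputs = model.generate(inputs.input_ids, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```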
Performance
Experiments were run on Ascend Atlas 800T A2 machines with MindSpore 2.6.0.