Generated GPT_OSS model files through porter script. #2384
base: master
Conversation
/gemini review
Code Review
This pull request introduces a new GPT-OSS model, seemingly generated by a script. While the overall structure aligns with the repository's conventions, there are several critical issues that need to be addressed. The weight conversion scripts in both utils/transformers and tools/checkpoint_conversion are non-functional due to incorrect layer and weight names, likely a result of being copied from another model's implementation without proper adaptation. Additionally, there's a critical code duplication issue with the GptOssCausalLMPreprocessor class being defined in two separate files. Several style guide violations were also found, including the use of type hints in function signatures, incorrect docstring formatting, and missing docstrings or examples in several new components. These issues must be resolved to ensure the model is correct, maintainable, and usable.
/gemini review
Code Review
This pull request introduces the GPT_OSS model, including its backbone, causal language model task, preprocessor, and tokenizer. The overall structure aligns well with the project's conventions. However, there are several areas that need attention.
Most critically, the Hugging Face weight conversion script in convert_gpt_oss.py is incorrect and incompatible with the implemented model architecture, particularly for the Mixture-of-Experts (MoE) layers. This will prevent loading pre-trained weights correctly. I've provided a corrected implementation for the convert_weights function.
Additionally, there are multiple instances of incorrect docstring formatting across the new files, which violates the project's style guide. I've pointed these out with suggestions for correction. A minor but important clarification is also needed in a test file comment to improve maintainability.
/gemini review
Code Review
This pull request introduces a new GptOss model, including its backbone, causal language model, preprocessor, and associated tests and conversion scripts. The code is script-generated, and this review focuses on its quality, correctness, and adherence to the project's style guide.
The model definition and tests are generally well-structured. However, there are a few issues in the GptOssBackbone implementation regarding initialization and configuration that need to be addressed. The most critical issue lies in the Hugging Face conversion script (convert_gpt_oss.py), which appears to contain significant copy-paste errors from another model's converter. This will prevent correct weight loading and needs a substantial revision. My detailed comments provide specific suggestions to fix these issues.
@laxmareddyp does it work? Looks like it probably doesn't even output valid code yet, right? Going by the test output. The overall code style looks OK, so this will come down to how accurate the code is. I would view scripts like this as a developer tool rather than an automated workflow. So maybe the thing to do here is to get this code working end to end. Then you'll have some first-hand experience of how useful this was at saving time. If there are tons of hard-to-find errors in the code, this might be slower than doing it yourself. If it's pretty accurate, maybe it is saving time and worth putting forward as a tool for contributors to use.
/gemini review
Code Review
This pull request introduces a new GptOss model, which is a significant contribution. The code is mostly well-structured and follows the project's conventions. However, I've found a few critical issues, primarily related to the tokenizer usage in tests and a bug in the weight conversion logic for the MoE layers. There are also some inconsistencies in the checkpoint conversion script and a minor point of confusion in a test file's comments. Addressing these points will be crucial for the model's correctness and maintainability.
keras_hub/src/models/gpt_oss/gpt_oss_causal_lm_preprocessor_test.py
/gemini review
Code Review
This pull request introduces a new GPT-OSS model, which is a GPT-style transformer with Mixture of Experts, generated via a script. The changes are comprehensive, including the backbone, causal LM task, preprocessor, tokenizer, and a conversion script from Hugging Face format. Additionally, the RotaryEmbedding layer has been enhanced to support YaRN scaling.
My review confirms that the new model implementation generally follows the repository's established patterns and style guide. However, I've identified a few areas for improvement. Specifically, I've noted a potential issue in the RotaryEmbedding layer's state management and opportunities to improve code clarity. I also found that several docstrings are missing required usage examples as mandated by the repository's style guide. The rest of the implementation, including the new model architecture and tests, appears solid.
```python
# Store original axis values for validation
self._original_sequence_axis = sequence_axis
self._original_feature_axis = feature_axis
```
The `self.built = True` statement was removed from `__init__`. For layers that have no weights to create in a `build()` method, it's important to set `self.built = True` at the end of `__init__` to indicate to the framework that the layer is already built. Please add it back to ensure correct layer state.
```python
self._original_feature_axis = feature_axis
self.built = True
```
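To illustrate the pattern the comment describes, a minimal sketch with a hypothetical weightless layer (not code from this PR):

```python
import keras

class ScaleByTwo(keras.layers.Layer):
    """Hypothetical layer that owns no weights."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Nothing to create in build(), so mark the layer as already
        # built to keep its state consistent for the framework.
        self.built = True

    def call(self, inputs):
        return inputs * 2.0
```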
```python
# Find correction dimensions for beta_fast and beta_slow
def find_correction_dim_tensor(
    num_rotations, dim, base_val, max_pos
):
    return (
        dim
        * ops.log(max_pos / (num_rotations * 2 * 3.141592653589793))
    ) / (2 * ops.log(base_val))
```
The value of pi is hardcoded here. It's better to use `math.pi` for precision and readability; you'll need to add `import math` at the top of the file.
Additionally, the nested function `find_correction_dim_tensor` does not depend on any instance state and could be defined as a static method on the class, or as a helper function outside the class, for better code organization and to avoid potential JIT compilation issues.
```python
def find_correction_dim_tensor(
    num_rotations, dim, base_val, max_pos
):
    return (
        dim
        * ops.log(max_pos / (num_rotations * 2 * math.pi))
    ) / (2 * ops.log(base_val))
```

```python
"""A cached attention layer with sliding window and sink tokens.

This layer implements the attention mechanism described in the GPT-OSS
paper. It includes grouped-query attention, rotary position embeddings,
sliding window attention, and sink tokens for improved performance on
long sequences.

Args:
    num_query_heads (int): The number of query attention heads.
    num_key_value_heads (int): The number of key and value attention
        heads.
    rope_max_wavelength (int, optional): The maximum wavelength for the
        rotary position embedding. Defaults to 10000.
    rope_scaling_factor (float, optional): The scaling factor for the
        rotary position embedding. Defaults to 1.0.
    kernel_initializer (str, optional): The initializer for the kernel
        weights. Defaults to "glorot_uniform".
    sliding_window (int, optional): The size of the sliding window.
        Defaults to 4096.
    dropout (float, optional): The dropout rate. Defaults to 0.
    head_dim (int, optional): Head dimension for attention. If None,
        calculated as hidden_dim // num_query_heads. Defaults to None.
"""
```
| """An end-to-end GptOss model for causal language modeling. | ||
| A causal language model (LM) predicts the next token based on previous | ||
| tokens. This task setup can be used to train the model unsupervised on | ||
| plain text input, or to autoregressively generate plain text similar to | ||
| the data used for training. This task can be used for pre-training or | ||
| fine-tuning a GptOss model, simply by calling `fit()`. | ||
| This model has a `generate()` method, which generates text based on a | ||
| prompt. The generation strategy used is controlled by an additional | ||
| `sampler` argument on `compile()`. You can recompile the model with | ||
| different `keras_hub.samplers` objects to control the generation. By | ||
| default, `"top_k"` sampling will be used. | ||
| Args: | ||
| backbone: A `keras_hub.models.GptOssBackbone` instance. | ||
| preprocessor: A `keras_hub.models.GptOssCausalLMPreprocessor` or | ||
| `None`. If `None`, this model will not apply preprocessing, and | ||
| inputs should be preprocessed before calling the model. | ||
| """ |
| """A GptOss tokenizer using BytePair encoding. | ||
| Tokenizer is a subclass of `keras_hub.tokenizers.BytePairTokenizer`. | ||
| It uses a BytePair encoding model to tokenize strings. It also adds special | ||
| tokens for the start and end of a sequence. | ||
| Args: | ||
| vocabulary: string or dict, maps token to integer ids. If it is a | ||
| string, it should be the file path to a json file. | ||
| merges: string or list, contains the merge rule. If it is a string, | ||
| it should be the file path to merge rules. | ||
| """ |
```diff
@@ -0,0 +1,351 @@
# Copyright 2024 The KerasHub Authors
```
remove copyright banner
```python
long sequences.

Args:
    num_query_heads (int): The number of query attention heads.
```
Follow the type hint format used in other files in the repo:
`arg_name: type_hint. Description.`
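That is, something like the following (names are illustrative):

```python
"""
Args:
    num_query_heads: int. The number of query attention heads.
    dropout: float. The dropout rate. Defaults to `0.0`.
"""
```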
```python
# b = batch size
# q = query length
# k = key/value length
# m = model dim
```
what is model dim?
```python
key = self.rotary_embedding_layer(key, start_index=start_index)
return key, value

if cache is not None:
```
Cache logic for KerasHub is located in the causal_lm file. Example:

```python
def _build_cache(self, token_ids):
```
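A hedged sketch of that pattern, with `ops` meaning `keras.ops` and the backbone attribute names assumed from similar KerasHub models rather than verified against this PR:

```python
def _build_cache(self, token_ids):
    """Build an empty cache and seed it with one pass over the prompt."""
    batch_size = ops.shape(token_ids)[0]
    max_length = ops.shape(token_ids)[1]
    num_layers = self.backbone.num_layers
    num_key_value_heads = self.backbone.num_key_value_heads
    head_dim = self.backbone.hidden_dim // self.backbone.num_query_heads
    shape = [
        batch_size, num_layers, 2, max_length, num_key_value_heads, head_dim
    ]
    cache = ops.zeros(shape, dtype=self.compute_dtype)
    # Seed the cache by running the full prompt through the backbone.
    hidden_states, cache = self.call_with_cache(token_ids, cache, 0)
    return hidden_states, cache
```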
```diff
@@ -0,0 +1,230 @@
# Copyright 2024 The KerasHub Authors
```
remove banner here and everywhere
```python
constructor.

Args:
    vocabulary_size (int): The size of the token vocabulary.
```
fix the type hint format, refer to the comment above
```python
    calculation of roatary embedding. Defaults to `1.0`.
layer_norm_epsilon (float, optional): Epsilon for the layer
    normalization layers in the transformer decoder. Defaults to `1e-6`.
sliding_window (int, optional): The sliding window for the attention
```
the order of args listed in the docstring should match `__init__`
````python
Examples:
```python
import tensorflow as tf
````
why tf?
```python
class GptOssCausalLMTest(TestCase):
    def setUp(self):
        # Define vocabulary and merges inline like GPT-2 tests
        self.vocab = ["!", "air", "Ġair", "plane", "Ġat", "port"]
```
use test vocab file like - Link
From the PR description, I don't see the generate output matching. @laxmareddyp do you need to update the result?
I have looked at only part of the code so far and made some comments.
Based on my observations, below are some common points that need addressing:
- Remove all the copyright text at the start of each file.
- Remove obvious comments on each piece of code logic; only add comments where necessary, such as for complex logic or a workaround.
- Avoid adding magic numbers; handle them better.
Will add more comments after I review the other files.
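On the magic-numbers point, an illustrative before/after using the `1e-6` literal from the YaRN snippet reviewed below:

```python
# Before: the threshold is an unexplained inline literal.
s_safe = ops.maximum(scaling_factor, ops.cast(1e-6, self.compute_dtype))

# After: a named constant documents the intent.
MIN_SCALING_FACTOR = 1e-6
s_safe = ops.maximum(
    scaling_factor, ops.cast(MIN_SCALING_FACTOR, self.compute_dtype)
)
```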
| """GPT-OSS preset configurations.""" | ||
|
|
||
| backbone_presets = { | ||
| "gpt_oss_8_7b_en": { | ||
| "metadata": { | ||
| "description": ( | ||
| "32-layer GPT-OSS MoE model with 7 billion " | ||
| "active parameters and 8 experts per MoE layer." | ||
| ), | ||
| "params": 46702792704, | ||
| "path": "gpt_oss", | ||
| }, | ||
| "kaggle_handle": "kaggle://keras/gpt_oss/keras/gpt_oss_8_7b_en/1", | ||
| }, | ||
| "gpt_oss_instruct_8_7b_en": { | ||
| "metadata": { | ||
| "description": ( | ||
| "Instruction fine-tuned 32-layer GPT-OSS MoE model " | ||
| "with 7 billion active parameters and 8 experts per MoE layer." | ||
| ), | ||
| "params": 46702792704, | ||
| "path": "gpt_oss", | ||
| }, | ||
| "kaggle_handle": ( | ||
| "kaggle://keras/gpt_oss/keras/gpt_oss_instruct_8_7b_en/1" | ||
| ), | ||
| }, | ||
| } |
Comment this part out until we have the checkpoints uploaded.
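Once uploaded, these presets would presumably be loadable like:

```python
import keras_hub

backbone = keras_hub.models.GptOssBackbone.from_preset("gpt_oss_8_7b_en")
```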
```python
# Store original axis values for validation
self._original_sequence_axis = sequence_axis
self._original_feature_axis = feature_axis
```
Change this back to

```python
self.sequence_axis = sequence_axis
self.feature_axis = feature_axis
```

to avoid confusion with the previous implementation, and add `self.built = True`.
```python
self.original_max_position_embeddings = original_max_position_embeddings
self.truncate = truncate

# Store original axis values for validation
```
Remove this comment
```python
if feature_axis < 0:
    feature_axis += rank

# Validate axis indices
```
remove this comment
```python
)

def call(self, inputs, start_index=0, positions=None):
    # Normalize and validate axes
```
Remove this comment
```python
input_shape = ops.shape(inputs)
sequence_axis, feature_axis = self._normalize_axes(input_shape)

# Validate rotary dimension
```
Remove this comment
```python
rotary_dim = input_shape[feature_axis]
self._validate_rotary_dimension(rotary_dim)

# Take care of unbatched `positions`.
```
Remove this comment
```python
return sequence_axis, feature_axis

def _validate_rotary_dimension(self, rotary_dim):
    """Validate that rotary dimension is even and handle odd dimensions."""
```
remove this comment
```python
# rotary_dim should be half of the last
# feature axis (HF-style: rotate pairs)
rotary_dim = ops.shape(inputs)[feature_axis]
# Validate evenness
try:
    # best-effort check when running eagerly;
    # if unavailable this will be a no-op
    if int(rotary_dim) % 2 != 0:
        raise ValueError(
            "Rotary embedding requires even feature "
            "dimension (last axis)."
        )
except Exception:
    pass

# Get inverse frequencies using the appropriate
# scaling method (linear, dynamic, yarn, etc.)
```
Let's not use a try/except block; can't we just use the `_validate_rotary_dimension` function?
Also, remove the comments.
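A sketch of the suggested direction (assuming the feature dimension is statically known at trace time):

```python
def _validate_rotary_dimension(self, rotary_dim):
    """Raise if the rotary feature dimension is odd."""
    if rotary_dim is not None and rotary_dim % 2 != 0:
        raise ValueError(
            "Rotary embedding requires an even feature dimension "
            f"(last axis). Received: rotary_dim={rotary_dim}"
        )
```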
```python
# YaRN temperature scaling: implement in tensor ops
if self.rope_type == "yarn":
    # t = (0.1 * ln(s) + 1)^2
    # make sure s > 0
    small = ops.cast(1e-6, self.compute_dtype)
    s_safe = ops.maximum(
        ops.cast(self.scaling_factor, self.compute_dtype), small
    )
    t = ops.square(
        ops.add(
            ops.multiply(
                ops.cast(0.1, self.compute_dtype), ops.log(s_safe)
            ),
            ops.cast(1.0, self.compute_dtype),
        )
    )
    sqrt_t = ops.sqrt(t)

    # HF/YaRN descriptions indicate a temperature
    # scaling applied to cos/sin embeddings, equivalently
    # scaling the logits. We implement the sqrt scaling on cos/sin.
    cos_emb = cos_emb * sqrt_t
    sin_emb = sin_emb * sqrt_t
```
Avoid variable names like `t`, `small`, etc.; give meaningful variable names.
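For instance, the block above could read like this (same math, just renamed; a sketch):

```python
epsilon = ops.cast(1e-6, self.compute_dtype)
safe_scaling_factor = ops.maximum(
    ops.cast(self.scaling_factor, self.compute_dtype), epsilon
)
# attention_temperature = (0.1 * ln(s) + 1) ** 2
attention_temperature = ops.square(
    0.1 * ops.log(safe_scaling_factor) + 1.0
)
temperature_scale = ops.sqrt(attention_temperature)
cos_emb = cos_emb * temperature_scale
sin_emb = sin_emb * temperature_scale
```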
@divyashreepathihalli @mattdangerw @abheesht17 Could you please check and provide your feedback on the quality of this code generated through the script?
I estimate that 80-85% of the code matches: the backbone files import successfully, and it's possible to instantiate a backbone model. There were still some errors, which might be alleviated with a stronger model.
The converter and weight conversion scripts are still in development. Generating a workable solution is complex because it requires giving the model a comprehensive understanding of the entire architectural layout, so it can handle the intricate dependencies between the model's layers and weights.
Output match:
Generated response (gpt_oss_safeguard_20b_en):
Generated response (gpt_oss_20b_en):
Checklist