Add EoMT Model #37610
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the `Ready for review` button.
Image segmentation model, so cc @qubvel @NielsRogge!
@qubvel A rough draft is ready for inference 🤗. Now I am adding support for training; the original implementation uses mask annealing to determine the probability for the attention masks. Is it required in this HF implementation as well (I don't see it for any other model), or are we fine with having a fixed probability for the attention mask?
Hey @yaswanth19, do you mean the mask probability changes during model training? Would be nice to have, but I'm not sure it can be easily implemented tbh. Maybe we can just add a callback for the Trainer to change it, similar to a learning rate callback? (I mean not adding it to Transformers itself, but to the docs/fine-tuning guide.)
Yup exactly, and adding a trainer callback seems like a good idea. I will check the feasibility of the implementation; if it's simple enough we can implement it in the model itself, else pivot to a trainer callback.
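For reference, a minimal sketch of what such a callback could look like, assuming a hypothetical `attn_mask_probability` attribute on the model (that attribute is not part of the current implementation):

```python
from transformers import TrainerCallback


class MaskAnnealingCallback(TrainerCallback):
    """Sketch: linearly anneal a hypothetical `attn_mask_probability` attribute over training."""

    def __init__(self, start_prob: float = 1.0, end_prob: float = 0.0):
        self.start_prob = start_prob
        self.end_prob = end_prob

    def on_step_begin(self, args, state, control, model=None, **kwargs):
        # Fraction of training completed so far.
        progress = state.global_step / max(state.max_steps, 1)
        prob = self.start_prob + (self.end_prob - self.start_prob) * progress
        if model is not None and hasattr(model, "attn_mask_probability"):
            model.attn_mask_probability = prob
```

This keeps the annealing schedule entirely out of the modeling code, as suggested above.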
@yaswanth19 Thanks for your great work! A few thoughts below. Let me know what you think.
Thanks @tommiekerssies for your initial thoughts.
Yup, will add the complete test suite once I have an implementation ready.
Thanks for bringing this to my attention; I can make corrections to the initialization later on, once we have the end-to-end code ready. IMO, most users don't init from scratch and will either finetune the model or just perform inference. Having said that, I will look at the timm implementation and init in the same way.
Ideally yes 😅 But that's not the library coding standard (we don't want to introduce a hard dependency on timm).
Transformers has a one-model-one-file philosophy, and because of that I have copied the Mask2Former loss completely here. It can be a subjective call with the modular file, in the sense that we could expose the Mask2Former loss and import it here for EoMT (which would require additional changes in Mask2Former), but that can be discussed during reviews with the core maintainer.
Thanks for the clarifications! Regarding mask annealing, I agree that 0 for now is fine. That means effectively disabling masked attention and mask annealing, which is what the current code already does, so no changes are needed on that front. For weight initialization and the ViT backbone, I understand the constraints around using timm. In that case, I’d just make sure that the non-ViT parameters (query embeddings, mask MLP, upscale blocks, class head) aren’t using any custom initializations and instead follow PyTorch defaults. Should be a quick fix. Let me know if you’d like me to look at any part in more detail.
Hi @qubvel, I’m working on refactoring the training logic for EoMT, and I’m running into a design challenge: in the original single-class implementation, they call _predict (which uses the class predictor head) on intermediate layer outputs to build attention masks. Because everything lives in one class, this is straightforward. Refer: https://github.com/tue-mps/eomt/blob/c311b377d3189c976163e4ceb2156d90bb7db88f/models/eomt.py#L130

In our modular HF version, the encoder (EoMTEncoder) only runs the transformer blocks, and _predict (with mask_head, upscale_block, and class_predictor) lives in EoMTModel or EoMTForUniversalSegmentation. That separation means the encoder loop can’t access _predict, so we can’t reconstruct the original training flow. I have two solutions in mind; LMK your thoughts on the approaches below (see the sketch after this list) and suggest any better alternative:

1.) Club all the classes from the encoder onwards into a single class, as in the original implementation.

2.) Move _predict (which includes mask_head and upscale_block) into EoMTEncoder and somehow pass the class_head. Pass these modules into the encoder class so that inside its forward loop it can call _predict, build the attention mask, and feed it into the next block. IMO this is a bit dirty and the flow is tangled 😅.

Here the _predict func is the same code that is used in Mask2Former to get mask_logits and class_logits from the model output.
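For clarity, a rough sketch of what option 2 would look like (hypothetical names, not actual PR code): the mask-prediction module is handed to the encoder so its block loop can build attention masks from intermediate outputs, assuming the blocks accept an `attn_mask` argument.

```python
import torch.nn as nn


class ToyEncoderWithMaskedAttention(nn.Module):
    """Illustration of option 2 only: prediction modules are passed into the encoder."""

    def __init__(self, layers, mask_head, num_masked_blocks):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.mask_head = mask_head
        self.num_masked_blocks = num_masked_blocks

    def forward(self, hidden_states):
        attn_mask = None
        first_masked = len(self.layers) - self.num_masked_blocks
        for idx, layer in enumerate(self.layers):
            if idx >= first_masked:
                # The role `_predict` plays in the original code: build mask logits from the
                # intermediate hidden states and threshold them into an attention mask.
                attn_mask = self.mask_head(hidden_states).sigmoid() > 0.5
            hidden_states = layer(hidden_states, attn_mask=attn_mask)
        return hidden_states
```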
Hi @yaswanth19, great work!

CLASS_MAPPING is used to remap the dataset’s class IDs to a contiguous range without gaps. INSTANCE_MAPPING is specific to ADE20K panoptic, which requires merging semantic and instance labels.

Due to these differences, it might not be ideal to include dataset-specific parsing logic in the HF Transformers codebase. Instead, the preprocessor can expect targets to already be in the expected format.

Regarding padding: it’s not applied twice. The transform pipeline is used during training, where we apply random scale jittering, pad to square, and then crop.

Let me know if anything’s unclear!
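To illustrate the CLASS_MAPPING idea with made-up IDs (not the actual ADE20K tables), remapping sparse class IDs to a contiguous range looks like this:

```python
# Hypothetical example: a dataset uses sparse class IDs {0, 3, 7}; remap them to {0, 1, 2}.
original_ids = [0, 3, 7]
class_mapping = {orig: new for new, orig in enumerate(sorted(original_ids))}
# class_mapping == {0: 0, 3: 1, 7: 2}

segmentation_map = [[7, 3], [0, 7]]
remapped = [[class_mapping[v] for v in row] for row in segmentation_map]
# remapped == [[2, 1], [0, 2]]
```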
Hi @yaswanth19! Is this ready for a review? Feel free to ping me when it is!
@qubvel @NielsRogge The PR is ready for review. I have added some TODOs for myself which are independent items and not high priority; I will do them in parallel with the reviews once I find some more time. @tommiekerssies Please have a look at it if you have some bandwidth. IMO, focus more on the processing class, because that's the module which differs from the original implementation. AFAIK I have standardized the inference pre-processing correctly; LMK if you find any inconsistencies w.r.t. pre- and post-processing.
```python
class EoMTLayer(nn.Module):
    def __init__(self, config: EoMTConfig) -> None:
        super().__init__()
```
This class is very similar to Dinov2WithRegistersLayer, but in the Dino init we have an if condition to determine the MLP, and in the forward method Dino uses a head mask whereas we use an attn_mask across the model. Due to this subtle difference, I had to overwrite this class instead of using modular.
Ok, so it's not possible to leverage the AutoBackbone class as used in DETR or Mask2Former, for example?
IMO, not completely. My understanding is that we can use AutoBackbone like a timm model when we want to keep it as a module, and then do our extra ops/processing on top of the module output. But in this case we are operating directly on the backbone itself, i.e. the Dino/ViT backbone. So IMO the current implementation makes better use of modular and is in line with repo standards.
```python
class LayerNorm2d(nn.LayerNorm):
    def __init__(self, num_channels, eps=1e-6, affine=True):
        super().__init__(num_channels, eps=eps, elementwise_affine=affine)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        hidden_state = hidden_state.permute(0, 2, 3, 1)
        hidden_state = F.layer_norm(hidden_state, self.normalized_shape, self.weight, self.bias, self.eps)
        hidden_state = hidden_state.permute(0, 3, 1, 2)
        return hidden_state
```
I have seen GroupNorm with num_groups=1 used in a few places, but it was not giving equivalent logits, hence I created this layer. LMK if we can move it to some other standard file so other models can use it.
Refer: apple/ml-cvnets#34
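To illustrate the difference: GroupNorm with num_groups=1 computes statistics over (C, H, W) jointly, while this layer normalizes over the channel dimension at each spatial position. A quick standalone check (not tied to the model code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(2, 8, 4, 4)

# GroupNorm with a single group: statistics computed over (C, H, W) together.
gn = nn.GroupNorm(num_groups=1, num_channels=8, affine=False)

# Channel-wise LayerNorm: statistics computed over C for each (h, w) position.
ln_out = F.layer_norm(x.permute(0, 2, 3, 1), (8,)).permute(0, 3, 1, 2)

print(torch.allclose(gn(x), ln_out))  # almost surely False: the two normalizations differ
```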
Indeed, good catch. Let's just rename this to EoMTLayerNorm2d.
```python
# ToDo: How to add gradient checkpointing to the model?
@auto_docstring(
    custom_intro="""
    The EoMT Model with heads on top for instance/semantic/panoptic segmentation.
    """
)
class EoMTForUniversalSegmentation(EoMTPreTrainedModel):
    def __init__(self, config: EoMTConfig) -> None:
```
Normally we would inherit from GradientCheckpointingLayer in the XXXEncoder class, but since here we have a single module from the encoder to the end, with different ops based on the layer number, will it even be possible to add gradient checkpointing 🤔
Hmmm, not sure either. Maybe not a priority, but I'm also interested in knowing if inheriting from GradientCheckpointingLayer in EoMTLayer is an issue because of the manipulations in forward.
Dear @yaswanth19, great work! I’m currently on holiday and will review on Tuesday, June 3rd, when I’m back at work.
Hey @yaswanth19! Thanks a lot for this great work. Main comments on the modeling code are on the use of modular, and on splitting the EoMTForUniversalSegmentation model in two.
Let's also add a fast image processor please :)
```python
config = EoMTConfig()
config.image_size = config_data["image_size"]
config.patch_size = config_data["patch_size"]
config.num_queries = config_data["num_queries"]
config.num_labels = config_data["num_labels"]
config.num_blocks = config_data["num_blocks"]
# With 1e-5 the test_initialization fails hence set it directly in config.
config.layerscale_value = 1e-5

processor = EoMTImageProcessor()
processor.size = {"height": config.image_size, "width": config.image_size}
```
let's set the attributes when instantiating the config/processor, not after
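i.e., roughly (a sketch; assumes the config/processor `__init__` accept these keyword arguments):

```python
# Sketch only: assumes EoMTConfig / EoMTImageProcessor accept these kwargs at init time.
config = EoMTConfig(
    image_size=config_data["image_size"],
    patch_size=config_data["patch_size"],
    num_queries=config_data["num_queries"],
    num_labels=config_data["num_labels"],
    num_blocks=config_data["num_blocks"],
    layerscale_value=1e-5,  # with the default, test_initialization fails, hence set it in the config
)

processor = EoMTImageProcessor(size={"height": config.image_size, "width": config.image_size})
```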
```python
# coding=utf-8
```
Let's add a fast image processor before merging this! More info here: #36978
```python
def scale_image_size(self, image_size: Tuple[int, int], segmentation_type: str) -> Tuple[int, int]:
    """
    Scales image dimensions based on the segmentation type.

    For semantic segmentation, scales up to or exceed the target size.
    For instance or panoptic segmentation, scales down to fit within the target size.

    Args:
        image_size (`Tuple[int, int]`):
            Original image size (height, width).
        segmentation_type (`str`):
            One of "semantic", "instance", or "panoptic".

    Returns:
        `Tuple[int, int]`: Scaled image size (height, width).
    """
    target_h, target_w = self.size["height"], self.size["width"]
    orig_h, orig_w = image_size

    # For semantic segmentation: scale up so that both sides are ≥ target size
    if segmentation_type == "semantic":
        scale_factor = max(target_h / orig_h, target_w / orig_w)
    else:
        scale_factor = min(target_h / orig_h, target_w / orig_w)

    output_h = round(orig_h * scale_factor)
    output_w = round(orig_w * scale_factor)

    return (output_h, output_w)
```
Not a big fan of changing the resize factor depending on the segmentation type. Let's leave the option to the user.
Instead of this function, you can use get_size_with_aspect_ratio and set size to {"shortest_edge": ..., "longest_edge": ...} (see how it's done in image_processing_detr, for example) to get equivalent and more explicit behavior.
We can make it clear in the model cards and in the docs which size dict to set for which task.
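For reference, the shortest/longest-edge behavior looks roughly like this (a self-contained sketch of the logic, not the actual DETR helper):

```python
def resize_with_aspect_ratio(image_size, shortest_edge, longest_edge=None):
    """Scale so the shorter side reaches `shortest_edge`, capping the longer side at `longest_edge`."""
    height, width = image_size
    scale = shortest_edge / min(height, width)
    if longest_edge is not None:
        scale = min(scale, longest_edge / max(height, width))
    return round(height * scale), round(width * scale)


print(resize_with_aspect_ratio((480, 640), shortest_edge=512, longest_edge=2048))  # (512, 683)
```

With a square target, the "semantic" branch above is equivalent to shortest-edge scaling and the "instance/panoptic" branch to longest-edge scaling, so the size dict lets the user pick the behavior explicitly.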
""" | ||
image_size = get_image_size(image) | ||
|
||
output_size = self.scale_image_size(image_size, segmentation_type) |
Let's have the logic with get_size_with_aspect_ratio here instead.
if segmentation_type == "semantic": | ||
for idx, img in enumerate(images): | ||
crops, origins = self._preprocessing_semantic_segmentation(img, idx) | ||
processed_images.extend(crops) | ||
crops_offset.extend(origins) |
Same here, not a fan of forcing the preprocessing depending on the task. Let's have something like a do_split_image bool argument for the preprocess and init functions, and rename _preprocessing_semantic_segmentation to _split_image.
```python
    return hidden_states


class MaskHead(nn.Module):
```
Suggested change:
```diff
-class MaskHead(nn.Module):
+class EoMTMaskHead(nn.Module):
```
main_input_name = "pixel_values" | ||
supports_gradient_checkpointing = False | ||
_no_split_modules = ["EoMTMLP"] | ||
_supports_sdpa = True |
We should also have _supports_flash_attn_2 = True here, I think?
```python
sequence_output = self.layernorm(hidden_states)
if output_hidden_states:
    all_hidden_states += (sequence_output,)
```
Looks like we could still split the model into an EoMTModel and an EoMTForUniversalSegmentation right around here, no? Would make things cleaner, and we could unfold the predict function inside the forward function of EoMTForUniversalSegmentation to be more consistent with other implementations in the library.
```python
@require_torch
class EoMTForUniversalSegmentationIntegrationTest(unittest.TestCase):
```
Let's also have end-to-end integration tests for each task, using the post_process functions from the processor as well.
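Something along these lines, for instance (a sketch: the checkpoint id is a placeholder, and a Mask2Former-style `post_process_semantic_segmentation` method is assumed to exist on the processor):

```python
import unittest

import torch
from PIL import Image

from transformers import EoMTForUniversalSegmentation, EoMTImageProcessor
from transformers.testing_utils import require_torch, slow


@require_torch
@slow
class EoMTEndToEndIntegrationTest(unittest.TestCase):
    def test_semantic_segmentation_post_processing(self):
        checkpoint = "placeholder/eomt-checkpoint"  # placeholder, not a real repo id
        processor = EoMTImageProcessor.from_pretrained(checkpoint)
        model = EoMTForUniversalSegmentation.from_pretrained(checkpoint)

        image = Image.open("tests/fixtures/tests_samples/COCO/000000039769.png")
        inputs = processor(images=image, return_tensors="pt")

        with torch.no_grad():
            outputs = model(**inputs)

        # Assumes a Mask2Former-style post-processing method on the processor.
        segmentation = processor.post_process_semantic_segmentation(
            outputs, target_sizes=[image.size[::-1]]
        )[0]
        self.assertEqual(segmentation.shape, torch.Size(image.size[::-1]))
```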
What does this PR do?
Fixes #37171 and is a continuation of #37392.
This PR adds the EoMT model to transformers, as the title suggests. There are a few differences in this implementation compared to the original implementation:

- In the original implementation, random scale jittering, pad to square, and random crop are used for training preprocessing.

ToDo:

- test_determinism and test_model_outputs_equivalence are failing; spent some time but couldn't debug it, I will try to fix them in parallel to reviews.
- The test_initialization testcase is failing; need to modify the init and overwrite the testcase. Will push the changes along with changes for the above mentioned failing testcases.