
Add EoMT Model #37610


Open · wants to merge 50 commits into base: main

Conversation

@yaswanth19 (Contributor) commented Apr 18, 2025

What does this PR do?

Fixes #37171 and continuation of #37392

This PR adds the EoMT model to transformers, as the title suggests. There are a few differences between this implementation and the original one:

  • The original implementation uses a different preprocessing pipeline for training than for inference, as mentioned in one of the comments below. Specifically, training uses random scale jittering, padding to square, and random cropping.

ToDo:

  • A fine-tuning tutorial that supports custom transforms for training and transformers-native processing for inference; alternatively, users can rely on the original implementation for an exact reproduction of the training flow.
  • Two tests are currently failing (skipped for now) 😢, namely test_determinism and test_model_outputs_equivalence. I spent some time on them but couldn't debug the failures; I will try to fix them in parallel with the reviews.
  • The init doesn't follow torch defaults. I tried using the torch default init, but the test_initialization test case fails, so I need to modify the init and override the test case. I will push those changes along with the fixes for the failing tests mentioned above.

@github-actions github-actions bot marked this pull request as draft April 18, 2025 11:13

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@Rocketknight1 (Member)

Image segmentation model so cc @qubvel @NielsRogge!

@yaswanth19 (Contributor Author)

@qubvel A rough draft is ready for inference 🤗. Now I am adding support for training; the original implementation uses mask annealing to determine the probability for the attention masks. Is that required in this HF implementation as well (I don't see it for any other model), or are we fine with a fixed probability for the attention mask?

@qubvel (Member) commented Apr 28, 2025

Hey @yaswanth19, do you mean the mask probability changes during model training? It would be nice to have, but I'm not sure it can be easily implemented tbh. Maybe we can just add a Trainer callback to change it, similar to a learning rate callback? (I mean not adding it to Transformers itself, but to the docs/fine-tuning guide.)
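Something like this rough sketch for the guide, assuming the model exposes a mask_probability attribute (hypothetical name):

from transformers import TrainerCallback

class MaskAnnealingCallback(TrainerCallback):
    """Sketch: linearly anneal the attention-mask probability over training.
    `mask_probability` is a hypothetical attribute on the model."""

    def __init__(self, start_prob: float = 1.0, end_prob: float = 0.0):
        self.start_prob = start_prob
        self.end_prob = end_prob

    def on_step_begin(self, args, state, control, model=None, **kwargs):
        # Fraction of training completed so far
        progress = state.global_step / max(state.max_steps, 1)
        # Interpolate between the start and end probabilities
        model.mask_probability = self.start_prob + (self.end_prob - self.start_prob) * progress

It would then just be passed via Trainer(..., callbacks=[MaskAnnealingCallback()]).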

@yaswanth19 (Contributor Author)

do you mean the mask probability changes during the model training?

Yup, exactly, and adding a Trainer callback seems like a good idea. I will check the feasibility of the implementation; if it's simple enough we can implement it in the model itself, otherwise pivot to a Trainer callback.

@tommiekerssies

@yaswanth19 Thanks for your great work!

A few thoughts:
• Mask annealing: Setting it to a fixed probability of 1 requires masked attention during inference, which we want to avoid. A fixed probability of 0 removes masked attention but hurts performance. If mask annealing isn't feasible, I'd suggest setting it to 0 for now. If you do implement it, note that intermediate mask predictions will be required, so the _predict method might need to move into EoMTEncoder. Just make sure intermediate masks are skipped at inference to avoid slowing things down.
• Testing: It might be good to add an integration test that verifies both the mask and class outputs for semantic, instance, and panoptic segmentation.
• Weight initialization: The current approach likely misinitializes parameters like query embeddings (e.g. std=0.02). All parameters outside the ViT should follow PyTorch defaults (e.g. nn.Embedding uses std=1.0).
• ViT backbone: Would it be simpler to use timm for the ViT? This allows loading pre-trained weights when training EoMT from scratch, avoids unnecessary code, and avoids unsupported use cases like training from a randomly initialized ViT.
• Loss function: Could we reuse the Mask2Former loss? Unless it conflicts with Transformers guidelines, this might reduce redundancy.

Let me know your thoughts.

@yaswanth19 (Contributor Author)

Thanks @tommiekerssies for your initial thoughts.

Mask annealing: Setting it to a fixed probability of 1 requires masked attention during inference, which we want to avoid.

  • I am not sure about the mask annealing implementation's compatibility with transformers natively. As I said above, in the worst case we can set it to 0 or use a Trainer callback if that's feasible.

Testing: It might be good to add an integration test

Yup, I will add the complete test suite once I have an implementation ready.

Weight initialization: The current approach likely misinitializes parameters like query embeddings

Thanks for bringing this to my attention; I can correct the initialization later on once we have the end-to-end code ready. IMO, most users don't init from scratch and will either fine-tune the model or just run inference. Having said that, I will look at the timm implementation and init in the same way.

ViT backbone: Would it be simpler to use timm for the ViT?

Ideally yes 😅 But that's not the library coding standard (we don't want to introduce a hard dependency on timm). Also, using a timm backbone directly would not be compatible with all the other features the HF ecosystem provides, IMO. I am actually referring to both the HF ViT and the timm implementation to get the best of both worlds and avoid introducing any bugs.

Loss function: Could we reuse the Mask2Former loss?

Transformers has a one-model-one-file philosophy, and because of that I have copied the Mask2Former loss completely here. It's a subjective call with a modular file, in the sense that we could expose the Mask2Former loss and import it here for EoMT (which would require additional changes in mask2former), but that can be discussed with the core maintainers during review.
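For reference, with modular it would look roughly like this (a sketch; it assumes the converter can pick up the cross-model import and that mask2former exposes the loss class as-is):

# modular_eomt.py (sketch)
from ..mask2former.modeling_mask2former import Mask2FormerLoss


class EomtLoss(Mask2FormerLoss):
    # Reuse the Mask2Former loss unchanged; the modular converter would copy the
    # implementation into modeling_eomt.py under the EoMT name.
    pass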

@tommiekerssies

Thanks for the clarifications!

Regarding mask annealing, I agree that 0 for now is fine. That means effectively disabling masked attention and mask annealing, which is what the current code already does, so no changes needed on that front.

For weight initialization and the ViT backbone, I understand the constraints around using timm. In that case, I’d just make sure that the non-ViT parameters (query embeddings, mask MLP, upscale blocks, class head) aren’t using any custom initializations and instead follow PyTorch defaults. Should be a quick fix.
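For illustration, something along these lines should do it (a sketch only; the module and attribute names are placeholders for whatever the final model uses):

def _init_weights(self, module):
    # Sketch: only the ViT parts get a custom (truncated-normal) init; everything
    # outside the ViT keeps whatever PyTorch's own constructors already applied
    # (e.g. nn.Embedding ~ N(0, 1), nn.Linear's Kaiming-uniform default).
    if isinstance(module, EoMTPatchEmbeddings):  # placeholder name for the ViT stem
        module.projection.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if module.projection.bias is not None:
            module.projection.bias.data.zero_()
    # No custom branch for the query embeddings, mask MLP, upscale blocks, or class head.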

Let me know if you’d like me to look at any part in more detail.

@yaswanth19 (Contributor Author) commented May 1, 2025

Hi @qubvel, I'm working on refactoring the training logic for EoMT, and I'm running into a design challenge:

In the original single‐class implementation, they call _predict (which uses the class predictor head) on intermediate layer outputs to build attention masks. Because everything lives in one class, this is straightforward.

Refer: https://github.com/tue-mps/eomt/blob/c311b377d3189c976163e4ceb2156d90bb7db88f/models/eomt.py#L130

In our modular HF version, the encoder (EoMTEncoder) only runs the transformer blocks, and _predict (with mask_head, upscale_block, and class_predictor) lives in EoMTModel or EoMTForUniversalSegmentation. That separation means the encoder loop can't access _predict, so we can't reconstruct the original training flow.

I have two solutions in mind, LMK your thoughts on the below approaches and suggest any other better alternative:

1.) Merge everything from EoMTEncoder into EoMTForUniversalSegmentation; this way we can do all the processing in a single forward method.

2.) Move _predict (which uses mask_head and upscale_block) into EoMTEncoder and somehow pass in the class_head. Passing these modules into the encoder class would let its forward loop call _predict, build the attention mask, and feed it into the next block. IMO this is a bit dirty and the flow is tangled 😅.

Here the _predict function is the same code used in Mask2Former to get mask_logits and class_logits from the model output.
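A rough sketch of option 1, just to make the flow concrete (all names here are placeholders, not the final API):

import torch
from torch import nn

def forward_with_intermediate_masks(
    layers: nn.ModuleList,     # the ViT blocks
    embeddings: torch.Tensor,  # patch embeddings, (batch, seq_len, hidden)
    query: nn.Embedding,       # learnable queries
    predict,                   # callable returning (mask_logits, class_logits)
    build_attention_mask,      # callable turning mask_logits into an attention mask
    num_blocks: int,           # number of final blocks that see the queries
    training: bool = True,
):
    """Sketch of option 1: one loop owns both the blocks and the prediction heads,
    so intermediate mask predictions can gate attention."""
    hidden_states = embeddings
    attention_mask = None
    start = len(layers) - num_blocks
    for idx, layer in enumerate(layers):
        if idx == start:
            # Prepend the learnable queries before the final `num_blocks` blocks
            queries = query.weight[None].expand(hidden_states.shape[0], -1, -1)
            hidden_states = torch.cat([queries, hidden_states], dim=1)
        if training and idx >= start:
            # Build the attention mask from intermediate predictions (training only)
            mask_logits, _ = predict(hidden_states)
            attention_mask = build_attention_mask(mask_logits)
        hidden_states = layer(hidden_states, attention_mask)
    return predict(hidden_states)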

@tommiekerssies commented May 19, 2025

Hi @tommiekerssies, sorry for the delay; I was occupied with other personal work. I will fix the initialization part 🤗. The model is ready and the logits match as of now, but the code still needs some refactoring. The image processing is also ready, but right now it doesn't support targets (masks and labels), and regarding that I have a few questions.

Can you explain what CLASS_MAPPING and INSTANCE_MAPPING are in each xx_dataset.py file? It would also be great if you could explain how the mask and class labels are extracted, specifically the logic of the target_parser function. It's quite different for each dataset class, and I'm having a hard time forming a common/general intuition to implement it.

Another question: we pad the image twice, once in the transforms and once again in resize_and_pad_imgs_instance_panoptic. Why is that? 🤔

Hi @yaswanth19, great work!

CLASS_MAPPING is used to remap the dataset’s class IDs to a contiguous range without gaps. INSTANCE_MAPPING is specific to ADE20K panoptic, which requires merging semantic and instance labels: we apply CLASS_MAPPING to the semantic labels (skipping “thing” classes) and INSTANCE_MAPPING to the instance labels (which include “thing” classes).

The target_parser function is dataset-specific. It converts labels from the dataset’s original format into the consistent (masks, labels) format expected by our training/evaluation code. For example, COCO instance annotations use RLE in JSON, while COCO panoptic labels are stored as PNGs. There are also quirks like COCO’s is_crowd flag, which needs special handling (ignored during training, but used for proper AP/PQ evaluation).

Due to these differences, it might not be ideal to include dataset-specific parsing logic in the HF Transformers codebase. Instead, the preprocessor can expect targets to already be in (masks, labels) format, leaving the conversion to users. What do you think?
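For example, the user-side conversion could look roughly like this (illustrative only; the annotation structure here is a made-up stand-in for whatever the user's dataset provides):

import numpy as np

# Illustrative only: dummy per-instance annotations, each with a binary "mask"
# array and an integer "category_id".
annotations = [
    {"mask": np.zeros((4, 4), dtype=np.uint8), "category_id": 1},
    {"mask": np.ones((4, 4), dtype=np.uint8), "category_id": 7},
]
masks = np.stack([ann["mask"] for ann in annotations])          # (num_instances, H, W)
labels = np.array([ann["category_id"] for ann in annotations])  # (num_instances,)
# The image processor would then consume `masks`/`labels` directly, with no
# dataset-specific parsing living inside transformers.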

Regarding padding: it’s not applied twice. The transform pipeline is used during training, where we apply random scale jittering, pad to square, and then crop. The resize_and_pad_imgs_instance_panoptic function is used only during evaluation, where we resize the long side to the input size and pad the short side to make the image square.
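In pseudo-code, the evaluation path is roughly this (a sketch, not the exact original code):

import torch
import torch.nn.functional as F

def resize_and_pad_for_eval(image: torch.Tensor, input_size: int) -> torch.Tensor:
    """Sketch: resize the long side to `input_size`, then pad the short side so
    the result is square. `image` is a (C, H, W) float tensor."""
    _, h, w = image.shape
    scale = input_size / max(h, w)  # long side -> input_size
    new_h, new_w = round(h * scale), round(w * scale)
    image = F.interpolate(image[None], size=(new_h, new_w), mode="bilinear", align_corners=False)[0]
    pad_h, pad_w = input_size - new_h, input_size - new_w
    return F.pad(image, (0, pad_w, 0, pad_h))  # pad right/bottom only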

Let me know if anything’s unclear!

@yonigozlan (Member)

Hi @yaswanth19 ! Is this ready for a review? Feel free to ping me when it is!

@yaswanth19 yaswanth19 marked this pull request as ready for review May 25, 2025 11:31
@yaswanth19 yaswanth19 changed the title [WiP] Add EoMT Model Add EoMT Model May 25, 2025
@yaswanth19 (Contributor Author) commented May 25, 2025

@qubvel @NielsRogge The PR is ready for review. I have added some ToDos for myself which are independent items and not high priority; I will work on them in parallel with the reviews once I find some more time.

@tommiekerssies Please have a look if you have some bandwidth. IMO, focus mostly on the processing class, because that's the module that differs from the original implementation. AFAIK I have standardized the inference pre-processing correctly; LMK if you find any inconsistencies w.r.t. pre- and post-processing.

Comment on lines +804 to +806
class EoMTLayer(nn.Module):
def __init__(self, config: EoMTConfig) -> None:
super().__init__()
@yaswanth19 (Contributor Author), May 25, 2025

This class is very similar to DinoV2WithRegistersLayer, but in the DINO init there is an if condition to determine the MLP, and in the forward method DINO uses a head mask whereas we use an attn_mask throughout the model. Due to these subtle differences, I had to override this class instead of using modular.

Contributor

Ok so it's not possible to leverage the AutoBackbone class as used in DETR, Mask2Former for example?

@yaswanth19 (Contributor Author), May 26, 2025

IMO, not completely. My understanding is that we can use AutoBackbone like a timm model when we want to keep it as a separate module and then do our extra ops/processing on top of the module output. But in this case we are operating directly on the backbone itself, i.e. the DINO/ViT backbone. So IMO the current implementation makes better use of modular and is in line with the repo standards.

Comment on lines +846 to +854
class LayerNorm2d(nn.LayerNorm):
def __init__(self, num_channels, eps=1e-6, affine=True):
super().__init__(num_channels, eps=eps, elementwise_affine=affine)

def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
hidden_state = hidden_state.permute(0, 2, 3, 1)
hidden_state = F.layer_norm(hidden_state, self.normalized_shape, self.weight, self.bias, self.eps)
hidden_state = hidden_state.permute(0, 3, 1, 2)
return hidden_state
Contributor Author

I have seen GroupNorm with num_groups=1 used in a few places, but it was not giving equivalent logits, hence I created this layer. LMK if we should move it to some shared file so other models can use it.

Refer: apple/ml-cvnets#34

Member

Indeed, good catch. Let's just rename this to EoMTLayerNorm2d.

Comment on lines +950 to +957
# ToDo: How to add gradient checkpointing to the model?
@auto_docstring(
custom_intro="""
The EoMT Model with heads on top for instance/semantic/panoptic segmentation.
"""
)
class EoMTForUniversalSegmentation(EoMTPreTrainedModel):
def __init__(self, config: EoMTConfig) -> None:
@yaswanth19 (Contributor Author), May 25, 2025

Normally we would inherit from GradientCheckpointingLayer in the XXXEncoder class, but since here we have a single module from the encoder to the end, with different ops based on the layer number, will it still be possible to add gradient checkpointing? 🤔

Member

Hmmm, not sure either. Maybe not a priority, but I'm also interested in knowing whether inheriting from GradientCheckpointingLayer in EoMTLayer is an issue because of the manipulations in forward.
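If it does turn out to be a problem, a manual fallback with torch.utils.checkpoint should still be possible, roughly like this (sketch only):

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedEncoderLoop(nn.Module):
    """Sketch: manual gradient checkpointing inside an encoder-style loop, as a
    fallback if inheriting from GradientCheckpointingLayer is awkward here."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers
        self.gradient_checkpointing = True

    def forward(self, hidden_states: torch.Tensor, attention_mask=None) -> torch.Tensor:
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                # Recompute this layer's activations during backward to save memory
                hidden_states = checkpoint(layer, hidden_states, attention_mask, use_reentrant=False)
            else:
                hidden_states = layer(hidden_states, attention_mask)
        return hidden_states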

@yaswanth19 yaswanth19 requested a review from tommiekerssies May 25, 2025 13:54
@tommiekerssies

Dear @yaswanth19, great work! I'm currently on holiday and will review on Tuesday, June 3rd, when I'm back at work.

@yonigozlan (Member) left a comment

Hey @yaswanth19! Thanks a lot for this great work. My main comments on the modeling code are on the use of modular and on splitting the EoMTForUniversalSegmentation model in two.
Let's also add a fast image processor please :)

Comment on lines +209 to +219
config = EoMTConfig()
config.image_size = config_data["image_size"]
config.patch_size = config_data["patch_size"]
config.num_queries = config_data["num_queries"]
config.num_labels = config_data["num_labels"]
config.num_blocks = config_data["num_blocks"]
# With 1e-5 the test_initialization fails hence set it directly in config.
config.layerscale_value = 1e-5

processor = EoMTImageProcessor()
processor.size = {"height": config.image_size, "width": config.image_size}
Member

let's set the attributes when instantiating the config/processor, not after
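i.e. roughly (assuming these are all regular EoMTConfig/EoMTImageProcessor init arguments):

config = EoMTConfig(
    image_size=config_data["image_size"],
    patch_size=config_data["patch_size"],
    num_queries=config_data["num_queries"],
    num_labels=config_data["num_labels"],
    num_blocks=config_data["num_blocks"],
    layerscale_value=1e-5,
)
processor = EoMTImageProcessor(size={"height": config.image_size, "width": config.image_size})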

@@ -0,0 +1,694 @@
# coding=utf-8
Member

Let's add a fast image processor before merging this! More info here: #36978

Comment on lines +226 to +254
def scale_image_size(self, image_size: Tuple[int, int], segmentation_type: str) -> Tuple[int, int]:
"""
Scales image dimensions based on the segmentation type.

For semantic segmentation, scales up to or exceed the target size.
For instance or panoptic segmentation, scales down to fit within the target size.

Args:
image_size (`Tuple[int, int]`):
Original image size (height, width).
segmentation_type (`str`):
One of "semantic", "instance", or "panoptic".

Returns:
`Tuple[int, int]`: Scaled image size (height, width).
"""
target_h, target_w = self.size["height"], self.size["width"]
orig_h, orig_w = image_size

# For semantic segmentation: scale up so that both sides are ≥ target size
if segmentation_type == "semantic":
scale_factor = max(target_h / orig_h, target_w / orig_w)
else:
scale_factor = min(target_h / orig_h, target_w / orig_w)

output_h = round(orig_h * scale_factor)
output_w = round(orig_w * scale_factor)

return (output_h, output_w)
Member

Not a big fan of changing the resize factor depending on the segmentation type. Let's leave the option to the user.
Instead of this function, you can use get_size_with_aspect_ratio and set size to {"shortest_edge": ..., "longest_edge": ...} (see how it's done in image_processing_detr, for example) to get behavior that is equivalent to this but more explicit.

We can make it clear in the model cards and in the docs which size dict needs to be set for which task.
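For reference, the DETR-style sizing computes roughly the following (a sketch of the equivalent logic, not the helper itself):

def get_output_size(image_size, shortest_edge, longest_edge):
    """Sketch: scale so the short side reaches `shortest_edge`, but cap the long
    side at `longest_edge` (roughly what get_size_with_aspect_ratio does)."""
    height, width = image_size
    scale = shortest_edge / min(height, width)
    if max(height, width) * scale > longest_edge:
        scale = longest_edge / max(height, width)
    return round(height * scale), round(width * scale)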

"""
image_size = get_image_size(image)

output_size = self.scale_image_size(image_size, segmentation_type)
Member

Let's have the logic with get_size_with_aspect_ratio here instead

Comment on lines +445 to +449
if segmentation_type == "semantic":
for idx, img in enumerate(images):
crops, origins = self._preprocessing_semantic_segmentation(img, idx)
processed_images.extend(crops)
crops_offset.extend(origins)
Member

Same here, not a fan of forcing the preprocessing depending on the task. Let's have something like a do_split_image bool argument to preprocess and init functions, and rename _preprocessing_semantic_segmentation to _split_image

return hidden_states


class MaskHead(nn.Module):
Member

Suggested change
class MaskHead(nn.Module):
class EoMTMaskHead(nn.Module):

main_input_name = "pixel_values"
supports_gradient_checkpointing = False
_no_split_modules = ["EoMTMLP"]
_supports_sdpa = True
Member

We should also have _supports_flash_attn_2 = True here I think?


sequence_output = self.layernorm(hidden_states)
if output_hidden_states:
all_hidden_states += (sequence_output,)

Member

Looks like we could still split the model into an EoMTModel and an EoMTForUniversalSegmentation right around here, no? That would make things cleaner, and we could unfold the predict function inside the forward of EoMTForUniversalSegmentation to be more consistent with other implementations in the library.



@require_torch
class EoMTForUniversalSegmentationIntegrationTest(unittest.TestCase):
Member

Let's also have end-to-end integration tests for each task, using the post_process functions from the processor as well.
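For example, something along these lines for the semantic task (a sketch; the checkpoint, image helper, and post-processing method names are placeholders for whatever the final processor exposes):

def test_semantic_segmentation_inference(self):
    # Sketch only: CHECKPOINT and prepare_img() are hypothetical stand-ins.
    model = EoMTForUniversalSegmentation.from_pretrained(CHECKPOINT).eval()
    processor = EoMTImageProcessor.from_pretrained(CHECKPOINT)
    image = prepare_img()  # a PIL image

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Map the raw outputs back to the original image size and sanity-check the shape
    segmentation = processor.post_process_semantic_segmentation(
        outputs, target_sizes=[image.size[::-1]]
    )[0]
    self.assertEqual(tuple(segmentation.shape), image.size[::-1])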
