Add EoMT Model #37610

Open

wants to merge 66 commits into base: main
Changes from all commits

Commits (66)
189f3e3
Initial Commit
yaswanth19 Apr 18, 2025
76afcf3
up
yaswanth19 Apr 18, 2025
3923765
More changes
yaswanth19 Apr 18, 2025
1b0f0ed
up
yaswanth19 Apr 19, 2025
f54a124
Only mask_logits mismatch
yaswanth19 Apr 19, 2025
53df51d
close enough logits debug later
yaswanth19 Apr 19, 2025
357d876
fixes
yaswanth19 Apr 19, 2025
6c027f9
format
yaswanth19 Apr 23, 2025
ad99e71
Add dummy loss
yaswanth19 Apr 25, 2025
ecb2523
Close enough processing for semantic seg
yaswanth19 Apr 25, 2025
2a750fe
Merge branch 'main' into add-eomt-model
yaswanth19 Apr 25, 2025
f27314e
nit
yaswanth19 Apr 26, 2025
71dcdff
Added panoptic postprocessor
yaswanth19 Apr 26, 2025
de111b6
refactor
yaswanth19 Apr 27, 2025
a8d87b5
refactor
yaswanth19 Apr 27, 2025
32cce82
finally fixed panoptic postprocessor
yaswanth19 Apr 27, 2025
438fb61
temp update
yaswanth19 Apr 30, 2025
4946015
Merge branch 'main' into add-eomt-model
yaswanth19 May 9, 2025
4a26429
Refactor ForUniversalSegmentation class
yaswanth19 May 9, 2025
01c7a58
nits and config update
yaswanth19 May 9, 2025
82ca6d3
Few fixes and inference matches
yaswanth19 May 10, 2025
a85b651
change mapping
yaswanth19 May 10, 2025
39e97e9
Added training support but loss slightly off 🥲
yaswanth19 May 10, 2025
6dd5fb3
Loss is matching 😀
yaswanth19 May 10, 2025
ff0fce9
update
yaswanth19 May 10, 2025
95753b9
Initial tests skelton
yaswanth19 May 10, 2025
39b57a1
changes
yaswanth19 May 18, 2025
9ef1006
tests update
yaswanth19 May 18, 2025
51f252a
more modular
yaswanth19 May 18, 2025
704c4c3
initial tests
yaswanth19 May 18, 2025
007c983
updates
yaswanth19 May 23, 2025
c24f561
better docstrings
yaswanth19 May 24, 2025
62a7518
changes
yaswanth19 May 24, 2025
ad78453
proc tests passing :)
yaswanth19 May 24, 2025
d83b925
Image processor update
yaswanth19 May 24, 2025
bfbf492
tiny change
yaswanth19 May 24, 2025
688cbab
Merge branch 'main' into add-eomt-model
yaswanth19 May 24, 2025
d4ceeb1
QOL changes
yaswanth19 May 24, 2025
ab78616
Update test w.r.t latest attn refactor
yaswanth19 May 24, 2025
94f5267
repo-consistency fixes
yaswanth19 May 25, 2025
e27d5fb
up
yaswanth19 May 25, 2025
7d44f3b
Image proc fix and integration tests :)
yaswanth19 May 25, 2025
06911f8
docs update
yaswanth19 May 25, 2025
87a6842
integration tests
yaswanth19 May 25, 2025
63f6795
fix
yaswanth19 May 25, 2025
7e9195e
docs update 🥰
yaswanth19 May 25, 2025
ac5214f
minor fix
yaswanth19 May 25, 2025
629a0d2
Merge branch 'main' into add-eomt-model
yaswanth19 May 25, 2025
d55e651
Happy CI
yaswanth19 May 25, 2025
d173d5f
fix
yaswanth19 May 25, 2025
76541a7
obvious refactoring
yaswanth19 May 30, 2025
7f27e69
Merge branch 'main' into add-eomt-model
yaswanth19 May 31, 2025
481ce93
refactoring w.r.t review
yaswanth19 May 31, 2025
19d4a4d
Add fask image proc skelton
yaswanth19 May 31, 2025
894ca84
Fast Image proc and cleanups
yaswanth19 Jun 1, 2025
ff504fd
Use more modular
yaswanth19 Jun 1, 2025
bb27ba2
tests update
yaswanth19 Jun 1, 2025
eaf7b88
Add more tests
yaswanth19 Jun 1, 2025
21c6a47
Nit
yaswanth19 Jun 1, 2025
19f1d93
QOL updates
yaswanth19 Jun 1, 2025
86492e8
change init_weights to torch default
yaswanth19 Jun 1, 2025
ac0b67d
add eager func coz of make style
yaswanth19 Jun 1, 2025
0d899f2
up
yaswanth19 Jun 1, 2025
5f0c82e
changes
yaswanth19 Jun 1, 2025
c2f0be1
typo fix
yaswanth19 Jun 1, 2025
2b2eb50
Merge branch 'main' into add-eomt-model
yaswanth19 Jun 2, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -737,6 +737,8 @@
title: EfficientFormer
- local: model_doc/efficientnet
title: EfficientNet
- local: model_doc/eomt
title: EoMT
- local: model_doc/focalnet
title: FocalNet
- local: model_doc/glpn
211 changes: 211 additions & 0 deletions docs/source/en/model_doc/eomt.md
@@ -0,0 +1,211 @@
<!--Copyright 2025 Mobile Perception Systems Lab at TU/e and The HuggingFace Inc. team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# EoMT

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper [Your ViT is Secretly an Image Segmentation Model](https://www.tue-mps.org/eomt) by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus.
EoMT shows that plain Vision Transformers can perform image segmentation efficiently without task-specific components.

The abstract from the paper is the following:

*Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity.*

This model was contributed by [Yaswanth Gali](https://huggingface.co/yaswanthgali).
The original code can be found [here](https://github.com/tue-mps/eomt).

## Architecture Info

The `EoMT` model uses a DINOv2-pretrained Vision Transformer with **register tokens** as its backbone. EoMT simplifies the segmentation pipeline by relying solely on the encoder, eliminating the need for task-specific decoders commonly used in prior approaches.

Architecturally, EoMT introduces a small set of **learned queries** and a lightweight **mask prediction module**. These queries are injected into the final encoder blocks, enabling **joint attention** between image patches and object queries. During training, **masked attention** is applied to constrain each query to focus on its corresponding region—effectively mimicking cross-attention. This constraint is gradually phased out via a **mask annealing strategy**, allowing for **efficient, decoder-free inference** without compromising segmentation performance.
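
The sketch below is only meant to illustrate the query-injection idea: learned queries are concatenated to the patch tokens so plain self-attention in the final blocks mixes patches and queries jointly. It is a minimal, hypothetical stand-in (class name, sizes, and layer choices are assumptions), not the actual `EoMTForUniversalSegmentation` implementation.

```python
import torch
import torch.nn as nn


class JointAttentionBlocks(nn.Module):
    """Toy stand-in for the final encoder blocks of EoMT (illustrative only)."""

    def __init__(self, hidden_size=1024, num_queries=100, num_blocks=4, num_heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_size))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)
            for _ in range(num_blocks)
        )

    def forward(self, patch_tokens):  # patch_tokens: (batch, num_patches, hidden_size)
        queries = self.queries.expand(patch_tokens.shape[0], -1, -1)
        tokens = torch.cat([queries, patch_tokens], dim=1)
        for block in self.blocks:
            # During training, a per-query attention mask restricts each query to its
            # predicted region and is annealed away, so inference needs no mask at all.
            tokens = block(tokens)
        # The query tokens (first num_queries positions) feed the mask prediction module.
        return tokens[:, : self.queries.shape[0]]
```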

<div style="text-align: center;">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/eomt_architecture.png"
alt="drawing" width="500"/>
</div>


The model supports semantic, instance, and panoptic segmentation using a unified architecture and task-specific post-processing.

## Usage Examples

Use the Hugging Face implementation of EoMT for inference with pre-trained models.

### Semantic Segmentation

The EoMT model performs semantic segmentation using sliding-window inference. The input image is resized so that the shorter side matches the target input size and is then split into overlapping crops. Each crop is passed through the model. After inference, the predicted logits from each crop are stitched back together and rescaled to the original image size to produce the final segmentation mask.
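
As a rough mental model of the stitching step, each crop's logits can be added into a full-size canvas at its offset and the overlapping regions averaged. The function below is an illustrative sketch under assumed names and shapes, not the library's internal code.

```python
import torch


def stitch_crop_logits(crop_logits, crop_offsets, canvas_height, canvas_width):
    """Average overlapping per-crop logits back into one full-size logit map (illustrative only)."""
    num_classes = crop_logits[0].shape[0]
    canvas = torch.zeros(num_classes, canvas_height, canvas_width)
    counts = torch.zeros(1, canvas_height, canvas_width)
    for logits, (top, left) in zip(crop_logits, crop_offsets):
        height, width = logits.shape[-2:]
        canvas[:, top : top + height, left : left + width] += logits
        counts[:, top : top + height, left : left + width] += 1
    return canvas / counts.clamp(min=1)
```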

> **Note:**
> If you want to use a custom target size for **semantic segmentation**, specify it in the following format:
> `{"shortest_edge": 512}`
> Omitting `longest_edge` here is intentional: for semantic segmentation, images are typically **scaled so that the shortest edge is greater than or equal to the target size**, so `longest_edge` is not needed. A short override example follows the full code block below.

```python
import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image

from transformers import EoMTForUniversalSegmentation, EoMTImageProcessor


model_id = "yaswanthgali/ade20k_semantic_eomt_large_512-hf"
processor = EoMTImageProcessor.from_pretrained(model_id)
model = EoMTForUniversalSegmentation.from_pretrained(model_id)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Preprocess the image
inputs = processor(
    images=image,
    return_tensors="pt",
)

# Pop the patch offsets from the inputs; they are only needed later for post-processing.
patch_offsets = inputs.pop("patch_offsets")

with torch.inference_mode():
    outputs = model(**inputs)

# Prepare the original image size in the format (height, width)
original_image_sizes = [(image.height, image.width)]

# Post-process the model outputs to get the final segmentation prediction
preds = processor.post_process_semantic_segmentation(
    outputs,
    patch_offsets=patch_offsets,
    original_image_sizes=original_image_sizes,
)

# Visualize the segmentation mask
plt.imshow(preds[0])
plt.axis("off")
plt.title("Semantic Segmentation")
plt.show()

```
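
If you do need a custom target size (see the note above), it can likely be overridden at preprocessing time. A minimal sketch, assuming the processor accepts a `size` argument at call time like most Transformers image processors, and reusing the `processor` and `image` objects from the example above:

```python
inputs = processor(
    images=image,
    size={"shortest_edge": 512},
    return_tensors="pt",
)
```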

### Instance Segmentation

The EoMT model performs instance segmentation using padded inference. The input image is resized so that the longer side matches the target input size, and the shorter side is zero-padded to form a square. The image is then normalized and passed through the model. Post-processing is similar to panoptic segmentation, except that stuff classes are removed (i.e., only thing classes are kept).

> **Note:**
> To use a custom target size, specify the size as a dictionary in the following format:
> `{"shortest_edge": 512, "longest_edge": 512}`
> For both instance and panoptic segmentation, input images will be **scaled down** and padded to this target size; a short override example follows the code block below.

```python
import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image

from transformers import EoMTForUniversalSegmentation, EoMTImageProcessor


model_id = "yaswanthgali/coco_instance_eomt_large_640-hf"
processor = EoMTImageProcessor.from_pretrained(model_id)
model = EoMTForUniversalSegmentation.from_pretrained(model_id)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Preprocess the image
inputs = processor(
    images=image,
    return_tensors="pt",
)

with torch.inference_mode():
    outputs = model(**inputs)

# Prepare the original image size in the format (height, width)
original_image_sizes = [(image.height, image.width)]

# Post-process the model outputs to get the final segmentation prediction
preds = processor.post_process_instance_segmentation(
    outputs,
    original_image_sizes=original_image_sizes,
    stuff_classes=[0, 45],  # Stuff class ids to exclude, keeping only thing instances.
)

# Visualize the segmentation mask
plt.imshow(preds[0]["segmentation"])
plt.axis("off")
plt.title("Instance Segmentation")
plt.show()
```
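
For a custom square target size (see the note above), the same call-time override presumably applies; again an assumption rather than a documented guarantee, reusing `processor` and `image` from the example above:

```python
inputs = processor(
    images=image,
    size={"shortest_edge": 512, "longest_edge": 512},
    return_tensors="pt",
)
```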

### Panoptic Segmentation

The EoMT model performs panoptic segmentation using the same padded inference strategy as instance segmentation. After padding and normalization, the model predicts both thing (instance) and stuff (amorphous region) classes. The resulting mask and class logits are combined through post-processing (adapted from Mask2Former) to produce a unified panoptic segmentation map, along with segment metadata such as segment ids, class labels, and confidence scores.

```python
import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image

from transformers import EoMTForUniversalSegmentation, EoMTImageProcessor


model_id = "yaswanthgali/coco_panoptic_eomt_large_640-hf"
processor = EoMTImageProcessor.from_pretrained(model_id)
model = EoMTForUniversalSegmentation.from_pretrained(model_id)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Preprocess the image
inputs = processor(
    images=image,
    return_tensors="pt",
)

with torch.inference_mode():
    outputs = model(**inputs)

# Prepare the original image size in the format (height, width)
original_image_sizes = [(image.height, image.width)]

# Post-process the model outputs to get the final segmentation prediction
preds = processor.post_process_panoptic_segmentation(
    outputs,
    original_image_sizes=original_image_sizes,
)

# Visualize the panoptic segmentation mask
plt.imshow(preds[0]["segmentation"])
plt.axis("off")
plt.title("Panoptic Segmentation")
plt.show()
```
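
To inspect the per-segment metadata mentioned above, something along these lines should work. This is a sketch that assumes the output dictionaries follow the Mask2Former convention (a `segments_info` list whose entries carry `id`, `label_id`, and `score`) and that the checkpoint config provides an `id2label` mapping; the exact keys may differ.

```python
for segment in preds[0]["segments_info"]:
    label = model.config.id2label[segment["label_id"]]
    print(f"segment {segment['id']}: {label} (score={segment['score']:.2f})")
```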

## EoMTImageProcessor

[[autodoc]] EoMTImageProcessor

## EoMTImageProcessorFast

[[autodoc]] EoMTImageProcessorFast

## EoMTConfig

[[autodoc]] EoMTConfig

## EoMTForUniversalSegmentation

[[autodoc]] EoMTForUniversalSegmentation
- forward
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -119,6 +119,7 @@
("emu3", "Emu3Config"),
("encodec", "EncodecConfig"),
("encoder-decoder", "EncoderDecoderConfig"),
("eomt", "EoMTConfig"),
("ernie", "ErnieConfig"),
("ernie_m", "ErnieMConfig"),
("esm", "EsmConfig"),
@@ -483,6 +484,7 @@
("emu3", "Emu3"),
("encodec", "EnCodec"),
("encoder-decoder", "Encoder decoder"),
("eomt", "EoMT"),
("ernie", "ERNIE"),
("ernie_m", "ErnieM"),
("esm", "ESM"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -84,6 +84,7 @@
("dpt", ("DPTImageProcessor",)),
("efficientformer", ("EfficientFormerImageProcessor",)),
("efficientnet", ("EfficientNetImageProcessor", "EfficientNetImageProcessorFast")),
("eomt", ("EomtImageProcessor", "EoMTImageProcessorFast")),
("flava", ("FlavaImageProcessor", "FlavaImageProcessorFast")),
("focalnet", ("BitImageProcessor", "BitImageProcessorFast")),
("fuyu", ("FuyuImageProcessor",)),
1 change: 1 addition & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -830,6 +830,7 @@
[
# Model for Universal Segmentation mapping
("detr", "DetrForSegmentation"),
("eomt", "EoMTForUniversalSegmentation"),
("mask2former", "Mask2FormerForUniversalSegmentation"),
("maskformer", "MaskFormerForInstanceSegmentation"),
("oneformer", "OneFormerForUniversalSegmentation"),
29 changes: 29 additions & 0 deletions src/transformers/models/eomt/__init__.py
@@ -0,0 +1,29 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_eomt import *
    from .image_processing_eomt import *
    from .image_processing_eomt_fast import *
    from .modeling_eomt import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)