🚨 Fix `torch.jit.trace` for `interpolate_pos_encoding` in all vision models #33226
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Failing tests seem unrelated (beam search; nothing to do with this PR). Edit: For 100% backwards compatibility, perhaps we should add some "legacy" config value, and default to the previous version (with
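A minimal sketch of what such a backwards-compatibility switch could look like. The flag name and class here are hypothetical, not taken from the PR:

```python
# Hypothetical sketch: a config flag that preserves the old scale_factor-based
# interpolation for backwards compatibility. The flag name is illustrative.
class VisionConfig:
    def __init__(self, use_legacy_pos_encoding_interpolation=False):
        # False (default): the new size-based interpolation from this PR.
        # True: fall back to the previous scale_factor behaviour.
        self.use_legacy_pos_encoding_interpolation = use_legacy_pos_encoding_interpolation
```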
Thanks for opening this PR, the detailed write-up, and for fixing this across all of these models!
Overall the change looks good and it's great to have this finally fixed ❤️
Two comments:
- I'm not sure about abstracting the logic into a `modeling_vision_utils.py` module. "modeling_vision_utils" isn't a very well-defined utility module: utils have a tendency to gather code-dust, and the line between what is a vision utility versus a common pattern for modeling isn't clear-cut. This is unlike other utilities, e.g. attention masks or RoPE, which are more model-independent and where it's clear what type of objects belong in that file. It's more in line with transformers conventions to update the logic in all of these models using `# Copied from` statements.
- We shouldn't remove the docstrings from the public methods. In this case, as not using
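For reference, the `# Copied from` convention looks roughly like the sketch below. The marker is checked by `make fix-copies`, which keeps the function byte-identical to the referenced source; the path shown is the typical ViT source, and the stub body is illustrative only:

```python
# Copied from transformers.models.vit.modeling_vit.ViTEmbeddings.interpolate_pos_encoding
def interpolate_pos_encoding(embeddings, height, width):
    # The model-specific interpolation logic would live here; this stub just
    # returns its input so the sketch stays self-contained.
    return embeddings
```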
@xenova Hi xenova. I'm highly interested in converting these models and also double-checking the ONNX format inference in
Sure I can do that! I will choose dino as the source for when
Agreed - will add back! @SangbumChoi Using huggingface/optimum#2001, `optimum-cli export onnx --model MODEL_ID_GOES_HERE` should work.
Addressed comments @amyeroberts and tests pass now (it was a flaky failure last time).
Beautiful - thanks for fixing this!
Added 🚨 to PR! Merging!
…models (huggingface#33226)
* Fix `torch.jit.tracing` for `interpolate_pos_encoding` in all vision models
* Apply formatting
* Add missing `self.config = config`
* Fix copies
* Fix hiera interpolation unit test
* Formatting
* Update `_import_structure`
* make style
* Fix docstring
* Use `# Copied from` instead of utils
* DeiT variable renaming (`class_and_dist_pos_embed`)
* Fix Hiera `interpolate_pos_encoding`
Hi @xenova, thanks for making this PR. Would it be possible to also add this fix for the models added in #29261 (i.e.
What does this PR do?
A much-needed overhaul of `interpolate_pos_encoding` to remove Python type casts and use the `size` variant of `torch.nn.functional.interpolate` instead of `scale_factor`, which was error-prone. This is done by abstracting the function into a separate vision utils file to account for class embeddings, upcasting before interpolation, etc. The option to copy this function across files was there, but this is a lot cleaner (imo) and will prevent such issues in future.

This PR has the following benefits:
- Fixes `torch.jit.trace` to support dynamic shapes by avoiding Python type casts and ensuring the correct branch is taken for interpolation. Among other things, this means these vision models can be exported to ONNX with dynamic shapes enabled.
- Removes the `+ 0.1` offset added to prevent precision issues (original issue). This was originally done in dinov1 and then corrected in dinov2 (here), which this implementation takes advantage of.
- Adds the missing `self.config = config` in `SwinEmbeddings`.
- The `+ 0.1` offset, combined with the `scale_factor`, produced slightly off-center values.

Fixes #33181 #32410
Linked PRs in Optimum:
Overview of models
Who can review?
@amyeroberts @NielsRogge @merveenoyan @qubvel