Extract intermediate features from SigLIP2 naflex vision encoder #1045
@santimontiel yes, forward_intermediates is the way to grab intermediate features, and they come back in 2D NCHW format by default, which is convenient for downstream dense pixel features. I will do another release soon so that's available in a PyPI version; I've been told by others using the main branch that it's working well for them.

For the NaFlex part, I'm working through a NaFlex impl across timm/open_clip ... the model side isn't too hard, but it's a different model since a number of details around the input projection, masking, and especially the pos embed and 2D feature conversion for intermediates need special handling. The bigger piece is the dataloading changes, which require a separate input pipeline for keeping aspect ratio and switching image sizes (target seq len) on the fly. I have a prelim version running, but some issues remain with the multi-process dataloader and coordinating the seq_len switching. It might take a while to test and finalize (I'm on vacay next week too)...
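For reference, here is a minimal sketch of what that looks like with timm's forward_intermediates(). The model name and index choice are just examples (check timm.list_models('*siglip*') for the SigLIP2 weights you actually want), and you may need to install from the main branch until the release mentioned above is on PyPI:

```python
import timm
import torch

# Example SigLIP2 vision tower; the exact name/tag is an assumption,
# verify with timm.list_models('*siglip*', pretrained=True).
model = timm.create_model(
    'vit_base_patch16_siglip_256.v2_webli',
    pretrained=True,
    num_classes=0,
)
model.eval()

x = torch.randn(1, 3, 256, 256)  # dummy image batch

with torch.no_grad():
    # intermediates_only=True returns just the list of feature maps;
    # output_fmt='NCHW' (the default) gives 2D spatial maps for dense tasks.
    feats = model.forward_intermediates(
        x,
        indices=3,  # last 3 blocks, as an example
        output_fmt='NCHW',
        intermediates_only=True,
    )

for f in feats:
    print(f.shape)  # e.g. torch.Size([1, 768, 16, 16]) for a /16 patch base model
```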
Hi, I'm interested in extracting CLIP-like features for high-resolution images, so I thought the new SigLIP2 would be a good fit for me. With that in mind, I have a couple of questions:

1. Can forward_intermediates be used to extract intermediate (dense) features from the SigLIP2 vision encoder?
2. Are the NaFlex variants supported in timm, or is support planned?
Thanks in advance!
Best,
Santi