Extract intermediate features from SigLIP2 naflex vision encoder #1045
@santimontiel yes, forward_intermediates is the way to grab intermediate features, and they come back in 2D NCHW format by default, which is convenient for downstream dense pixel features. I will do another release soon so that's available in a PyPI version; I've been told by others using the main branch that it's working well for them.

For the NaFlex part, I'm working through a NaFlex impl across timm/open_clip ... the model side isn't too hard, but it's a different model since a number of details around the input projection, masking, and especially the pos embed and 2D feature conversion for intermediates need special handling. The bigger piece is the dataloading changes, which require a separate input pipeline for keeping aspect ratio and switching image sizes (target seq len) on the fly. I have a prelim version running, but some issues remain with the multi-process dataloader and coordinating the seq_len switching. It might take a while to test and finalize (I'm on vacay next week too)...
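For reference, here is a minimal sketch of what that looks like with timm's forward_intermediates(). The model name and index choice are just examples (check timm.list_models('*siglip*') for the SigLIP2 weights you actually want), and you may need to install from the main branch until the release mentioned above is on PyPI:

```python
import timm
import torch

# Example SigLIP2 vision tower; the exact name/tag is an assumption,
# verify with timm.list_models('*siglip*', pretrained=True).
model = timm.create_model(
    'vit_base_patch16_siglip_256.v2_webli',
    pretrained=True,
    num_classes=0,
)
model.eval()

x = torch.randn(1, 3, 256, 256)  # dummy image batch

with torch.no_grad():
    # intermediates_only=True returns just the list of feature maps;
    # output_fmt='NCHW' (the default) gives 2D spatial maps for dense tasks.
    feats = model.forward_intermediates(
        x,
        indices=3,  # last 3 blocks, as an example
        output_fmt='NCHW',
        intermediates_only=True,
    )

for f in feats:
    print(f.shape)  # e.g. torch.Size([1, 768, 16, 16]) for a /16 patch base model
```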
Hi, I'm interested in extracting CLIP-like features for high-resolution images, so I thought the new SigLIP2 would be a good fit for me. With that in mind, I have a couple of questions:

1. Can forward_intermediates be used to extract intermediate (dense) features from the SigLIP2 vision encoder?
2. Are the NaFlex variants supported in timm, or is support planned?
Thanks in advance!
Best,
Santi