Replies: 2 comments
@D0miH if you pass a different image size when creating the model, it is handled here: open_clip/src/open_clip/factory.py, lines 208 to 253 (at commit 49eac2f).
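Something like this (a sketch, assuming your open_clip version exposes `force_image_size` on `create_model_and_transforms` — check against factory.py; the pretrained tag is just an example):

```python
import torch
import open_clip

# Sketch: build a CLIP ViT-B/32 for 320x320 inputs instead of the default 224x224.
# With force_image_size set, the pretrained positional embeddings should be
# resized to the new 10x10 patch grid when the checkpoint is loaded.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32",
    pretrained="openai",
    force_image_size=320,
)
model.eval()

# Dummy 320x320 batch just to confirm the vision tower accepts the new size.
images = torch.randn(2, 3, 320, 320)
with torch.no_grad():
    image_features = model.encode_image(images)
print(image_features.shape)  # torch.Size([2, 512]) for ViT-B/32
```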
@D0miH Have you made any progress? I'm also looking to use high-resolution inputs for CLIP in a video-text retrieval task, but simply applying interpolation doesn't seem to improve performance. Specifically, I used the interpolation method provided here and fed 320x320 frames into CLIP-B/32, but after fine-tuning the performance is still worse than with the original 224x224 inputs. Do you have any suggestions?
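For reference, the interpolation I applied is essentially the usual bicubic resizing of the patch-position embeddings, along these lines (generic PyTorch sketch; the function name and the single-class-token assumption are mine, not from open_clip):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize ViT positional embeddings to a new square patch grid.

    pos_embed: (1 + old_grid**2, dim) tensor whose first row is the class token.
    """
    cls_pos, grid_pos = pos_embed[:1], pos_embed[1:]
    old_grid = int(grid_pos.shape[0] ** 0.5)
    dim = grid_pos.shape[1]
    # (N, dim) -> (1, dim, H, W) so we can interpolate spatially.
    grid_pos = grid_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid_pos = F.interpolate(grid_pos, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    grid_pos = grid_pos.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)
    return torch.cat([cls_pos, grid_pos], dim=0)

# ViT-B/32: 7x7 grid at 224x224 -> 10x10 grid at 320x320.
old = torch.randn(1 + 7 * 7, 768)
new = interpolate_pos_embed(old, new_grid=10)
print(new.shape)  # torch.Size([101, 768])
```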
Hi all,
thank you so much for this awesome library!!
I am using the vision transformer CLIP models and I would like to test some properties with different numbers of patches.
The problem I have now is that there is only a limited number of models available with different patch sizes. For example, there are only the ViT-B/16 and the ViT-B/32 models, which have the same number of trainable parameters but different patch sizes.
Therefore, I would like to emulate different patch sizes by scaling the input images and fine-tuning the model. I know that the ViTs expect input images of size 224x224. However, in the original CLIP paper (sec. 3.2) they use higher-resolution images for fine-tuning. So now to my question: is there a recommended way to fine-tune the CLIP ViTs on higher-resolution inputs (and thus a different number of patches), and how should the positional embeddings be handled in that case?
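For concreteness, the patch-grid arithmetic I have in mind looks like this (plain PyTorch/torchvision sketch; the 448x448 resolution and the preprocessing pipeline are just my own example, not taken from the library):

```python
import torch
from torchvision import transforms

# Patch grid = image_size // patch_size, so scaling the input changes the
# number of patches without touching the architecture.
for image_size, patch_size in [(224, 32), (224, 16), (448, 32)]:
    grid = image_size // patch_size
    print(f"{image_size}px, patch {patch_size} -> {grid}x{grid} = {grid ** 2} patches")
# 224px, patch 32 -> 7x7 = 49 patches
# 224px, patch 16 -> 14x14 = 196 patches
# 448px, patch 32 -> 14x14 = 196 patches (same grid as ViT-B/16 at 224)

# Preprocessing for the scaled inputs (standard CLIP normalization constants).
preprocess_448 = transforms.Compose([
    transforms.Resize(448, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(448),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
```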
Thank you so much for your help!