Fine-tune ViT Models on Higher Resolution Images #987

D0miH · 2024-11-12T10:33:28Z

D0miH
Nov 12, 2024

Hi all,
thank you so much for this awesome library!!

I am using the vision transformer CLIP models and I would like to test some properties with different number of patches.
The problem I have now is that there is only a limited number of models available with different patch sizes. For example, there are only the ViT-B/16 and the ViT-B/32 models, which have the same number of trainable parameters but different patch sizes.

Therefore, I would like to emulate different patch sizes by scaling the input images and fine-tuning the model. I know that the ViTs expect input images of size 224x224. However, in the original CLIP paper (sec. 3.2) they are using higher resolution images for fine-tuning.

So now to my question:

Would it be possible to fine-tune the ViT-B/32 model with higher resolution images to emulate an arbitrary smaller batch size?
How can I interpolate the positional embeddings? I have found a function for that here within this repository. But as far as I can see, it is never used anywhere.

Thank you so much for your help!

rwightman · 2024-11-12T16:17:01Z

rwightman
Nov 12, 2024
Maintainer

@D0miH if you pass force_image_size (cmd line --force-image-size) with the new image size to the model factories when you load pretrain weights it will interpolate for you... after that you should create a new config with that image size baked in. The train / eval scripts have that argument too and can be used to evaluate at a different resolution, or to fine-tune at a different resolution.

open_clip/src/open_clip/factory.py

Lines 208 to 253 in 49eac2f

    
           def create_model( 
        
                   model_name: str, 
        
                   pretrained: Optional[str] = None, 
        
                   precision: str = 'fp32', 
        
                   device: Union[str, torch.device] = 'cpu', 
        
                   jit: bool = False, 
        
                   force_quick_gelu: bool = False, 
        
                   force_custom_text: bool = False, 
        
                   force_patch_dropout: Optional[float] = None, 
        
                   force_image_size: Optional[Union[int, Tuple[int, int]]] = None, 
        
                   force_preprocess_cfg: Optional[Dict[str, Any]] = None, 
        
                   pretrained_image: bool = False, 
        
                   pretrained_hf: bool = True, 
        
                   cache_dir: Optional[str] = None, 
        
                   output_dict: Optional[bool] = None, 
        
                   require_pretrained: bool = False, 
        
                   load_weights_only: bool = True, 
        
                   **model_kwargs, 
        
           ): 
        
               """Creates and configures a contrastive vision-language model. 
        
               Args: 
        
                   model_name: Name of the model architecture to create. Can be a local model name 
        
                       or a Hugging Face model ID prefixed with 'hf-hub:'. 
        
                   pretrained: Tag/path for pretrained model weights. Can be: 
        
                       - A pretrained tag name (e.g., 'openai') 
        
                       - A path to local weights 
        
                       - None to initialize with random weights 
        
                   precision: Model precision/AMP configuration. Options: 
        
                       - 'fp32': 32-bit floating point 
        
                       - 'fp16'/'bf16': Mixed precision with FP32 for certain layers 
        
                       - 'pure_fp16'/'pure_bf16': Pure 16-bit precision 
        
                   device: Device to load the model on ('cpu', 'cuda', or torch.device object) 
        
                   jit: If True, JIT compile the model 
        
                   force_quick_gelu: Force use of QuickGELU activation 
        
                   force_custom_text: Force use of custom text encoder 
        
                   force_patch_dropout: Override default patch dropout value 
        
                   force_image_size: Override default image size for vision encoder 
        
                   force_preprocess_cfg: Override default preprocessing configuration 
        
                   pretrained_image: Load pretrained weights for timm vision models 
        
                   pretrained_hf: Load pretrained weights for HF text models when not loading CLIP weights 
        
                   cache_dir: Override default cache directory for downloaded model files 
        
                   output_dict: If True and model supports it, return dictionary of features 
        
                   require_pretrained: Raise error if pretrained weights cannot be loaded 
        
                   load_weights_only: Only deserialize model weights and unpickling torch checkpoints (for safety) 
        
                   **model_kwargs: Additional keyword arguments passed to model constructor

0 replies

RuixiangZhao · 2025-02-28T03:19:04Z

RuixiangZhao
Feb 28, 2025

@D0miH Have you made any progress? I'm also looking to use high-resolution inputs for CLIP in the video-text retrieval task, but simply applying interpolation doesn't seem to improve performance. Specifically, I used the interpolation method provided here, where I input 320x320 frames into CLIP-B/32. However, after fine-tuning, the performance is still worse than the original 224x224 inputs. Do you have any suggestions?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fine-tune ViT Models on Higher Resolution Images #987

{{title}}

Replies: 2 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Fine-tune ViT Models on Higher Resolution Images #987

D0miH Nov 12, 2024

Replies: 2 comments

rwightman Nov 12, 2024 Maintainer

RuixiangZhao Feb 28, 2025

D0miH
Nov 12, 2024

rwightman
Nov 12, 2024
Maintainer

RuixiangZhao
Feb 28, 2025