
Efficient Test-Time Scaling for Small Vision-Language Models

Mehmet Onurcan Kaya1,2, Desmond Elliott3,2, Dim P. Papadopoulos1,2

1 Technical University of Denmark     2 Pioneer Center for AI     3 University of Copenhagen

arXiv: 2510.03574

Our framework consists of two main pipelines. (1) Test-Time Augmentation: given an input image and text prompt, we apply various transformations to create multiple augmented versions. The VLM processes each augmented input to produce a next-token probability distribution, and these distributions are aggregated at the token level to generate the final response. (2) Test-Time Adaptation: we create pseudolabels through test-time augmentation, fine-tune the VLM parameters, and repeat the process. Both methods are effective across nine diverse benchmarks, as shown in (b).
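
The token-level aggregation step can be sketched as follows. This is a minimal illustration under assumed interfaces, not the repository's implementation (which lives in vlmeval/vlm/tta); model.next_token_probs and the augmentation callables are hypothetical placeholders.

# Minimal TTAug sketch: average next-token distributions over augmented views.
# model.next_token_probs(...) is a hypothetical helper standing in for a
# forward pass that returns a probability distribution over the vocabulary.
import numpy as np

def ttaug_generate(model, image, prompt, augmentations, max_new_tokens=64, eos_id=2):
    views = [aug(image) for aug in augmentations]  # augmented versions of the input
    generated = []
    for _ in range(max_new_tokens):
        # One forward pass per view, then aggregate at the token level.
        probs = np.mean(
            [model.next_token_probs(v, prompt, generated) for v in views], axis=0
        )
        token = int(np.argmax(probs))  # greedy pick from the aggregated distribution
        if token == eos_id:
            break
        generated.append(token)
    return generated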

🔎 Abstract

Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
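
The TTAdapt description above amounts to a generate-then-adapt loop. Below is a minimal sketch reusing ttaug_generate from the previous snippet; model.finetune and the loop structure are illustrative assumptions, not the repository's API.

# Minimal TTAdapt sketch: pseudolabel with TTAug, fine-tune, then repeat.
def ttadapt(model, samples, augmentations, rounds=1):
    for _ in range(rounds):
        pseudolabels = []
        for image, prompt in samples:
            # Consensus answer obtained without parameter updates (TTAug).
            answer = ttaug_generate(model, image, prompt, augmentations)
            pseudolabels.append((image, prompt, answer))
        # Adapt the model parameters on its own consensus pseudolabels.
        model.finetune(pseudolabels)
    return model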

🔧 Installation

git clone https://github.com/monurcan/efficient_test_time_scaling.git
cd efficient_test_time_scaling
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt --no-deps
pip install -e . --no-deps

Note that the code has been tested with Python 3.10.12 and CUDA 12.5.
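
As a quick post-install sanity check (assuming PyTorch is pulled in via requirements.txt), you can confirm the Python and CUDA versions visible to the environment:

# sanity_check.py -- assumes torch is among the installed requirements
import sys
import torch

print(sys.version)                             # expected: 3.10.x
print(torch.__version__, torch.version.cuda)   # CUDA build, expected: 12.x
print(torch.cuda.is_available())               # True if a GPU is visible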

💻 Inference: Run an Experiment

bash scripts/benchmark.sh benchmark_configs/test_config.json

This will execute the experiment configuration defined in benchmark_configs/test_config.json.

To customize experiments, refer to the configuration system documentation: docs/en/ConfigSystem.md
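
Before editing a configuration, it can help to inspect the bundled example. The snippet below is a standard-library sketch that assumes nothing about the schema beyond the file being JSON:

# Inspect the example experiment configuration
# (schema documented in docs/en/ConfigSystem.md).
import json
import pprint

with open("benchmark_configs/test_config.json") as f:
    cfg = json.load(f)

pprint.pprint(cfg)  # review the fields, then edit a copy for your own experiment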

Results will be automatically saved to the benchmark_results directory as specified in scripts/benchmark.sh.

🚀 Development

The core logic of our methods is located in vlmeval/vlm/tta.

Utility scripts for analysis and visualization are available in scripts.

🙏 Acknowledgement

This project builds upon VLMEvalKit. For more details, refer to README_VLMEVALKIT.md.

📚 Citation

@article{Kaya2025EfficientTTS,
  title={Efficient Test-Time Scaling for Small Vision-Language Models},
  author={Mehmet Onurcan Kaya and Desmond Elliott and Dim P. Papadopoulos},
  journal={arXiv preprint arXiv:2510.03574},
  year={2025},
  url={https://monurcan.github.io/efficient_test_time_scaling}
}

💬 Contact

For questions, please open an issue or contact me at [email protected]
