When Open-Vocabulary Visual Question Answering Meets Causal Adapter: Benchmark and Approach (accepted to AAAI 2025)
While the VQA community has achieved substantial progress, the benchmarks employed thus far have predominantly adhered to the closed-set setting. As depicted in figure (a) below, these benchmarks rely on a predefined set of candidate answers (e.g., baseball, skyblue), restricting models to select from this fixed list and impeding their ability to handle unseen concepts (e.g., badminton, citizen). This constraint diminishes model effectiveness in open-world scenarios, where answer distributions are highly varied and dynamic.
To overcome these limitations, we propose a new evaluation benchmark for VQA, termed Open-Vocabulary Visual Question Answering (OVVQA). OVVQA is designed to better align with open-world conditions and to provide a more accurate assessment of a model’s multimodal reasoning abilities.
We propose a novel Causal Adapter to efficiently transfer causal knowledge from the pretrained model to OVVQA, designed in a plug-and-play manner for ease of integration. Our framework enables the model to identify the causal relationship between inputs (e.g., visual content) and outputs (e.g., answers), thereby mitigating biases caused by distribution discrepancies between base and novel classes in OVVQA. We begin by formulating OVVQA as a causal graph and analyzing how existing methods establish spurious associations between visual content and answers. Spurious effects are then eliminated through a causal intervention based on the front-door adjustment principle, which is incorporated into the adapter and does not require the assumption of any observed confounder. This makes the proposed Causal Adapter applicable across the various domains in which the adapter is deployed. Additionally, we introduce an adaptive transfer loss to improve the fine-tuning process, effectively leveraging knowledge from pretext tasks to enhance performance across base and novel classes.
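For readers unfamiliar with adapter-based transfer, the sketch below illustrates only the generic plug-and-play pattern: a small residual bottleneck module inserted after a (frozen) pretrained sublayer, with only the adapter parameters trained. It is not the paper's Causal Adapter and omits the front-door intervention, which is implemented in VL-T5/src/modeling_adapter.py; the module name and dimensions here are placeholders.

# Illustrative plug-and-play adapter skeleton (not the paper's Causal Adapter;
# the actual front-door intervention lives in VL-T5/src/modeling_adapter.py).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A residual bottleneck module that can be dropped in after a frozen sublayer."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # project to a low-rank space
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, d_model)     # project back to the model width

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the pretrained representation,
        # so only the lightweight adapter parameters need to be trained.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Example: insert after a frozen transformer sublayer and train only the adapter.
x = torch.randn(2, 10, 768)              # (batch, sequence length, hidden size)
out = BottleneckAdapter(d_model=768)(x)  # output has the same shape as the input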
The table presents the experimental results on our reconstituted OVVQA datasets: OV-VQAv2, OV-GQA, and OV-OKVQA. We report performance across several aspects, including results for base and novel classes, the arithmetic mean (Avg), and the harmonic mean (H). For OV-VQAv2, we further subdivide the novel classes into three subcategories: Yes/No (Y/N), Number (Num.), and Other.
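Since Avg and H are simply the arithmetic and harmonic means of the base- and novel-class accuracies, they can be reproduced from the per-class numbers as follows (the accuracy values in the example are made up):

def avg_and_harmonic(base_acc, novel_acc):
    """Arithmetic mean (Avg) and harmonic mean (H) of base/novel accuracies."""
    avg = (base_acc + novel_acc) / 2.0
    h = 2.0 * base_acc * novel_acc / (base_acc + novel_acc)
    return avg, h

# Example with made-up accuracies: base 70.0, novel 40.0 -> Avg 55.0, H ~50.9
print(avg_and_harmonic(70.0, 40.0))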
# Create python environment (optional)
conda create -n causala python=3.8
source activate causala
# Install python dependencies
pip install -r requirements.txt
# Download T5/BART backbone checkpoint
python download_backbones.py
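For reference, downloading T5/BART backbones with Hugging Face Transformers looks roughly like the sketch below; the exact model identifiers and cache handling in download_backbones.py may differ ("t5-base" and "facebook/bart-base" are assumptions).

# Minimal sketch of caching the backbones; the real download_backbones.py may differ.
from transformers import (
    T5Tokenizer, T5ForConditionalGeneration,
    BartTokenizer, BartForConditionalGeneration,
)

for name in ["t5-base"]:                     # assumed T5 checkpoint name
    T5Tokenizer.from_pretrained(name)
    T5ForConditionalGeneration.from_pretrained(name)

for name in ["facebook/bart-base"]:          # assumed BART checkpoint name
    BartTokenizer.from_pretrained(name)
    BartForConditionalGeneration.from_pretrained(name)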
# Store images, features, and annotations
./datasets
    COCO/
        images/
        features/
    VG/
        images/
        features/
    GQA/
        images/
        features/
    OVVQAv2/
        ...
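A quick way to verify the layout above before running anything is a small check such as the following (an illustrative helper, not part of the repo; extend the list to match whatever partitions you download):

# Sanity-check the expected dataset layout (illustrative helper, not part of the repo).
import os

expected = [
    "datasets/COCO/images", "datasets/COCO/features",
    "datasets/VG/images", "datasets/VG/features",
    "datasets/GQA/images", "datasets/GQA/features",
    "datasets/OVVQAv2",
]
for path in expected:
    print(("ok     " if os.path.isdir(path) else "MISSING"), path)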
# Run feature extraction
./feature_extraction
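The scripts in ./feature_extraction define the actual output format. Purely as an illustration, reading precomputed region features might look like the sketch below, where the file name and the HDF5 keys ("features", "boxes") are assumptions, not the repo's actual format.

# Illustrative only: file name and HDF5 keys are assumptions about the output format.
import h5py

with h5py.File("datasets/COCO/features/example_features.h5", "r") as f:
    print(list(f.keys()))          # inspect what the extraction step actually stored
    feats = f["features"][()]      # e.g., per-region visual features
    boxes = f["boxes"][()]         # e.g., corresponding bounding boxes
    print(feats.shape, boxes.shape)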
# Train VL-T5
./VL-T5/
    src/
        modeling_t5.py, modeling_bart.py                  <= VL-T5/VL-BART model classes
        modeling_adapter.py                               <= Causal_Adapter
        pretrain.py, pretrain_data.py, pretrain_model.py  <= pretraining
        vqa.py, vqa_data.py, vqa_model.py, gqa.py ...     <= fine-tuning on downstream tasks
        param.py                                          <= (argparse) configuration
        tokenization.py                                   <= custom tokenizer
        utils.py, dist_utils.py                           <= utility functions
    snap/                                                 <= stores weight checkpoints
    scripts/                                              <= bash scripts for pretraining and finetuning
- Download the OVVQAv2 partition from Google Drive and put it into datasets/OVVQAv2.
- Download the OVGQA partition from Google Drive and put it into datasets/OVGQA.
- Download the OVOKVQA partition from Google Drive and put it into datasets/OVOKVQA.
- Download datasets/COCO from Google Drive.
- Download datasets/VG from Google Drive.
- Download model checkpoints from Google Drive.
# Training with 1 gpu for VQA v2
cd VL-T5/
bash scripts/VQA_VLT5.sh # Standard training based on VL-T5
bash scripts/VQA_VLBART.sh # Standard training based on VL-BART
# Training with 1 gpu for GQA
cd VL-T5/
bash scripts/GQA_VLT5.sh # Standard training based on VL-T5
bash scripts/GQA_VLBART.sh # Standard training based on VL-BART
# Training with 1 gpu for OKVQA
cd VL-T5/
bash scripts/OKVQA_VLT5.sh # Standard training based on VL-T5
bash scripts/OKVQA_VLBART.sh # Standard training based on VL-BART
Our model is built on the official VL-T5 repository; we thank the authors for releasing their code. If you use the related parts, please cite the corresponding papers noted in the code comments.