Python version used: 3.12.
Benchmarks to evaluate on:
- A-OKVQA: A crowdsourced VQA dataset composed of a diverse set of about 25,000 questions that require a broad base of commonsense and world knowledge to answer.
- CVQA: A culturally diverse multilingual benchmark featuring 10,000 questions across 30 countries and 31 languages, each provided in both English and the local language.
- ScienceQA: A large‐scale multimodal multiple‐choice science benchmark with questions drawn from elementary to high‐school curricula.
VLMs to evaluate:
- google/gemma-3-12b-it
- Qwen/Qwen2.5-VL-7B-Instruct
- meta-llama/Llama-3.2-11B-Vision-Instruct
Methods to evaluate:
- Plain VQA inference with the VLM alone (--method='vlm').
- VQA inference with a VLM-based ReAct agent that retrieves Wikipedia context (--method='vlm+wiki').
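The ReAct-style loop can be sketched as below. This is a minimal illustration, not the code in predict.py: the helpers vlm_generate and wiki_search are hypothetical stand-ins for the real VLM call and the Wikipedia retrieval step, and here they return canned strings so the control flow is runnable.

```python
def vlm_generate(prompt, image=None):
    # Hypothetical stub: a real implementation would call the VLM here.
    if "Observation:" in prompt:
        return "Final Answer: example answer"
    return "Action: Search[example topic]"

def wiki_search(query):
    # Hypothetical stub: a real implementation would query Wikipedia.
    return f"Stub article text about {query}."

def react_vqa(question, image=None, max_steps=3):
    """Iterate Action -> Observation until the model emits a final answer."""
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = vlm_generate(prompt, image)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action: Search["):
            query = step[len("Action: Search["):-1]
            # Append the tool observation and let the VLM reason again.
            prompt += f"{step}\nObservation: {wiki_search(query)}\n"
    return "unknown"
```

The loop caps the number of tool calls with max_steps so a model that never commits to an answer still terminates.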
To run inference, use these commands:
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3' > logs/aokvqa/gemma3_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/aokvqa/qwen2dot5vl_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3' > logs/aokvqa/llama3_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3' > logs/aokvqa/gemma3_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/aokvqa/qwen2dot5vl_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3' > logs/aokvqa/llama3_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3' > logs/cvqa/gemma3_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/cvqa/qwen2dot5vl_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3' > logs/cvqa/llama3_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3' > logs/cvqa/gemma3_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/cvqa/qwen2dot5vl_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3' > logs/cvqa/llama3_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3' > logs/scienceqa/gemma3_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/scienceqa/qwen2dot5vl_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3' > logs/scienceqa/llama3_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3' > logs/scienceqa/gemma3_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/scienceqa/qwen2dot5vl_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3' > logs/scienceqa/llama3_vlm_wiki.log 2>&1 &
To make the submission files for evaluation, run these commands:
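The eighteen inference runs form a dataset x method x model grid, so they can also be generated with a loop. The sketch below is a dry run that only prints the commands; to launch the jobs, swap printf for actual execution and set CUDA_VISIBLE_DEVICES per run as above.

```shell
DATA_ROOT='/hadatasets/caio.rosa/vqa'
cmds=()
for dataset in A-OKVQA CVQA ScienceQA; do
  # logs/ uses a lowercased, hyphen-free dataset name (aokvqa, cvqa, scienceqa)
  logdir=$(echo "$dataset" | tr '[:upper:]' '[:lower:]' | tr -d '-')
  for method in vlm vlm+wiki; do
    suffix=${method//+/_}   # vlm+wiki -> vlm_wiki in the log filename
    for model in gemma3 qwen2dot5vl llama3; do
      cmds+=("python predict.py --dataset='$dataset' --method='$method' --data_root_dir='$DATA_ROOT' --model='$model' > logs/$logdir/${model}_${suffix}.log 2>&1")
    done
  done
done
printf '%s\n' "${cmds[@]}"
```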
$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/gemma3_vlm_predictions.json'
$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/qwen2dot5vl_vlm_predictions.json'
$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/llama3_vlm_predictions.json'
$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/gemma3_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/qwen2dot5vl_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/llama3_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/gemma3_vlm_predictions.json'
$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/qwen2dot5vl_vlm_predictions.json'
$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/llama3_vlm_predictions.json'
$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/gemma3_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/qwen2dot5vl_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/llama3_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/gemma3_vlm_predictions.json'
$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/qwen2dot5vl_vlm_predictions.json'
$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/llama3_vlm_predictions.json'
$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/gemma3_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/qwen2dot5vl_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/llama3_vlm+wiki_predictions.json'
To evaluate on ScienceQA, run these commands:
$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/gemma3_vlm_submission.json' > evaluations/gemma3_vlm.txt
$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/qwen2dot5vl_vlm_submission.json' > evaluations/qwen2dot5vl_vlm.txt
$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/llama3_vlm_submission.json' > evaluations/llama3_vlm.txt
$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/gemma3_vlm+wiki_submission.json' > evaluations/gemma3_vlm+wiki.txt
$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/qwen2dot5vl_vlm+wiki_submission.json' > evaluations/qwen2dot5vl_vlm+wiki.txt
$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/llama3_vlm+wiki_submission.json' > evaluations/llama3_vlm+wiki.txt
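The six evaluation commands above follow the same model x method pattern, so they can likewise be generated in a loop. This sketch only prints the commands (a dry run); pipe each line to a shell, or replace printf with direct execution, to actually evaluate.

```shell
evals=()
for model in gemma3 qwen2dot5vl llama3; do
  for method in vlm vlm+wiki; do
    # Submission and output filenames keep the '+' from the method name.
    evals+=("python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/${model}_${method}_submission.json' > evaluations/${model}_${method}.txt")
  done
done
printf '%s\n' "${evals[@]}"
```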