
Vision ReAct

Python version used: 3.12.

Benchmarks to evaluate on:

  • A-OKVQA: A crowdsourced VQA dataset composed of a diverse set of about 25,000 questions that require a broad base of commonsense and world knowledge to answer.
  • CVQA: A culturally diverse multilingual benchmark featuring 10,000 questions across 30 countries and 31 languages, each provided in both English and the local language.
  • ScienceQA: A large‐scale multimodal multiple‐choice science benchmark with questions drawn from elementary to high‐school curricula.

VLMs to evaluate:

  • google/gemma-3-12b-it
  • Qwen/Qwen2.5-VL-7B-Instruct
  • meta-llama/Llama-3.2-11B-Vision-Instruct

Methods to evaluate:

  • Simple VQA inference with VLMs.
  • VQA inference using a VLM-based ReAct agent.
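
A ReAct agent alternates reasoning steps with tool calls ("Actions") and feeds the tool results ("Observations") back into the model. The sketch below is a minimal, hypothetical illustration of that loop; the names (`react_vqa`, the `Search` tool, the action format) are assumptions for illustration and do not reflect this repository's actual agent implementation.

```python
import re

def react_vqa(question, model, tools, max_steps=5):
    """Run a minimal ReAct loop: the model emits actions like
    'Search[query]' or 'Finish[answer]'; tool outputs are appended
    to the transcript until the model finishes or steps run out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        action = model(transcript)                 # e.g. "Search[sky color]"
        match = re.match(r"(\w+)\[(.*)\]", action)
        if match is None:                          # unparseable action: give up
            break
        name, arg = match.groups()
        if name == "Finish":                       # agent decided on an answer
            return arg
        observation = tools[name](arg)             # run the named tool
        transcript += f"Action: {action}\nObservation: {observation}\n"
    return None

# Toy run with stubbed model and tool:
def fake_model(transcript):
    return "Finish[blue]" if "Observation" in transcript else "Search[sky color]"

answer = react_vqa("What color is the sky?", fake_model,
                   {"Search": lambda q: "The sky is blue."})
```

In the `vlm+wiki` configuration, the `Search` tool would be backed by Wikipedia retrieval rather than a stub.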

To run inference, use the following commands (replace XX with the target GPU index for CUDA_VISIBLE_DEVICES):

$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3'      > logs/aokvqa/gemma3_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/aokvqa/qwen2dot5vl_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3'      > logs/aokvqa/llama3_vlm.log 2>&1 &

$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3'      > logs/aokvqa/gemma3_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/aokvqa/qwen2dot5vl_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='A-OKVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3'      > logs/aokvqa/llama3_vlm_wiki.log 2>&1 &


$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3'      > logs/cvqa/gemma3_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/cvqa/qwen2dot5vl_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3'      > logs/cvqa/llama3_vlm.log 2>&1 &

$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3'      > logs/cvqa/gemma3_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/cvqa/qwen2dot5vl_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='CVQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3'      > logs/cvqa/llama3_vlm_wiki.log 2>&1 &


$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3'      > logs/scienceqa/gemma3_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/scienceqa/qwen2dot5vl_vlm.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3'      > logs/scienceqa/llama3_vlm.log 2>&1 &

$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='gemma3'      > logs/scienceqa/gemma3_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='qwen2dot5vl' > logs/scienceqa/qwen2dot5vl_vlm_wiki.log 2>&1 &
$ CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='ScienceQA' --method='vlm+wiki' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='llama3'      > logs/scienceqa/llama3_vlm_wiki.log 2>&1 &
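
The 18 invocations above are the cross product of 3 datasets, 3 models, and 2 methods, so they can also be generated with a loop. This is a hypothetical convenience wrapper, not part of the repository: it only prints each command for review (pipe the output to `sh` to actually launch, after filling in XX).

```shell
gen_predict_cmds() {
  for dataset in A-OKVQA CVQA ScienceQA; do
    # Log directories follow the lowercase, hyphen-free naming used above.
    logdir="logs/$(printf '%s' "$dataset" | tr '[:upper:]' '[:lower:]' | tr -d '-')"
    for model in gemma3 qwen2dot5vl llama3; do
      for method in vlm 'vlm+wiki'; do
        suffix=$(printf '%s' "$method" | tr '+' '_')   # vlm+wiki -> vlm_wiki
        echo "CUDA_VISIBLE_DEVICES=XX nohup python predict.py --dataset='$dataset' --method='$method' --data_root_dir='/hadatasets/caio.rosa/vqa' --model='$model' > $logdir/${model}_${suffix}.log 2>&1 &"
      done
    done
  done
}
gen_predict_cmds
```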

To generate submission files for evaluation, run these commands:

$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/gemma3_vlm_predictions.json'
$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/qwen2dot5vl_vlm_predictions.json'
$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/llama3_vlm_predictions.json'

$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/gemma3_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/qwen2dot5vl_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='A-OKVQA' --results_file='predictions/A-OKVQA/llama3_vlm+wiki_predictions.json'


$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/gemma3_vlm_predictions.json'
$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/qwen2dot5vl_vlm_predictions.json'
$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/llama3_vlm_predictions.json'

$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/gemma3_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/qwen2dot5vl_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='CVQA' --results_file='predictions/CVQA/llama3_vlm+wiki_predictions.json'


$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/gemma3_vlm_predictions.json'
$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/qwen2dot5vl_vlm_predictions.json'
$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/llama3_vlm_predictions.json'

$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/gemma3_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/qwen2dot5vl_vlm+wiki_predictions.json'
$ python make_submission_file.py --dataset='ScienceQA' --results_file='predictions/ScienceQA/llama3_vlm+wiki_predictions.json'
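
These commands follow the same 3 × 3 × 2 pattern, so they can likewise be generated with a small loop. As before, this is a hypothetical helper that only prints the commands; pipe the output to `sh` to run them.

```shell
gen_submission_cmds() {
  for dataset in A-OKVQA CVQA ScienceQA; do
    for model in gemma3 qwen2dot5vl llama3; do
      for method in vlm 'vlm+wiki'; do
        # Prediction filenames keep the method name verbatim (incl. '+').
        echo "python make_submission_file.py --dataset='$dataset' --results_file='predictions/$dataset/${model}_${method}_predictions.json'"
      done
    done
  done
}
gen_submission_cmds
```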

To compute accuracy on ScienceQA, run these commands:

$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/gemma3_vlm_submission.json' > evaluations/gemma3_vlm.txt
$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/qwen2dot5vl_vlm_submission.json' > evaluations/qwen2dot5vl_vlm.txt
$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/llama3_vlm_submission.json' > evaluations/llama3_vlm.txt

$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/gemma3_vlm+wiki_submission.json' > evaluations/gemma3_vlm+wiki.txt
$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/qwen2dot5vl_vlm+wiki_submission.json' > evaluations/qwen2dot5vl_vlm+wiki.txt
$ python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/llama3_vlm+wiki_submission.json' > evaluations/llama3_vlm+wiki.txt
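
The six evaluation runs can be generated the same way. This hypothetical helper prints the commands for review; pipe to `sh` to execute.

```shell
gen_scienceqa_eval_cmds() {
  for model in gemma3 qwen2dot5vl llama3; do
    for method in vlm 'vlm+wiki'; do
      echo "python submodules/ScienceQA/tools/evaluate_acc.py --data_file='submodules/ScienceQA/data/scienceqa/problems.json' --result_file='submissions/ScienceQA/${model}_${method}_submission.json' > evaluations/${model}_${method}.txt"
    done
  done
}
gen_scienceqa_eval_cmds
```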

About

Vision ReAct is an agentic, VLM-based method for VQA, developed as an assignment for the EMNLP (Empirical Methods in Natural Language Processing) course at Peking University.
