This repository contains the code used in our paper "Tracing Information Flow in LLaMA Vision: A Step Toward Multimodal Understanding".
We present the first systematic analysis of the information flow between language and vision modalities in LLaMA 3.2-Vision, aiming to advance the understanding of the internal dynamics of multimodal large language models (MLLMs).
This repository provides everything needed to reproduce our experiments and results on visual question answering (VQA) tasks using the VQAv2, Visual7W, and DocVQA datasets. It is organized as follows:
eval/ - Evaluation scripts
model/ - Model architecture definitions
tracing_information_flow/ - Scripts to create the datasets used for our analysis
Before running this project, download the folder with the config and checkpoint files of LLaMA-3.2-Vision (the 11B version) from HuggingFace. Then set the following option in the model's config.json file:
"_attn_implementation": "eager"
Clone this repository, create a conda environment for the project, and activate it. Then install all the dependencies with pip:
conda create -n llama_tracing python=3.10.17
conda activate llama_tracing
pip install -r requirements.txt
To analyze the information flow in LLaMA 3.2-Vision across the VQAv2, Visual7W, and DocVQA datasets, run the corresponding scripts vqa_v2_with_attention_blocking.py, visual7w_with_attention_blocking.py, and docvqa_with_attention_blocking.py with the following arguments (an example invocation is shown after the list of pathways):
--model_path <MODEL_PATH> \ # Directory for the checkpoint of the model
--image_dir <IMAGE_DIR> \ # Directory containing the images
--block_types <BLOCK_TYPE> \ # Pathway you want to block
--k <K_VALUE> # Window size K
where <BLOCK_TYPE> can be chosen from:
"last_to_last"
"question_to_last"
"image_to_last"
"image_to_question"
Run the corresponding scripts for each dataset using the following arguments.
VQAv2:
python vqa_v2.py \
--model_path <MODEL_PATH> \ # Directory for the checkpoint of the model
--image_dir <IMAGE_DIR> \ # Directory containing the images
--annotation_path <ANNOTATION_PATH> \ # Path to the annotation json file
--question_path <QUESTION_PATH> # Path to the question json file
Visual7W:
python visual7w.py \
--model_path <MODEL_PATH> \ # Directory for the checkpoint of the model
--image_dir <IMAGE_DIR> \ # Directory containing the images
--annotation_path <ANNOTATION_PATH> # Path to the annotation json file
DocVQA:
python docvqa.py \
--model_path <MODEL_PATH> \ # Directory for the checkpoint of the model
--image_dir <IMAGE_DIR> \ # Directory containing the images
--annotation_path <ANNOTATION_PATH> # Path to the annotation json file
To create datasets with the model's correctly predicted answers, run the evaluation scripts below:
eval/eval_vqa.py
eval/eval_visual7w.py
eval/eval_docvqa.py
Each script takes:
--results <RESULT_PATH> # Path to your result file
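For example, evaluating the VQAv2 results might look like this (the result file name is an example; pass the file produced by the corresponding script above):
python eval/eval_vqa.py --results results/vqa_v2_results.json # example path to your result file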
To generate the final datasets used for information flow analysis, run:
tracing_information_flow/create_vqa.py
tracing_information_flow/create_visual7w.py
tracing_information_flow/create_docvqa.py
Before running these scripts, make sure to set the correct paths at the top of each file. The final datasets will be created inside the corresponding subfolders in:
tracing_information_flow/dataset/
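For example, once the paths at the top of each file are set, the three scripts can be run directly:
python tracing_information_flow/create_vqa.py
python tracing_information_flow/create_visual7w.py
python tracing_information_flow/create_docvqa.py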