A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

Paper | Project Page

Zixin Zhang1,4*, Kanghao Chen1,4*, Hanqing Wang1*, Hongfei Zhang1, Harold Haodong Chen1,4*, Chenfei Liao1,3, Litao Guo1, Ying-Cong Chen1,2✉

1HKUST(GZ) 2HKUST 3SJTU 4Knowin
*Equal contribution. ✉Corresponding author.


A4-Agent is an agentic framework designed for zero-shot affordance reasoning. Given an observed object, it integrates image generation, object detection, segmentation, and Vision-Language Models (VLMs) to imagine plausible interactions and localize action-specific parts. A4-Agent achieves state-of-the-art performance across multiple benchmarks in a zero-shot setting, outperforming baseline models specifically trained for affordance prediction tasks.
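Conceptually, the pipeline chains four stages: a Dreamer that imagines the interaction by editing the input image, a Thinker (VLM) that reasons about which part of the object affords the action, a detector that localizes that part, and a segmenter that turns the detection into a pixel-level mask. The sketch below illustrates this control flow only; every function here is a stand-in stub, not the repository's actual API:

# Illustrative sketch of the A4-Agent control flow. All function bodies are stubs;
# the real pipeline wires these stages to Qwen-Image-Edit (Dreamer), a VLM Thinker,
# Rex-Omni (detection), and SAM2 (segmentation).

def dream(image, instruction):
    return image                  # stub: imagine the interaction via image editing

def think(image, instruction):
    return "handle"               # stub: VLM names the action-specific part

def detect(image, part_query):
    return (0, 0, 10, 10)         # stub: bounding box (x1, y1, x2, y2) for that part

def segment(image, box):
    return [[1]]                  # stub: binary affordance mask

def predict_affordance(image, instruction):
    imagined = dream(image, instruction)
    part = think(imagined, instruction)
    box = detect(imagined, part)
    return segment(image, box)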

📰 News

  • [2025-12] The arXiv paper is now available.
  • [2025-12] We release the code and project website for A4-Agent.

Contents

  • Installation
  • Inference on Your Own Custom Images
  • Inference on Benchmark Datasets
  • Acknowledgement
  • Citation
Installation

1. Create a Conda Environment and Install PyTorch

conda create -n a4-agent python=3.11
conda activate a4-agent
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
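
(Optional) To confirm that the CUDA-enabled build is active, you can run:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"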

2. Install Flash Attention

Flash Attention is required for Rex-Omni. We strongly recommend installing Flash Attention using a pre-built wheel to avoid compilation issues.

You can find a pre-built wheel matching your system on the Flash Attention releases page. For the environment setup above, use:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install -r requirements.txt
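
If the wheel installed cleanly, importing Flash Attention should succeed without any compilation step:

python -c "import flash_attn; print(flash_attn.__version__)"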

3. Install Rex-Omni & SAM2

pip install git+https://github.com/IDEA-Research/Rex-Omni.git --no-deps
pip install git+https://github.com/facebookresearch/sam2.git
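
As an optional sanity check that SAM2 is usable (note that this downloads the checkpoint from Hugging Face on first use):

python -c "from sam2.sam2_image_predictor import SAM2ImagePredictor; SAM2ImagePredictor.from_pretrained('facebook/sam2.1-hiera-large')"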

4. Download Models

You can manually download the models from Hugging Face using the commands below. If skipped, the script will attempt to download them automatically during the first run.

huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct
huggingface-cli download Qwen/Qwen-Image-Edit-2509
huggingface-cli download facebook/sam2.1-hiera-large
huggingface-cli download IDEA-Research/Rex-Omni
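
Alternatively, the same repositories can be pre-fetched from Python with huggingface_hub:

from huggingface_hub import snapshot_download

# Download each model into the local Hugging Face cache.
for repo in ("Qwen/Qwen2.5-VL-7B-Instruct", "Qwen/Qwen-Image-Edit-2509",
             "facebook/sam2.1-hiera-large", "IDEA-Research/Rex-Omni"):
    snapshot_download(repo)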

Inference on Your Own Custom Images

Option 1: Using Proprietary MLLMs as the Thinker

  1. Set API_BASE_URL and API_KEY in your .env file (e.g., for OpenAI; see the example below).
  2. Run the demo:
python demo/demo.py --image-path /path/to/your/image --task-instruction "your task instruction" --model-name gpt-4o
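
A minimal .env for this option might look like the following (placeholder values; the exact base URL depends on your provider):

API_BASE_URL=https://api.openai.com/v1
API_KEY=sk-...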

Option 2: Using Open-Source Qwen2.5-VL as the Thinker

  1. Deploy the Qwen2.5-VL model to a local API server:
python inference/qwen_2_5vl_server.py
  2. Set QWEN_2_5_URL in your .env file to your local server URL (see the example below).
  3. Start a new terminal and run the demo:
python demo/demo.py --image-path /path/to/your/image --task-instruction "your task instruction" --model-name qwen2.5vl-7b-instruct
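
For example, if the server listens on port 8000 locally (the actual host and port depend on how inference/qwen_2_5vl_server.py is configured), the .env entry would look like:

QWEN_2_5_URL=http://localhost:8000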

Inference on Benchmark Datasets

Arguments for inference/agent.py:
  • --resume: Resume inference from where it left off.
  • --model-name: The model name to use (default: "gpt-4o-11-7").
  • --dataset-type: (Required) The dataset type ("UMD", "3DOI").
  • --dataset-path: (Required) Path to the dataset directory.
  • --api-url: API server URL. Defaults to API_BASE_URL or QWEN_2_5_URL from env vars.
  • --api-key: API key. Defaults to API_KEY from env vars.
  • --use-dreamer: Enable the "Dreamer" module to generate an imagined image using Qwen Image Edit Pipeline before reasoning.
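
As a concrete example, a run on the 3DOI benchmark with the Dreamer module enabled might look like this (the dataset path is a placeholder):

python inference/agent.py --dataset-type 3DOI --dataset-path /path/to/3DOI --model-name gpt-4o --use-dreamer --resume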

Option 1: Using Proprietary MLLMs

Configure API_BASE_URL and API_KEY in .env, then run:

python inference/agent.py --dataset-type UMD --dataset-path /path/to/UMD --model-name gpt-4o --resume

Option 2: Using Open-Source Qwen2.5-VL

  1. Start the server:
python inference/qwen_2_5vl_server.py
  2. Configure QWEN_2_5_URL in .env, then start a new terminal and run:
python inference/agent.py --dataset-type UMD --dataset-path /path/to/UMD --model-name qwen2.5vl-7b-instruct --resume

Acknowledgement

  • RAGNet - We follow their implementation to evaluate performance on the RAGNet dataset.
  • Affordance-R1 - We follow their implementation to evaluate performance on the UMD and ReasonAff datasets.

Citation

If you find this work helpful, please cite:

@article{zhang2025a4agent,
    title={A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning}, 
    author={Zhang, Zixin and Chen, Kanghao and Wang, Hanqing and Zhang, Hongfei and Chen, Harold Haodong and Liao, Chenfei and Guo, Litao and Chen, Ying-Cong},
    journal={arXiv preprint arXiv:2512.14442},
    year={2025}
}
