Zixin Zhang1,4*, Kanghao Chen1,4*, Hanqing Wang1*, Hongfei Zhang1, Harold Haodong Chen1,4*, Chenfei Liao1,3, Litao Guo1, Ying-Cong Chen1,2✉
1HKUST(GZ)
2HKUST
3SJTU
4Knowin
*Equal contribution.
✉Corresponding author.
A4-Agent is an agentic framework designed for zero-shot affordance reasoning. Given an observed object, it integrates image generation, object detection, segmentation, and Vision-Language Models (VLMs) to imagine plausible interactions and localize action-specific parts. A4-Agent achieves state-of-the-art performance across multiple benchmarks in a zero-shot setting, outperforming baseline models specifically trained for affordance prediction tasks.
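As a rough mental model, the flow described above can be sketched as follows. This is purely illustrative: the function signature and callables below are hypothetical placeholders and do not correspond to this repository's actual modules or API.

```python
from typing import Any, Callable, Sequence

# Illustrative sketch of the high-level A4-Agent flow described above.
# The callables are hypothetical placeholders for the components named in
# the paper/README (image editor, detector, VLM, segmenter); this is NOT
# the repository's API.
def a4_agent(
    image: Any,
    task: str,
    dreamer: Callable[[Any, str], Any],             # imagines a plausible interaction (e.g., Qwen-Image-Edit)
    detector: Callable[[Any, str], Sequence],       # proposes object/part boxes (e.g., Rex-Omni)
    reasoner: Callable[[Any, Sequence, str], Any],  # VLM picks the action-specific part (e.g., GPT-4o / Qwen2.5-VL)
    segmenter: Callable[[Any, Any], Any],           # turns the chosen box into a mask (e.g., SAM 2)
):
    imagined = dreamer(image, task)            # 1) imagine how the object would be used
    boxes = detector(imagined, task)           # 2) localize candidate parts
    chosen_box = reasoner(image, boxes, task)  # 3) reason about which part affords the action
    return segmenter(image, chosen_box)        # 4) produce the affordance mask
```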
- [2025-12] The arXiv paper is now available.
- [2025-12] We release the code and website for A4-Agent.
- Installation
- Inference on Your Own Custom Images
- Inference on Benchmark Datasets
- Acknowledgement
- Citation
## Installation

```bash
conda create -n a4-agent python=3.11
conda activate a4-agent
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```

Flash Attention is required for Rex-Omni. We strongly recommend installing it from a pre-built wheel to avoid compilation issues. You can find the pre-built wheel for your system here. For the environment setup above, use:

```bash
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
```
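The wheel filename encodes the Python, PyTorch, and CUDA versions it was built for (`cp311`, `torch2.5`, `cu12`). A quick sanity check that your environment matches, and that Flash Attention imports after installation:

```bash
# Confirm the interpreter and PyTorch build match the wheel name
# (cp311 -> Python 3.11, torch2.5 -> PyTorch 2.5.x, cu12 -> CUDA 12.x).
python -c "import sys, torch; print(sys.version.split()[0], torch.__version__, torch.version.cuda)"

# After installing the wheel, verify that Flash Attention imports cleanly.
python -c "import flash_attn; print(flash_attn.__version__)"
```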
```bash
pip install -r requirements.txt
pip install git+https://github.com/IDEA-Research/Rex-Omni.git --no-deps
pip install git+https://github.com/facebookresearch/sam2.git
```

You can manually download the models from Hugging Face using the commands below. If skipped, the script will attempt to download them automatically during the first run.
```bash
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct
huggingface-cli download Qwen/Qwen-Image-Edit-2509
huggingface-cli download facebook/sam2.1-hiera-large
huggingface-cli download IDEA-Research/Rex-Omni
```

## Inference on Your Own Custom Images
- Set `API_BASE_URL` and `API_KEY` in your `.env` file (e.g., for OpenAI); see the example `.env` sketched below.
- Run the demo:

  ```bash
  python demo/demo.py --image-path /path/to/your/image --task-instruction "your task instruction" --model-name gpt-4o
  ```

Alternatively, run the demo with a locally deployed Qwen-2.5VL model:

- Deploy the Qwen-2.5VL model to a local API server:

  ```bash
  python inference/qwen_2_5vl_server.py
  ```

- Set `QWEN_2_5_URL` in your `.env` file to your local server URL.
- Start a new terminal and run the demo:

  ```bash
  python demo/demo.py --image-path /path/to/your/image --task-instruction "your task instruction" --model-name qwen2.5vl-7b-instruct
  ```
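For reference, a minimal `.env` might look like the following. All values are placeholders (illustrative endpoint, key, and port), not real credentials or guaranteed defaults; set only the variables relevant to the backend you use.

```bash
# Illustrative .env contents -- placeholder values only.
API_BASE_URL=https://api.openai.com/v1   # OpenAI-compatible endpoint
API_KEY=sk-xxxxxxxxxxxxxxxx              # your API key
QWEN_2_5_URL=http://localhost:8000       # URL printed by qwen_2_5vl_server.py (port here is a placeholder)
```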
## Inference on Benchmark Datasets

Arguments for `inference/agent.py`:

- `--resume`: Resume inference from where it left off.
- `--model-name`: The model name to use (default: `"gpt-4o-11-7"`).
- `--dataset-type`: (Required) The dataset type (`"UMD"`, `"3DOI"`).
- `--dataset-path`: (Required) Path to the dataset directory.
- `--api-url`: API server URL. Defaults to `API_BASE_URL` or `QWEN_2_5_URL` from the environment variables.
- `--api-key`: API key. Defaults to `API_KEY` from the environment variables.
- `--use-dreamer`: Enable the "Dreamer" module to generate an imagined image using the Qwen Image Edit pipeline before reasoning.
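For example, a hypothetical invocation that enables the Dreamer module on 3DOI (the dataset path is a placeholder):

```bash
python inference/agent.py --dataset-type 3DOI --dataset-path /path/to/3DOI --model-name gpt-4o --use-dreamer --resume
```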
Configure `API_BASE_URL` and `API_KEY` in `.env`, then run:

```bash
python inference/agent.py --dataset-type UMD --dataset-path /path/to/UMD --model-name gpt-4o --resume
```

Alternatively, with a locally deployed Qwen-2.5VL model:

- Start the server:

  ```bash
  python inference/qwen_2_5vl_server.py
  ```

- Configure `QWEN_2_5_URL` in `.env`, then start a new terminal and run:

  ```bash
  python inference/agent.py --dataset-type UMD --dataset-path /path/to/UMD --model-name qwen2.5vl-7b-instruct --resume
  ```

## Acknowledgement

- RAGNet - We follow their implementation to evaluate performance on the RAGNet dataset.
- Affordance-R1 - We follow their implementation to evaluate performance on the UMD and ReasonAff datasets.
## Citation

If you find this work helpful, please cite:

```bibtex
@article{zhang2025a4agent,
  title={A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning},
  author={Zhang, Zixin and Chen, Kanghao and Wang, Hanqing and Zhang, Hongfei and Chen, Harold Haodong and Liao, Chenfei and Guo, Litao and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2512.14442},
  year={2025}
}
```