A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

Paper | Project Page

Zixin Zhang1,4*, Kanghao Chen1,4*, Hanqing Wang1*, Hongfei Zhang1, Harold Haodong Chen1,4*, Chenfei Liao1,3, Litao Guo1, Ying-Cong Chen1,2✉

1HKUST(GZ) 2HKUST 3SJTU 4Knowin
*Equal contribution. ✉Corresponding author.


A4-Agent is an agentic framework designed for zero-shot affordance reasoning. Given an observed object, it integrates image generation, object detection, segmentation, and Vision-Language Models (VLMs) to imagine plausible interactions and localize action-specific parts. A4-Agent achieves state-of-the-art performance across multiple benchmarks in a zero-shot setting, outperforming baseline models specifically trained for affordance prediction tasks.
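Conceptually, the pipeline chains four stages: a Dreamer that imagines the interaction by editing the input image, a Thinker (VLM) that reasons about which part of the object affords the action, a detector that localizes that part, and a segmenter that turns the detection into a pixel-level mask. The sketch below illustrates this control flow only; every function here is a stand-in stub, not the repository's actual API:

# Illustrative sketch of the A4-Agent control flow. All function bodies are stubs;
# the real pipeline wires these stages to Qwen-Image-Edit (Dreamer), a VLM Thinker,
# Rex-Omni (detection), and SAM2 (segmentation).

def dream(image, instruction):
    return image                  # stub: imagine the interaction via image editing

def think(image, instruction):
    return "handle"               # stub: VLM names the action-specific part

def detect(image, part_query):
    return (0, 0, 10, 10)         # stub: bounding box (x1, y1, x2, y2) for that part

def segment(image, box):
    return [[1]]                  # stub: binary affordance mask

def predict_affordance(image, instruction):
    imagined = dream(image, instruction)
    part = think(imagined, instruction)
    box = detect(imagined, part)
    return segment(image, box)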

📰 News

  • [2025-12] The arXiv paper is now available.
  • [2025-12] We release the code and project website for A4-Agent.

Contents

  • Installation
  • Inference on Your Own Custom Images
  • Inference on Benchmark Datasets
  • Acknowledgement
  • Citation
Installation

1. Create a Conda Environment and Install PyTorch

conda create -n a4-agent python=3.11
conda activate a4-agent
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
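
(Optional) To confirm that the CUDA-enabled build is active, you can run:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"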

2. Install Flash Attention

Flash Attention is required for Rex-Omni. We strongly recommend installing Flash Attention using a pre-built wheel to avoid compilation issues.

You can find a pre-built wheel matching your system on the Flash Attention releases page. For the environment setup above, use:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install -r requirements.txt
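
If the wheel installed cleanly, importing Flash Attention should succeed without any compilation step:

python -c "import flash_attn; print(flash_attn.__version__)"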

3. Install Rex-Omni & SAM2

pip install git+https://github.com/IDEA-Research/Rex-Omni.git --no-deps
pip install git+https://github.com/facebookresearch/sam2.git
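
As an optional sanity check that SAM2 is usable (note that this downloads the checkpoint from Hugging Face on first use):

python -c "from sam2.sam2_image_predictor import SAM2ImagePredictor; SAM2ImagePredictor.from_pretrained('facebook/sam2.1-hiera-large')"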

4. Download Models

You can manually download the models from Hugging Face using the commands below. If skipped, the script will attempt to download them automatically during the first run.

huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct
huggingface-cli download Qwen/Qwen-Image-Edit-2509
huggingface-cli download facebook/sam2.1-hiera-large
huggingface-cli download IDEA-Research/Rex-Omni
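
Alternatively, the same repositories can be pre-fetched from Python with huggingface_hub:

from huggingface_hub import snapshot_download

# Download each model into the local Hugging Face cache.
for repo in ("Qwen/Qwen2.5-VL-7B-Instruct", "Qwen/Qwen-Image-Edit-2509",
             "facebook/sam2.1-hiera-large", "IDEA-Research/Rex-Omni"):
    snapshot_download(repo)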

Inference on Your Own Custom Images

Option 1: Using Proprietary MLLMs as the Thinker

  1. Set API_BASE_URL and API_KEY in your .env file (e.g., for OpenAI; see the example below).
  2. Run the demo:
python demo/demo.py --image-path /path/to/your/image --task-instruction "your task instruction" --model-name gpt-4o
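
A minimal .env for this option might look like the following (placeholder values; the exact base URL depends on your provider):

API_BASE_URL=https://api.openai.com/v1
API_KEY=sk-...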

Option 2: Using Open-Source Qwen2.5-VL as the Thinker

  1. Deploy the Qwen2.5-VL model to a local API server:
python inference/qwen_2_5vl_server.py
  2. Set QWEN_2_5_URL in your .env file to your local server URL (see the example below).
  3. Start a new terminal and run the demo:
python demo/demo.py --image-path /path/to/your/image --task-instruction "your task instruction" --model-name qwen2.5vl-7b-instruct
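
For example, if the server listens on port 8000 locally (the actual host and port depend on how inference/qwen_2_5vl_server.py is configured), the .env entry would look like:

QWEN_2_5_URL=http://localhost:8000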

Inference on Benchmark Datasets

Arguments for inference/agent.py:
  • --resume: Resume inference from where it left off.
  • --model-name: The model name to use (default: "gpt-4o-11-7").
  • --dataset-type: (Required) The dataset type ("UMD", "3DOI").
  • --dataset-path: (Required) Path to the dataset directory.
  • --api-url: API server URL. Defaults to API_BASE_URL or QWEN_2_5_URL from env vars.
  • --api-key: API key. Defaults to API_KEY from env vars.
  • --use-dreamer: Enable the "Dreamer" module to generate an imagined image using Qwen Image Edit Pipeline before reasoning.
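
As a concrete example, a run on the 3DOI benchmark with the Dreamer module enabled might look like this (the dataset path is a placeholder):

python inference/agent.py --dataset-type 3DOI --dataset-path /path/to/3DOI --model-name gpt-4o --use-dreamer --resume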

Option 1: Using Proprietary MLLMs

Configure API_BASE_URL and API_KEY in .env, then run:

python inference/agent.py --dataset-type UMD --dataset-path /path/to/UMD --model-name gpt-4o --resume

Option 2: Using Open-Source Qwen2.5-VL

  1. Start the server:
python inference/qwen_2_5vl_server.py
  2. Configure QWEN_2_5_URL in .env, then start a new terminal and run:
python inference/agent.py --dataset-type UMD --dataset-path /path/to/UMD --model-name qwen2.5vl-7b-instruct --resume

Acknowledgement

  • RAGNet - We follow their implementation to evaluate performance on the RAGNet dataset.
  • Affordance-R1 - We follow their implementation to evaluate performance on the UMD and ReasonAff datasets.

Citation

If you find this work helpful, please cite:

@article{zhang2025a4agent,
    title={A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning}, 
    author={Zhang, Zixin and Chen, Kanghao and Wang, Hanqing and Zhang, Hongfei and Chen, Harold Haodong and Liao, Chenfei and Guo, Litao and Chen, Ying-Cong},
    journal={arXiv preprint arXiv:2512.14442},
    year={2025}
}
