```
.
├── eval/                  # Core training / inference pipeline
│   ├── train.py           # Training-stage rollout + experience / skill update
│   ├── infer.py           # Inference-stage rollout with retrieval
│   ├── cluster.py         # K-Means clustering for prioritized organization
│   ├── ace_skill/         # Experience manager, retriever, weighted sampler, skill builder
│   ├── engine/            # API caller, tool handler, model dispatch
│   ├── tools/             # Tool implementations (code interpreter, web / image search, ...)
│   ├── search/            # SearchNode tree abstraction for reasoning state
│   ├── prompts/           # Prompt templates (reasoning, experience, skill, judge)
│   ├── configs/           # Tool configs (e.g. tool_configs.yaml)
│   └── utils/             # Shared utilities
├── benchmark/             # Train and test data
├── memory_bank/           # Auto-generated experience / skill libraries per dataset
├── configs/               # Top-level run configs (e.g. prioritize.yaml)
├── run_ace_skill.sh       # End-to-end train / inference launcher
├── requirements.txt       # Python dependencies
└── README.md
```
```shell
pip install -r requirements.txt
```

Ace-Skill needs three kinds of credentials: LLM endpoints (reasoning / verifier / experience generator, plus an embedding model), web-tool API keys (only for benchmarks that browse the web), and an image-host key used by `image_search`.
```shell
# --- Reasoning / verifier / experience LLMs ---------------------------------
# Primary endpoint (used for reasoning + experience generation)
REASONING_API_KEY=...
REASONING_END_POINT=...

# Secondary endpoint (used for the verifier and as a fallback)
REASONING_API_KEY_2=...
REASONING_END_POINT_2=...

# --- Embedding model for experience retrieval ------------------------------
EXPERIENCE_EMBEDDING_API_KEY=...
EXPERIENCE_EMBEDDING_ENDPOINT=...

# --- Web-tool credentials (only required for benchmarks that use them) -----
JINA_API_KEY=...              # used by `visit` for page parsing
SERPAPI_KEY=...               # used by `web_search`
```

Most fields in `eval/configs/tool_configs.yaml` are timeouts / limits with sensible defaults; the only entries you typically need to touch are:
```yaml
image_search:
  imgbb_api_key: "<your-imgbb-key>"        # required for image_search; get one at https://api.imgbb.com/

code_interpreter:
  work_dir: "workspace/code_interpreter"   # base dir for per-instance scratch dirs (auto-created)

zoom:
  work_dir: "workspace/zoom"               # base dir for per-instance scratch dirs (auto-created)
```

Tip: only fill in the credentials for the tools listed in `ENABLED_TOOLS` for your dataset (see the `if`-branch in `run_ace_skill.sh`). For example, `tir-bench` only needs `code_interpreter` and works without any web or imgbb keys.
Ace-Skill is evaluated on four multimodal-agent benchmarks. Pick the ones you need and download them from Hugging Face:
| Benchmark | Link |
|---|---|
| VisualToolBench | DjangoJungle/VisualToolBench |
| MMSearch-Plus | DjangoJungle/MMSearch-Plus |
| TIR-Bench | DjangoJungle/TIR-Bench |
| AgentVista | Warrieryes/AgentVista |
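One way to fetch a benchmark programmatically is via `huggingface_hub` (a sketch, not part of the repo; the repo ids come from the table above, and `snapshot_download` is assumed to be available via `pip install huggingface_hub`):

```python
"""Download a benchmark from Hugging Face into benchmark/<DatasetName>/."""

# Repo ids from the table above.
BENCHMARKS = {
    "VisualToolBench": "DjangoJungle/VisualToolBench",
    "MMSearch-Plus": "DjangoJungle/MMSearch-Plus",
    "TIR-Bench": "DjangoJungle/TIR-Bench",
    "AgentVista": "Warrieryes/AgentVista",
}


def download(name: str) -> str:
    """Fetch one benchmark dataset snapshot; returns the local directory."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(
        repo_id=BENCHMARKS[name],
        repo_type="dataset",
        local_dir=f"benchmark/{name}",
    )
```

For example, `download("TIR-Bench")` would populate `benchmark/TIR-Bench/`; you can also use the `huggingface-cli download` command to the same effect.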
Place each dataset under `benchmark/<DatasetName>/` so that the train / test JSONs and image folders match the paths referenced by `run_ace_skill.sh`. For example, TIR-Bench should look like:
```
benchmark/TIR-Bench/
├── data/          # images referenced by samples
├── train.json
└── test.json
```
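Before launching, it can be worth verifying the layout (a hypothetical helper, not part of the repo):

```python
from pathlib import Path


def check_layout(dataset: str, root: str = "benchmark") -> list[str]:
    """Return the paths missing from the expected dataset layout."""
    base = Path(root) / dataset
    expected = [base / "data", base / "train.json", base / "test.json"]
    return [str(p) for p in expected if not p.exists()]


missing = check_layout("TIR-Bench")
print("layout OK" if not missing else f"missing: {missing}")
```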
Generate the doc_id → cluster_id mapping that the clustered organizer relies on. The script fits K-Means on the training set and predicts cluster ids for the test set, writing two JSON files next to the inputs:
```shell
python eval/cluster.py \
    benchmark/TIR-Bench/train.json \
    benchmark/TIR-Bench/test.json \
    -k 5
```

This produces `train_doc_id_to_cluster.json` and `test_doc_id_to_cluster.json` in the same directory; `run_ace_skill.sh` will look for these files automatically.
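For intuition, the clustering step amounts to roughly the following (a minimal pure-Python sketch: the real `eval/cluster.py` presumably embeds questions with the configured embedding model, whereas this toy version hashes words into a small vector, and the `doc_id` / `question` field names are assumptions about the JSON schema):

```python
import json
import random
from pathlib import Path


def featurize(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: hash each word into a fixed-size bag-of-words vector."""
    v = [0.0] * dim
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v


def nearest(x, centroids) -> int:
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the k fitted centroids."""
    rng = random.Random(seed)
    centroids = [list(x) for x in rng.sample(X, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in X:
            groups[nearest(x, centroids)].append(x)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if its cluster went empty
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def cluster_split(train, test, k, out_dir="."):
    """Fit on train questions, predict both splits, write doc_id -> cluster_id JSONs."""
    centroids = kmeans([featurize(s["question"]) for s in train], k)
    for name, split in (("train", train), ("test", test)):
        mapping = {s["doc_id"]: nearest(featurize(s["question"]), centroids)
                   for s in split}
        Path(out_dir, f"{name}_doc_id_to_cluster.json").write_text(json.dumps(mapping))
```

The key point is that the centroids are fit only on the training split; test samples are merely assigned to their nearest training-derived cluster.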
`run_ace_skill.sh` wraps the full pipeline. Datasets are switched via `DATASET_NAME`; run identifiers control where logs, outputs, and the memory bank are written.
```shell
# Self-evolving training
DATASET_NAME=tir-bench RUN_ID=my-run bash run_ace_skill.sh train

# Inference using the libraries built during training
DATASET_NAME=tir-bench RUN_ID=my-run INF_ID=my-run-infer bash run_ace_skill.sh inference

# Train then infer in one go
DATASET_NAME=tir-bench RUN_ID=my-run INF_ID=my-run-infer bash run_ace_skill.sh all
```

Outputs land in `output/<RUN_ID>/` and `output/<INF_ID>/`, logs in `logs/`, and the experience / skill libraries in `memory_bank/<RUN_ID>/`.
Ace-Skill builds on and is inspired by prior work on self-evolving and memory-augmented agents. We thank the authors for open-sourcing their code.
- XSkill: a framework for continual learning from experiences and skills in multimodal agents. Ace-Skill's experience–skill accumulation pipeline, including the `eval/ace_skill/` modules, the tool-calling engine, and the overall launch-script structure, is built on the XSkill codebase.
- MemEvolve: a meta-evolution framework that jointly evolves the memory content and the memory architecture itself.
If you find Ace-Skill useful in your research, please cite:
```bibtex
@article{xiong2026aceskill,
  title={Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution},
  author={Feng Xiong and Zengbin Wang and Yong Wang and Xuecai Hu and Jinghan He and Liang Lin and Yuan Liu and Xiangxiang Chu},
  year={2026},
  eprint={2605.08887},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.08887},
}
```