OpenClaw • Claude Code • MetaClaw • PicoClaw • Nanobot • + Any Agent via Plugin
🇨🇳 中文 • 🇯🇵 日本語 • 🇰🇷 한국어 • 🇪🇸 Español • 🇫🇷 Français • 🇩🇪 Deutsch
Overview • Leaderboard • Quick Start • Supported Frameworks • Data & Evaluation • Case Studies • MetaClaw Integration • Plugin System • Documentation • Project Structure • Development • Related Projects • Citation • License
ClawArena is a benchmark evaluation platform for AI coding agents. It provides a unified pipeline to run inference, score results, and compare performance across different agent frameworks on the same set of realistic, multi-session scenarios.
- 64 scenarios across 8 domains: Tech/HR, Hospital, NGO, Clinical, Content Creator, Finance, HR, Campus
- 1,879 evaluation rounds mixing multiple-choice reasoning and execution-based checks
- Multi-session context: agents must reason over workspace files, chat histories across multiple channels, and dynamic updates that arrive mid-evaluation
- Framework-agnostic: plug in any agent via adapters; 5 frameworks supported out of the box
- MetaClaw integration: evaluate agents enhanced with memory, skills, and RL
```bash
bash scripts/setup.sh
```

This installs ClawArena and all supported framework CLIs (Claude Code, PicoClaw, Nanobot) in one command. See the Installation Guide for manual setup and MetaClaw installation.
First configure the environment variables (see `scripts/env_example.sh` for a template), then run:
```bash
python scripts/test_run.py
```

Edit `scripts/test_run.py` to configure frameworks, concurrency, timeout, and output path.
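As a rough sketch of what such a runner configuration tends to hold (the field names below are illustrative, not the actual variables in `scripts/test_run.py`):

```python
# Hypothetical configuration sketch -- the real option names are
# defined in scripts/test_run.py and may differ.
RUN_CONFIG = {
    "frameworks": ["openclaw", "claude-code"],  # adapters to evaluate
    "concurrency": 4,                           # parallel scenario workers
    "timeout_s": 600,                           # per-round wall-clock limit
    "output_dir": "results/",                   # where inference artifacts land
}

def validate_config(cfg: dict) -> dict:
    """Basic sanity checks before launching a run."""
    assert cfg["frameworks"], "at least one framework is required"
    assert cfg["concurrency"] >= 1 and cfg["timeout_s"] > 0
    return cfg
```

Validating up front keeps a misconfigured run from failing halfway through inference.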
Or use the CLI directly:
```bash
# Validate data integrity
clawarena check --data data/clawarena/tests.json

# Run inference for a single framework
clawarena infer --data data/clawarena/tests.json --framework openclaw --out results/

# Score results
clawarena score --infer-dir results/

# Generate report
clawarena report --score-dir results/ --out report/

# Full pipeline (infer + score + report + compare)
clawarena run --data data/clawarena/tests.json --frameworks openclaw,claude-code --out output/
```

See the CLI Reference for all commands and flags.
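If you prefer to drive the stages from Python, the same check → infer → score → report sequence can be chained with `subprocess`; a sketch, not a supported API:

```python
import subprocess

def build_pipeline(data: str, frameworks: list[str], out: str) -> list[list[str]]:
    """Assemble argv lists for check -> infer -> score -> report."""
    cmds = [["clawarena", "check", "--data", data]]
    for fw in frameworks:
        cmds.append(["clawarena", "infer", "--data", data,
                     "--framework", fw, "--out", out])
    cmds.append(["clawarena", "score", "--infer-dir", out])
    cmds.append(["clawarena", "report", "--score-dir", out, "--out", "report/"])
    return cmds

def run_pipeline(cmds: list[list[str]]) -> None:
    for cmd in cmds:
        subprocess.run(cmd, check=True)  # stop on the first failing stage
```

`check=True` makes scoring abort early if inference fails, which mirrors the stage ordering of `clawarena run`.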
| Framework | Type | Language | Notes |
|---|---|---|---|
| OpenClaw | CLI agent | Node.js | – |
| MetaClaw | LLM proxy | Python | Only supported within OpenClaw and Nanobot |
| Claude Code | CLI agent | Node.js | Assisted by Claude Code Router |
| PicoClaw | CLI agent | Go | – |
| Nanobot | CLI agent | Python | – |
New frameworks can be added via the plugin system without modifying core code.
⚠️ Billing & Policy Notice (April 4, 2026): Third-party tools and agents such as OpenClaw may no longer route traffic through your personal Claude Free/Pro/Max subscription credentials. Any Claude integration in ClawArena that uses Claude.ai OAuth login must switch to official API-key authentication via the Claude Console or a supported cloud provider. Such third-party connections now consume only your paid extra-usage credits, not your subscription quota. Refer to Anthropic's legal and compliance documentation for full policy details.
Each scenario contains:
- Workspace files: documents, spreadsheets, and code that the agent can read
- Session histories: multi-channel chat logs (IM, email, Slack, etc.)
- Evaluation questions: `multi_choice` (reasoning) and `exec_check` (execution verification)
- Dynamic updates: new sessions and files injected between rounds
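A scenario bundle can be modeled roughly as follows (a sketch; the field names are illustrative, and the authoritative manifest schema lives in the Data Structure doc):

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    qtype: str        # "multi_choice" or "exec_check"
    round_idx: int    # evaluation round this question belongs to

@dataclass
class Scenario:
    domain: str                                   # e.g. "Hospital"
    workspace_files: list[str] = field(default_factory=list)
    session_logs: list[str] = field(default_factory=list)  # per-channel chat logs
    questions: list[Question] = field(default_factory=list)
    # dynamic updates injected between rounds: round index -> new file paths
    updates: dict[int, list[str]] = field(default_factory=dict)
```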
Two question types:
| Type | Tests | How |
|---|---|---|
| `multi_choice` | Agent's reasoning and comprehension | Extract `\bbox{A,B,...}` from the response, compute IoU/F1 against ground truth |
| `exec_check` | Agent's actions and file output | Run shell commands to verify exit code and stdout |
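As a sketch of both checks (the extraction regex and metric details here are assumptions, not the exact grader code):

```python
import re
import subprocess

def extract_choices(response: str) -> set[str]:
    """Pull the answer set out of a \\bbox{A,B,...} marker in the response."""
    m = re.search(r"\\bbox\{([^}]*)\}", response)
    return {c.strip() for c in m.group(1).split(",")} if m else set()

def choice_scores(pred: set[str], gold: set[str]) -> tuple[float, float]:
    """IoU and F1 between predicted and ground-truth answer sets."""
    inter = len(pred & gold)
    iou = inter / len(pred | gold) if pred | gold else 1.0
    p = inter / len(pred) if pred else 0.0
    r = inter / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return iou, f1

def exec_check(cmd: str, expected_stdout: str) -> bool:
    """Run a shell command; pass if it exits 0 and prints the expected output."""
    res = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return res.returncode == 0 and res.stdout.strip() == expected_stdout
```

For example, a prediction of `\bbox{A,C}` against ground truth `{A,B}` shares one choice out of three distinct ones, giving IoU 1/3 and F1 0.5.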
Data construction pipeline
See Data Spec for the full six-layer specification system used to construct all 64 scenarios.
We have open-sourced the complete data construction specs, including the six-layer scenario design, synthesis guidelines, and pitfall documentation, in `docs/data-spec/`.
See Data Structure for the full format specification.
ClawArena supports MetaClaw as a transparent proxy layer for evaluating agents enhanced with memory, skills, and RL. Supported frameworks: OpenClaw and Nanobot.
Add a `metaclaw` field to `tests.json`:
```json
{
  "metaclaw": {
    "enabled": true,
    "managed": true,
    "config_path": "metaclaw/memory.yaml",
    "memory_trigger": { "every_n_rounds": 6, "on_last_round": true }
  }
}
```

See the MetaClaw Guide for managed/unmanaged modes, trigger configuration, and YAML templates.
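Under that configuration, memory consolidation would fire on every 6th round and on the final round; a sketch of the assumed decision logic (not MetaClaw's actual implementation):

```python
def should_trigger_memory(round_idx: int, total_rounds: int,
                          every_n_rounds: int = 6,
                          on_last_round: bool = True) -> bool:
    """Decide whether MetaClaw consolidates memory after this round.

    round_idx is 1-based, mirroring the every_n_rounds semantics
    assumed from the config above.
    """
    if on_last_round and round_idx == total_rounds:
        return True
    return every_n_rounds > 0 and round_idx % every_n_rounds == 0
```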
Add new framework adapters without modifying core code:
```bash
clawarena infer --data tests.json --framework my_agent --out results/ --plugin my_agent.py
```

See the Plugin Guide for the adapter interface and engine round hooks.
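A minimal external adapter might look like this (the method names and registration hook are illustrative; the real interface is defined in the Plugin Guide):

```python
# my_agent.py -- hypothetical external adapter; method names are
# illustrative, not the actual plugin interface.
class MyAgentAdapter:
    name = "my_agent"

    def setup(self, workdir: str) -> None:
        """Prepare the agent's working copy before the first round."""
        self.workdir = workdir

    def run_round(self, prompt: str) -> str:
        """Execute one evaluation round and return the agent's response."""
        # A real adapter would shell out to the agent CLI here.
        return f"[{self.name}] answered from {self.workdir}: {prompt[:40]}"

def register(registry: dict) -> None:
    """Entry point the engine might call when loading --plugin my_agent.py."""
    registry[MyAgentAdapter.name] = MyAgentAdapter
```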
| Document | Description |
|---|---|
| Installation | Setup guide for ClawArena, frameworks, and MetaClaw |
| CLI Reference | All commands, flags, and environment variables |
| Data Structure | Dataset format, question types, manifest schema |
| Provider Guide | LLM provider configuration and priority chain |
| MetaClaw Guide | MetaClaw integration modes and trigger hooks |
| Plugin Guide | Writing and registering external framework adapters |
```
ClawArena
├── src/clawarena/
│   ├── cli.py            # CLI entry point
│   ├── core/             # Pipeline: infer, scoring, report, compare, check, run, clean, stats
│   ├── engines/          # Agent execution engines (per-framework)
│   ├── data_handlers/    # Data loading, validation, work copy management (per-framework)
│   ├── adapters/         # Framework adapter composition + registry
│   ├── qtypes/           # Question types: multi_choice, exec_check
│   ├── metaclaw/         # MetaClaw proxy lifecycle and trigger hooks
│   └── plugins/          # External adapter loading (--plugin)
├── data/clawarena/       # Dataset: 64 scenarios, 1,879 questions
├── docs/                 # Documentation
│   └── data-spec/        # Six-layer data construction specification
├── scripts/              # Setup, test runner, comparison utilities
├── helpers/              # Framework-specific helper hooks
└── tests/                # Test suite (229 tests)
```
```bash
pip install -e ".[dev]"
pytest
```

ClawArena builds on and evaluates the following open-source agent frameworks:
- OpenClaw: the primary evaluated CLI agent.
- MetaClaw: meta-learning proxy that enhances agents with memory, skills, and RL.
- Claude Code: Anthropic's agentic coding tool.
- Claude Code Router: routes Claude Code requests to different models.
- PicoClaw: lightweight Go-based CLI agent.
- Nanobot: Python-native CLI agent with Anthropic API support.
```bibtex
@article{ji2026clawarena,
  title={ClawArena: A Multi-Framework Benchmark for Evaluating AI Coding Agents on Realistic Multi-Session Scenarios},
  author={Ji, Haonian and Xiong, Kaiwen and Han, Siwei and Xia, Peng and Qiu, Shi and Zhou, Yiyang and Liu, Jiaqi and Li, Jinlong and Li, Bingzhou and Zheng, Zeyu and Xie, Cihang and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2604.04202},
  year={2026}
}
```

This project is licensed under the MIT License.






