aiming-lab/ClawArena

ClawArena

Benchmarking AI Agents in Evolving Information Environments.


[Figure: ClawArena overview]

Supported frameworks: OpenClaw • Claude Code • MetaClaw • PicoClaw • Nanobot, plus any agent via the plugin system

🇨🇳 中文 • 🇯🇵 日本語 • 🇰🇷 한국어 • 🇪🇸 Español • 🇫🇷 Français • 🇩🇪 Deutsch


arXiv • Website • GitHub • License: MIT • PRs welcome

Python ≥3.10 • 64 scenarios • 1,879 questions • 5 frameworks

🔭 Overview • 📈 Leaderboard • 🚀 Quick Start • 🤖 Supported Frameworks • 📊 Data & Evaluation • 🔍 Case Studies • 🦞 MetaClaw Integration • 🔌 Plugin System • 📖 Documentation • 🏗️ Project Structure • 🛠️ Development • 🙏 Related Projects • 📚 Citation • 📄 License


🔭 Overview

ClawArena is a benchmark evaluation platform for AI coding agents. It provides a unified pipeline to run inference, score results, and compare performance across different agent frameworks on the same set of realistic, multi-session scenarios.

  • 64 scenarios across 8 domains β€” Tech/HR, Hospital, NGO, Clinical, Content Creator, Finance, HR, Campus
  • 1,879 evaluation rounds mixing multiple-choice reasoning and execution-based checks
  • Multi-session context β€” agents must reason over workspace files, chat histories across multiple channels, and dynamic updates that arrive mid-evaluation
  • Framework-agnostic β€” plug in any agent via adapters; 5 frameworks supported out of the box
  • MetaClaw integration β€” evaluate agents enhanced with memory, skills, and RL

[Figure: ClawArena dataset]

📈 Leaderboard

[Figure: ClawArena leaderboard]

🚀 Quick Start

1. Install everything

bash scripts/setup.sh

This installs ClawArena and all supported framework CLIs (Claude Code, PicoClaw, Nanobot) in one command. See Installation Guide for manual setup and MetaClaw installation.

2. Run the benchmark

First, configure the environment variables by following scripts/env_example.sh, then run:

python scripts/test_run.py

Edit scripts/test_run.py to configure frameworks, concurrency, timeout, and output path.
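For illustration only, the options you edit in scripts/test_run.py might be organized like the sketch below. The field names here are hypothetical; check the script itself for the real ones.

```python
# Hypothetical sketch of a run configuration; every field name here is
# illustrative and may not match what scripts/test_run.py actually uses.
RUN_CONFIG = {
    "frameworks": ["openclaw", "claude-code"],  # adapters to evaluate
    "concurrency": 4,                           # parallel scenario workers
    "timeout_s": 600,                           # per-round timeout in seconds
    "out_dir": "results/",                      # where inference output lands
}
```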

Or use the CLI directly:
# Validate data integrity
clawarena check --data data/clawarena/tests.json

# Run inference for a single framework
clawarena infer --data data/clawarena/tests.json --framework openclaw --out results/

# Score results
clawarena score --infer-dir results/

# Generate report
clawarena report --score-dir results/ --out report/

# Full pipeline (infer + score + report + compare)
clawarena run --data data/clawarena/tests.json --frameworks openclaw,claude-code --out output/

See CLI Reference for all commands and flags.


🤖 Supported Frameworks

| Framework | Type | Language | Notes |
| --- | --- | --- | --- |
| OpenClaw | CLI agent | Node.js | — |
| MetaClaw | LLM proxy | Python | Supported only within OpenClaw and Nanobot |
| Claude Code | CLI agent | Node.js | Assisted by Claude Code Router |
| PicoClaw | CLI agent | Go | — |
| Nanobot | CLI agent | Python | — |

New frameworks can be added via the plugin system without modifying core code.

⚠️ Billing & Policy Notice (April 4, 2026): Third-party tools and agents such as OpenClaw may no longer route traffic through personal Claude Free/Pro/Max subscription credentials. Any Claude integration in ClawArena that uses Claude.ai OAuth login must switch to official API-key authentication via the Claude Console or a supported cloud provider. Such third-party connections now consume only paid extra-usage credits, not your subscription quota. Refer to Anthropic's legal and compliance pages for full policy details.


📊 Data & Evaluation

Each scenario contains:

  • Workspace files β€” documents, spreadsheets, code that the agent can read
  • Session histories β€” multi-channel chat logs (IM, email, Slack, etc.)
  • Evaluation questions β€” multi_choice (reasoning) and exec_check (execution verification)
  • Dynamic updates β€” new sessions and files injected between rounds

Two question types:

| Type | Tests | How |
| --- | --- | --- |
| multi_choice | Agent's reasoning and comprehension | Extract \bbox{A,B,...} from the response, compute IoU/F1 against ground truth |
| exec_check | Agent's actions and file output | Run shell commands to verify exit code and stdout |
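As a rough sketch of how the two scorers could work (the real implementations live under src/clawarena/qtypes/; the function names below are invented for illustration):

```python
import subprocess

def multi_choice_score(predicted: set, gold: set) -> tuple:
    """Illustrative set-based IoU and F1 over extracted option letters."""
    inter = len(predicted & gold)
    union = len(predicted | gold)
    iou = inter / union if union else 1.0
    f1 = 2 * inter / (len(predicted) + len(gold)) if (predicted or gold) else 1.0
    return iou, f1

def exec_check_score(cmd: str, expected_stdout: str) -> bool:
    """Illustrative execution check: require exit code 0 and matching stdout."""
    res = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return res.returncode == 0 and res.stdout.strip() == expected_stdout

# Agent answered {A, C}; ground truth is {A, B, C}: iou = 2/3, f1 = 0.8
iou, f1 = multi_choice_score({"A", "C"}, {"A", "B", "C"})
```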
[Figure: ClawArena data construction pipeline]

See Data Spec for the six-layer specification system used to construct all 64 scenarios. The complete construction specs — including the six-layer scenario design, synthesis guidelines, and pitfall documentation — are open-sourced in docs/data-spec/.

See Data Structure for the full format specification.


πŸ” Case Studies

Case 1–2: Multi-Source Conflict Reasoning & Dynamic Belief Revision
Case 3–4: Self-Diagnostic Accuracy & Implicit Preference Compliance Audit
Case 5–6: Execution-Verified Bug Fix & Statistical Methodology Conflict
Case 7–8: Norm Retroactivity Bias & Full-Dimension Integration

🦞 MetaClaw Integration

ClawArena supports MetaClaw as a transparent proxy layer for evaluating agents enhanced with memory, skills, and RL. Supported frameworks: OpenClaw and Nanobot.

Add a metaclaw field to tests.json:

{
  "metaclaw": {
    "enabled": true,
    "managed": true,
    "config_path": "metaclaw/memory.yaml",
    "memory_trigger": { "every_n_rounds": 6, "on_last_round": true }
  }
}

See MetaClaw Guide for managed/unmanaged modes, trigger configuration, and YAML templates.


🔌 Plugin System

Add new framework adapters without modifying core code:

clawarena infer --data tests.json --framework my_agent --out results/ --plugin my_agent.py
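A plugin file passed via --plugin might be shaped roughly like this hypothetical adapter; the actual interface (method names, registration hooks) is specified in the Plugin Guide and may differ.

```python
# my_agent.py -- hypothetical adapter shape; the real adapter interface is
# defined in the Plugin Guide, so treat every name here as an assumption.
class MyAgentAdapter:
    """Bridges ClawArena's engine to a custom agent."""
    name = "my_agent"  # framework id used with --framework my_agent

    def run_round(self, prompt: str, workspace_dir: str) -> str:
        # Invoke your agent against the scenario workspace here (e.g. via
        # subprocess calling your agent's CLI) and return its raw response.
        return f"(stub response for: {prompt[:40]})"
```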

See Plugin Guide for the adapter interface and engine round hooks.


📖 Documentation

| Document | Description |
| --- | --- |
| Installation | Setup guide for ClawArena, frameworks, and MetaClaw |
| CLI Reference | All commands, flags, and environment variables |
| Data Structure | Dataset format, question types, manifest schema |
| Provider Guide | LLM provider configuration and priority chain |
| MetaClaw Guide | MetaClaw integration modes and trigger hooks |
| Plugin Guide | Writing and registering external framework adapters |

πŸ—οΈ Project Structure

ClawArena
├── src/clawarena/
│   ├── cli.py               # CLI entry point
│   ├── core/                # Pipeline: infer, scoring, report, compare, check, run, clean, stats
│   ├── engines/             # Agent execution engines (per-framework)
│   ├── data_handlers/       # Data loading, validation, work copy management (per-framework)
│   ├── adapters/            # Framework adapter composition + registry
│   ├── qtypes/              # Question types: multi_choice, exec_check
│   ├── metaclaw/            # MetaClaw proxy lifecycle and trigger hooks
│   └── plugins/             # External adapter loading (--plugin)
├── data/clawarena/          # Dataset: 64 scenarios, 1879 questions
├── docs/                    # Documentation
│   └── data-spec/           # Six-layer data construction specification
├── scripts/                 # Setup, test runner, comparison utilities
├── helpers/                 # Framework-specific helper hooks
└── tests/                   # Test suite (229 tests)

πŸ› οΈ Development

pip install -e ".[dev]"
pytest

πŸ™ Related Projects

ClawArena builds on and evaluates the following open-source agent frameworks:

  • OpenClaw β€” the primary evaluated CLI agent.
  • MetaClaw β€” meta-learning proxy that enhances agents with memory, skills, and RL.
  • Claude Code β€” Anthropic's agentic coding tool.
  • Claude Code Router β€” route Claude Code requests to different models.
  • PicoClaw β€” lightweight Go-based CLI agent.
  • Nanobot β€” Python-native CLI agent with Anthropic API support.

📚 Citation

@article{ji2026clawarena,
  title={ClawArena: A Multi-Framework Benchmark for Evaluating AI Coding Agents on Realistic Multi-Session Scenarios},
  author={Ji, Haonian and Xiong, Kaiwen and Han, Siwei and Xia, Peng and Qiu, Shi and Zhou, Yiyang and Liu, Jiaqi and Li, Jinlong and Li, Bingzhou and Zheng, Zeyu and Xie, Cihang and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2604.04202},
  year={2026}
}

📄 License

This project is licensed under the MIT License.
