aiming-lab/ClawArena

ClawArena

Benchmarking AI Agents in Evolving Information Environments.


[Figure: ClawArena overview]

Supported frameworks: OpenClaw • Claude Code • MetaClaw • PicoClaw • Nanobot, plus any agent via the plugin system

🇨🇳 中文 • 🇯🇵 日本語 • 🇰🇷 한국어 • 🇪🇸 Español • 🇫🇷 Français • 🇩🇪 Deutsch


arXiv • Website • GitHub • License: MIT • PRs welcome

Python ≥3.10 • 64 scenarios • 1,879 questions • 5 frameworks

🔭 Overview • 📈 Leaderboard • 🚀 Quick Start • 🤖 Supported Frameworks • 📊 Data & Evaluation • 🔍 Case Studies • 🦞 MetaClaw Integration • 🔌 Plugin System • 📖 Documentation • 🏗️ Project Structure • 🛠️ Development • 🙏 Related Projects • 📚 Citation • 📄 License


🔭 Overview

ClawArena is a benchmark evaluation platform for AI coding agents. It provides a unified pipeline to run inference, score results, and compare performance across different agent frameworks on the same set of realistic, multi-session scenarios.

  • 64 scenarios across 8 domains β€” Tech/HR, Hospital, NGO, Clinical, Content Creator, Finance, HR, Campus
  • 1,879 evaluation rounds mixing multiple-choice reasoning and execution-based checks
  • Multi-session context β€” agents must reason over workspace files, chat histories across multiple channels, and dynamic updates that arrive mid-evaluation
  • Framework-agnostic β€” plug in any agent via adapters; 5 frameworks supported out of the box
  • MetaClaw integration β€” evaluate agents enhanced with memory, skills, and RL

[Figure: ClawArena dataset]

📈 Leaderboard

[Figure: ClawArena leaderboard]

🚀 Quick Start

1. Install everything

bash scripts/setup.sh

This installs ClawArena and all supported framework CLIs (Claude Code, PicoClaw, Nanobot) in one command. See Installation Guide for manual setup and MetaClaw installation.

2. Run the benchmark

First, configure the environment variables by following scripts/env_example.sh, then run:

python scripts/test_run.py

Edit scripts/test_run.py to configure frameworks, concurrency, timeout, and output path.
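For illustration only, the options you edit in scripts/test_run.py might be organized like the sketch below. The field names here are hypothetical; check the script itself for the real ones.

```python
# Hypothetical sketch of a run configuration; every field name here is
# illustrative and may not match what scripts/test_run.py actually uses.
RUN_CONFIG = {
    "frameworks": ["openclaw", "claude-code"],  # adapters to evaluate
    "concurrency": 4,                           # parallel scenario workers
    "timeout_s": 600,                           # per-round timeout in seconds
    "out_dir": "results/",                      # where inference output lands
}
```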

Or use the CLI directly:
# Validate data integrity
clawarena check --data data/clawarena/tests.json

# Run inference for a single framework
clawarena infer --data data/clawarena/tests.json --framework openclaw --out results/

# Score results
clawarena score --infer-dir results/

# Generate report
clawarena report --score-dir results/ --out report/

# Full pipeline (infer + score + report + compare)
clawarena run --data data/clawarena/tests.json --frameworks openclaw,claude-code --out output/

See CLI Reference for all commands and flags.


🤖 Supported Frameworks

| Framework | Type | Language | Notes |
| --- | --- | --- | --- |
| OpenClaw | CLI agent | Node.js | — |
| MetaClaw | LLM proxy | Python | Supported only within OpenClaw and Nanobot |
| Claude Code | CLI agent | Node.js | Assisted by Claude Code Router |
| PicoClaw | CLI agent | Go | — |
| Nanobot | CLI agent | Python | — |

New frameworks can be added via the plugin system without modifying core code.

⚠️ Billing & Policy Notice (April 4, 2026): Third-party tools and agents such as OpenClaw may no longer route traffic through personal Claude Free/Pro/Max subscription credentials. Any Claude integration in ClawArena that uses Claude.ai OAuth login must switch to official API-key authentication via the Claude Console or a supported cloud provider. Such third-party connections now consume only paid extra-usage credits, not your subscription quota. Refer to Anthropic's legal and compliance pages for full policy details.


📊 Data & Evaluation

Each scenario contains:

  • Workspace files β€” documents, spreadsheets, code that the agent can read
  • Session histories β€” multi-channel chat logs (IM, email, Slack, etc.)
  • Evaluation questions β€” multi_choice (reasoning) and exec_check (execution verification)
  • Dynamic updates β€” new sessions and files injected between rounds

Two question types:

| Type | Tests | How |
| --- | --- | --- |
| multi_choice | Agent's reasoning and comprehension | Extract \bbox{A,B,...} from the response, compute IoU/F1 against ground truth |
| exec_check | Agent's actions and file output | Run shell commands to verify exit code and stdout |
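As a rough sketch of how the two scorers could work (the real implementations live under src/clawarena/qtypes/; the function names below are invented for illustration):

```python
import subprocess

def multi_choice_score(predicted: set, gold: set) -> tuple:
    """Illustrative set-based IoU and F1 over extracted option letters."""
    inter = len(predicted & gold)
    union = len(predicted | gold)
    iou = inter / union if union else 1.0
    f1 = 2 * inter / (len(predicted) + len(gold)) if (predicted or gold) else 1.0
    return iou, f1

def exec_check_score(cmd: str, expected_stdout: str) -> bool:
    """Illustrative execution check: require exit code 0 and matching stdout."""
    res = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return res.returncode == 0 and res.stdout.strip() == expected_stdout

# Agent answered {A, C}; ground truth is {A, B, C}: iou = 2/3, f1 = 0.8
iou, f1 = multi_choice_score({"A", "C"}, {"A", "B", "C"})
```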
[Figure: ClawArena data construction pipeline]

See Data Spec for the six-layer specification system used to construct all 64 scenarios. The complete construction specs — including the six-layer scenario design, synthesis guidelines, and pitfall documentation — are open-sourced in docs/data-spec/.

See Data Structure for the full format specification.


πŸ” Case Studies

Case 1–2: Multi-Source Conflict Reasoning & Dynamic Belief Revision
Case 3–4: Self-Diagnostic Accuracy & Implicit Preference Compliance Audit
Case 5–6: Execution-Verified Bug Fix & Statistical Methodology Conflict
Case 7–8: Norm Retroactivity Bias & Full-Dimension Integration

🦞 MetaClaw Integration

ClawArena supports MetaClaw as a transparent proxy layer for evaluating agents enhanced with memory, skills, and RL. Supported frameworks: OpenClaw and Nanobot.

Add a metaclaw field to tests.json:

{
  "metaclaw": {
    "enabled": true,
    "managed": true,
    "config_path": "metaclaw/memory.yaml",
    "memory_trigger": { "every_n_rounds": 6, "on_last_round": true }
  }
}

See MetaClaw Guide for managed/unmanaged modes, trigger configuration, and YAML templates.


🔌 Plugin System

Add new framework adapters without modifying core code:

clawarena infer --data tests.json --framework my_agent --out results/ --plugin my_agent.py
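A plugin file passed via --plugin might be shaped roughly like this hypothetical adapter; the actual interface (method names, registration hooks) is specified in the Plugin Guide and may differ.

```python
# my_agent.py -- hypothetical adapter shape; the real adapter interface is
# defined in the Plugin Guide, so treat every name here as an assumption.
class MyAgentAdapter:
    """Bridges ClawArena's engine to a custom agent."""
    name = "my_agent"  # framework id used with --framework my_agent

    def run_round(self, prompt: str, workspace_dir: str) -> str:
        # Invoke your agent against the scenario workspace here (e.g. via
        # subprocess calling your agent's CLI) and return its raw response.
        return f"(stub response for: {prompt[:40]})"
```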

See Plugin Guide for the adapter interface and engine round hooks.


📖 Documentation

| Document | Description |
| --- | --- |
| Installation | Setup guide for ClawArena, frameworks, and MetaClaw |
| CLI Reference | All commands, flags, and environment variables |
| Data Structure | Dataset format, question types, manifest schema |
| Provider Guide | LLM provider configuration and priority chain |
| MetaClaw Guide | MetaClaw integration modes and trigger hooks |
| Plugin Guide | Writing and registering external framework adapters |

πŸ—οΈ Project Structure

ClawArena
├── src/clawarena/
│   ├── cli.py               # CLI entry point
│   ├── core/                # Pipeline: infer, scoring, report, compare, check, run, clean, stats
│   ├── engines/             # Agent execution engines (per-framework)
│   ├── data_handlers/       # Data loading, validation, work copy management (per-framework)
│   ├── adapters/            # Framework adapter composition + registry
│   ├── qtypes/              # Question types: multi_choice, exec_check
│   ├── metaclaw/            # MetaClaw proxy lifecycle and trigger hooks
│   └── plugins/             # External adapter loading (--plugin)
├── data/clawarena/          # Dataset: 64 scenarios, 1879 questions
├── docs/                    # Documentation
│   └── data-spec/           # Six-layer data construction specification
├── scripts/                 # Setup, test runner, comparison utilities
├── helpers/                 # Framework-specific helper hooks
└── tests/                   # Test suite (229 tests)

πŸ› οΈ Development

pip install -e ".[dev]"
pytest

πŸ™ Related Projects

ClawArena builds on and evaluates the following open-source agent frameworks:

  • OpenClaw β€” the primary evaluated CLI agent.
  • MetaClaw β€” meta-learning proxy that enhances agents with memory, skills, and RL.
  • Claude Code β€” Anthropic's agentic coding tool.
  • Claude Code Router β€” route Claude Code requests to different models.
  • PicoClaw β€” lightweight Go-based CLI agent.
  • Nanobot β€” Python-native CLI agent with Anthropic API support.

📚 Citation

@article{ji2026clawarena,
  title={ClawArena: A Multi-Framework Benchmark for Evaluating AI Coding Agents on Realistic Multi-Session Scenarios},
  author={Ji, Haonian and Xiong, Kaiwen and Han, Siwei and Xia, Peng and Qiu, Shi and Zhou, Yiyang and Liu, Jiaqi and Li, Jinlong and Li, Bingzhou and Zheng, Zeyu and Xie, Cihang and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2604.04202},
  year={2026}
}

📄 License

This project is licensed under the MIT License.
