Tianci Xue¹, Weijian Qi*¹, Tianneng Shi*², Chan Hee Song¹, Boyu Gou¹, Dawn Song², Huan Sun†¹, Yu Su†¹
¹The Ohio State University, ²University of California, Berkeley
*Equal contribution, †Equal advising
📃 Paper • 📃 Blog • 🏆 Leaderboard • 🤗 Data
- [05/11/2025] Check out our updates in the paper:
  - Added the performance of Claude Computer Use 3.7.
  - WebJudge (o4-mini) achieves high agreement with human judgment (86%) with a low success rate gap (3.8%).
  - Released WebJudge-7B, a robust and reliable reward model for reinforcement learning.
Online-Mind2Web includes 300 tasks from 136 popular websites across various domains, covering a diverse set of real-world user needs (e.g., clothing, food, housing, and transportation) to evaluate web agents' performance in a live online environment.
We will regularly update Online-Mind2Web by replacing outdated or invalid tasks (e.g., due to website changes) to maintain its value as a rigorous benchmark for web agents. If you find any outdated tasks, please reach out to us and we will update them.
To ensure fair comparisons, we will aim to keep the updated tasks on the same websites as before and with a similar reference length. Additionally, once agent performance saturates on Online-Mind2Web, we will also revise simple tasks to preserve its long-term value.
To make evaluation in online environments more reliable and scalable, we propose WebJudge, an automatic evaluation method with three stages (a minimal sketch follows below):
1. Key point identification: the model is prompted to identify the key points necessary for completing the task, based on the given instruction and task description.
2. Key screenshot identification: important screenshots are selected from the agent's trajectory, retaining relevant visual evidence while discarding uninformative frames.
3. Outcome judgment: the model outputs a judgment based on the task description, key points, key screenshots, and action history.
This design preserves critical intermediate screenshots while mitigating the token overload issue.
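Below is a minimal Python sketch of the three-stage pipeline, assuming the OpenAI chat completions API as the backbone; the prompts and the yes/no screenshot filter are illustrative, not the exact ones used in the paper or the released scripts.

```python
# Minimal sketch of a WebJudge-style pipeline (illustrative prompts only).
from openai import OpenAI

client = OpenAI()
MODEL = "o4-mini"

def ask(prompt, images=()):
    """One LLM call; screenshots are passed as base64-encoded data URLs."""
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}}
                for b64 in images]
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

def web_judge(task, screenshots, action_history):
    # (1) Key point identification from the task description.
    key_points = ask(f"List the key points required to complete this task:\n{task}")
    # (2) Key screenshot identification: keep only frames showing evidence
    #     relevant to a key point (a simple yes/no filter for illustration).
    kept = [s for s in screenshots
            if ask(f"Task: {task}\nKey points: {key_points}\n"
                   "Does this screenshot show evidence relevant to any key "
                   "point? Answer yes or no.", [s]).lower().startswith("yes")]
    # (3) Outcome judgment over key points, kept screenshots, and actions.
    return ask(f"Task: {task}\nKey points: {key_points}\n"
               f"Action history: {action_history}\n"
               "Based on the screenshots and actions, did the agent complete "
               "the task? Answer SUCCESS or FAILURE with a brief reason.",
               kept)
```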
Agreement rate (AR, %) with human judgment for each auto-eval method, per agent:

| Model | Auto-Eval | SeeAct | Agent-E | Browser Use | Claude 3.5 | Claude 3.7 | Operator | Avg. AR |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | Autonomous Eval | 84.7 | 85.0 | 76.0 | 83.7 | 75.5 | 71.7 | 79.4 |
| GPT-4o | AgentTrek Eval | 73.0 | 64.3 | 63.3 | -- | -- | -- | 66.9 |
| GPT-4o | WebVoyager | -- | 75.3 | 71.3 | 74.0 | 72.0 | 76.7 | 73.9 |
| GPT-4o | WebJudge | 86.7 | 86.0 | 81.4 | 86.3 | 79.1 | 81.8 | 83.6 |
| o4-mini | Autonomous Eval | 79.7 | 85.7 | 86.0 | 84.3 | 68.0 | 73.3 | 79.5 |
| o4-mini | WebVoyager | -- | 80.3 | 79.0 | 81.7 | 74.3 | 78.3 | 78.7 |
| o4-mini | WebJudge | 85.3 | 86.3 | 89.3 | 87.0 | 82.3 | 83.7 | 85.7 |
| WebJudge-7B | WebJudge | 86.0 | 87.3 | 88.3 | 89.7 | 84.3 | 86.3 | 87.0 |
Excellent generalization capabilities on AgentRewardBench (5 OOD benchmarks)
| Methods | AB | VWA | WA | Work | Wk++ | Overall |
|---|---|---|---|---|---|---|
| Rule-based* | 25.0 | 85.2 | 79.0 | 100.0 | 83.3 | 83.8 |
| Autonomous Eval* | 83.3 | 61.2 | 67.6 | 96.4 | 59.3 | 67.6 |
| GPT-4o (A11y Tree)* | 77.8 | 63.0 | 70.2 | 94.6 | 63.0 | 69.8 |
| WebJudge (GPT-4o) | 66.7 | 69.8 | 72.6 | 92.3 | 75.0 | 73.7 |
| WebJudge-7B | 80.0 | 66.7 | 77.5 | 100.0 | 70.0 | 75.7 |
| WebJudge (o4-mini) | 100.0 | 74.5 | 81.2 | 100.0 | 90.0 | 82.0 |
WebJudge significantly outperforms existing methods, achieving overall precision of 73.7% (GPT-4o), 75.7% (WebJudge-7B), and 82.0% (o4-mini) across 1,302 trajectories spanning AssistantBench (AB), VisualWebArena (VWA), WebArena (WA), WorkArena (Work), and WorkArena++ (Wk++).
The high precision suggests that WebJudge holds potential as a robust and scalable reward model for downstream applications such as Rejection Sampling Fine-Tuning, Reflection, and Reinforcement Learning.
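As a hypothetical illustration of the rejection-sampling use case, the loop below keeps only trajectories that a WebJudge-style judge marks successful; `run_agent` and the trajectory fields are stand-ins, not part of this repo's API, and `web_judge` refers to the sketch above.

```python
# Hypothetical rejection-sampling loop using a WebJudge-style verdict as a
# binary reward. `run_agent` is a stand-in for your own agent rollout.
def rejection_sample(task, n_rollouts=8):
    accepted = []
    for _ in range(n_rollouts):
        trajectory = run_agent(task)  # assumed to expose screenshots/actions
        verdict = web_judge(task, trajectory.screenshots, trajectory.actions)
        if "SUCCESS" in verdict:      # reward = 1: keep as fine-tuning data
            accepted.append(trajectory)
    return accepted
```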
We have released the fine-tuned WebJudge-7B weights, which are now available on Hugging Face.
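A minimal loading sketch with Hugging Face transformers follows; the repo ID below is a placeholder (see the Hugging Face page linked above for the actual weights), and the text-only `AutoModelForCausalLM` usage is an assumption about the released architecture.

```python
# Minimal sketch of loading WebJudge-7B with transformers. MODEL_ID is a
# placeholder; AutoModelForCausalLM assumes a text-only causal LM release.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "osunlp/WebJudge-7B"  # hypothetical repo ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = ("Task: ...\nKey points: ...\nAction history: ...\n"
          "Did the agent complete the task? Answer SUCCESS or FAILURE.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```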
Create a conda environment and install dependencies:
conda create -n Online_Mind2Web python=3.11
conda activate Online_Mind2Web
pip install -r requirements.txt
You can run the provided example evaluation script directly to perform the evaluation. Adjust the `mode` parameter to choose among the auto-eval methods.
bash ./script/eval.sh
Important
- Start from the specified websites, not Google Search: To enable fair comparisons, please ensure that each task starts from the specified website in our benchmark. Starting from Google Search or other websites can lead agents to solve the task on different sites, resulting in varying difficulty levels and potentially skewed evaluation results.
- Include only factual actions, not agent outputs: The action history should contain only the factual actions the agent took to complete the task (e.g., clicking elements and typing text); see the illustrative format after this list. Do not include the final response or any other agent-generated outputs, as they may contain hallucinated content and lead to a high rate of false positives.
- Use o4-mini for WebJudge: WebJudge powered by o4-mini shows higher alignment with human judgment, achieving an average agreement rate of 85.7% with a narrow success rate gap of just 3.8%. Please therefore use o4-mini as the backbone for automatic evaluation.
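For illustration only, an action history in the spirit of the guideline above might look like the following; the string layout is an assumption, not the exact schema expected by the evaluation scripts.

```python
# Illustrative action history: factual actions only, no final agent response.
# The field layout here is an assumption, not the scripts' exact schema.
action_history = [
    "click [element: 'Men' navigation tab]",
    "type  [element: search box] [value: 'running shoes']",
    "press [key: Enter]",
    "click [element: 'Add to cart' button]",
]
```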
In certain scenarios, testing on the full Online-Mind2Web dataset may not be feasible due to cost, privacy, or legal constraints. To facilitate fair, apples-to-apples comparisons, we release both our human evaluation labels and auto-eval details.
- Human Evaluation: Task-level human evaluation labels are provided in the file.
- Auto-Evaluation: The results of WebJudge are available in the folder.
Note: Online-Mind2Web is derived from the original Mind2Web dataset. We kindly ask that you cite both the original and this work when using or referencing the data.
@article{xue2025illusionprogressassessingcurrent,
  title={An Illusion of Progress? Assessing the Current State of Web Agents},
  author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and Boyu Gou and Dawn Song and Huan Sun and Yu Su},
  year={2025},
  eprint={2504.01382},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2504.01382},
}
@inproceedings{deng2023mind2web,
  author = {Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
  pages = {28091--28114},
  publisher = {Curran Associates, Inc.},
  title = {Mind2Web: Towards a Generalist Agent for the Web},
  url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf},
  volume = {36},
  year = {2023}
}