Tianci Xue¹, Weijian Qi*¹, Tianneng Shi*², Chan Hee Song¹, Boyu Gou¹, Dawn Song², Huan Sun†¹, Yu Su†¹
¹The Ohio State University, ²University of California, Berkeley
*Equal contribution, †Equal advising
📃 Paper • 📃 Blog • 🏆 Leaderboard • 🤗 Data
- [05/11/2025] Check out our updates in the paper:
  - Added the performance of Claude Computer Use 3.7.
  - WebJudge (o4-mini) achieves high agreement with human judgment (86%) with a low success rate gap (3.8%).
  - Released WebJudge-7B, a robust and reliable reward model for reinforcement learning.
Online-Mind2Web includes 300 tasks from 136 popular websites across various domains, covering a diverse set of real-world user needs (e.g., clothing, food, housing, and transportation) to evaluate web agents' performance in a live online environment.
We will regularly update Online-Mind2Web by replacing outdated or invalid tasks (e.g., due to website changes) to maintain its value as a rigorous benchmark for web agents. If you find any outdated tasks, please reach out to us and we will update them.
To ensure fair comparisons, we will aim to keep the updated tasks on the same websites as before and with a similar reference length. Additionally, once agent performance saturates on Online-Mind2Web, we will also revise simple tasks to preserve its long-term value.
To make evaluation in online environments more reliable and scalable, we propose WebJudge, an automatic evaluation method with three stages (a minimal sketch follows below):
1. Key point identification: the model is prompted to identify the key points necessary for completing the task, based on the given instruction and task description.
2. Key screenshot identification: important screenshots are selected from the agent's trajectory, retaining relevant visual evidence while discarding uninformative frames.
3. Outcome judgment: the model outputs a judgment based on the task description, key points, key screenshots, and action history.
This design preserves critical intermediate screenshots while mitigating the token overload issue.
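Below is a minimal Python sketch of the three-stage pipeline, assuming the OpenAI chat completions API as the backbone; the prompts and the yes/no screenshot filter are illustrative, not the exact ones used in the paper or the released scripts.

```python
# Minimal sketch of a WebJudge-style pipeline (illustrative prompts only).
from openai import OpenAI

client = OpenAI()
MODEL = "o4-mini"

def ask(prompt, images=()):
    """One LLM call; screenshots are passed as base64-encoded data URLs."""
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}}
                for b64 in images]
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

def web_judge(task, screenshots, action_history):
    # (1) Key point identification from the task description.
    key_points = ask(f"List the key points required to complete this task:\n{task}")
    # (2) Key screenshot identification: keep only frames showing evidence
    #     relevant to a key point (a simple yes/no filter for illustration).
    kept = [s for s in screenshots
            if ask(f"Task: {task}\nKey points: {key_points}\n"
                   "Does this screenshot show evidence relevant to any key "
                   "point? Answer yes or no.", [s]).lower().startswith("yes")]
    # (3) Outcome judgment over key points, kept screenshots, and actions.
    return ask(f"Task: {task}\nKey points: {key_points}\n"
               f"Action history: {action_history}\n"
               "Based on the screenshots and actions, did the agent complete "
               "the task? Answer SUCCESS or FAILURE with a brief reason.",
               kept)
```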
Agreement rate (AR, %) with human judgment for each auto-eval method, per agent:

| Model | Auto-Eval | SeeAct | Agent-E | Browser Use | Claude 3.5 | Claude 3.7 | Operator | Avg. AR |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | Autonomous Eval | 84.7 | 85.0 | 76.0 | 83.7 | 75.5 | 71.7 | 79.4 |
| GPT-4o | AgentTrek Eval | 73.0 | 64.3 | 63.3 | -- | -- | -- | 66.9 |
| GPT-4o | WebVoyager | -- | 75.3 | 71.3 | 74.0 | 72.0 | 76.7 | 73.9 |
| GPT-4o | WebJudge | 86.7 | 86.0 | 81.4 | 86.3 | 79.1 | 81.8 | 83.6 |
| o4-mini | Autonomous Eval | 79.7 | 85.7 | 86.0 | 84.3 | 68.0 | 73.3 | 79.5 |
| o4-mini | WebVoyager | -- | 80.3 | 79.0 | 81.7 | 74.3 | 78.3 | 78.7 |
| o4-mini | WebJudge | 85.3 | 86.3 | 89.3 | 87.0 | 82.3 | 83.7 | 85.7 |
| WebJudge-7B | WebJudge | 86.0 | 87.3 | 88.3 | 89.7 | 84.3 | 86.3 | 87.0 |
Excellent generalization capabilities on AgentRewardBench (5 OOD benchmarks)
| Methods | AB | VWA | WA | Work | Wk++ | Overall |
|---|---|---|---|---|---|---|
| Rule-based* | 25.0 | 85.2 | 79.0 | 100.0 | 83.3 | 83.8 |
| Autonomous Eval* | 83.3 | 61.2 | 67.6 | 96.4 | 59.3 | 67.6 |
| GPT-4o (A11y Tree)* | 77.8 | 63.0 | 70.2 | 94.6 | 63.0 | 69.8 |
| WebJudge (GPT-4o) | 66.7 | 69.8 | 72.6 | 92.3 | 75.0 | 73.7 |
| WebJudge-7B | 80.0 | 66.7 | 77.5 | 100.0 | 70.0 | 75.7 |
| WebJudge (o4-mini) | 100.0 | 74.5 | 81.2 | 100.0 | 90.0 | 82.0 |
WebJudge significantly outperforms existing methods, achieving overall precision of 73.7% (GPT-4o), 75.7% (WebJudge-7B), and 82.0% (o4-mini) across 1,302 trajectories spanning AssistantBench (AB), VisualWebArena (VWA), WebArena (WA), WorkArena (Work), and WorkArena++ (Wk++).
The high precision suggests that WebJudge holds potential as a robust and scalable reward model for downstream applications such as Rejection Sampling Fine-Tuning, Reflection, and Reinforcement Learning.
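As a hypothetical illustration of the rejection-sampling use case, the loop below keeps only trajectories that a WebJudge-style judge marks successful; `run_agent` and the trajectory fields are stand-ins, not part of this repo's API, and `web_judge` refers to the sketch above.

```python
# Hypothetical rejection-sampling loop using a WebJudge-style verdict as a
# binary reward. `run_agent` is a stand-in for your own agent rollout.
def rejection_sample(task, n_rollouts=8):
    accepted = []
    for _ in range(n_rollouts):
        trajectory = run_agent(task)  # assumed to expose screenshots/actions
        verdict = web_judge(task, trajectory.screenshots, trajectory.actions)
        if "SUCCESS" in verdict:      # reward = 1: keep as fine-tuning data
            accepted.append(trajectory)
    return accepted
```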
We have released the fine-tuned WebJudge-7B weights, which are now available on Hugging Face.
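A minimal loading sketch with Hugging Face transformers follows; the repo ID below is a placeholder (see the Hugging Face page linked above for the actual weights), and the text-only `AutoModelForCausalLM` usage is an assumption about the released architecture.

```python
# Minimal sketch of loading WebJudge-7B with transformers. MODEL_ID is a
# placeholder; AutoModelForCausalLM assumes a text-only causal LM release.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "osunlp/WebJudge-7B"  # hypothetical repo ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = ("Task: ...\nKey points: ...\nAction history: ...\n"
          "Did the agent complete the task? Answer SUCCESS or FAILURE.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```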
Create a conda environment and install dependencies:
conda create -n Online_Mind2Web python=3.11
conda activate Online_Mind2Web
pip install -r requirements.txt
You can run the provided example evaluation script directly to perform the evaluation. Adjust the `mode` parameter to choose among the auto-eval methods.
bash ./script/eval.sh
Important
- Start from the specified websites, not Google Search: To enable fair comparisons, please ensure that each task starts from the specified website in our benchmark. Starting from Google Search or other websites can lead agents to solve the task on different sites, resulting in varying difficulty levels and potentially skewed evaluation results.
- Include only factual actions, not agent outputs: The action history should contain only the factual actions the agent took to complete the task (e.g., clicking elements and typing text); see the illustrative format after this list. Do not include the final response or any other agent-generated outputs, as they may contain hallucinated content and lead to a high rate of false positives.
- Use o4-mini for WebJudge: WebJudge powered by o4-mini shows higher alignment with human judgment, achieving an average agreement rate of 85.7% with a narrow success rate gap of just 3.8%. Please therefore use o4-mini as the backbone for automatic evaluation.
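For illustration only, an action history in the spirit of the guideline above might look like the following; the string layout is an assumption, not the exact schema expected by the evaluation scripts.

```python
# Illustrative action history: factual actions only, no final agent response.
# The field layout here is an assumption, not the scripts' exact schema.
action_history = [
    "click [element: 'Men' navigation tab]",
    "type  [element: search box] [value: 'running shoes']",
    "press [key: Enter]",
    "click [element: 'Add to cart' button]",
]
```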
In certain scenarios, testing on the full Online-Mind2Web dataset may not be feasible due to cost, privacy, or legal constraints. To facilitate fair, apples-to-apples comparisons, we release both our human evaluation labels and auto-eval details.
- Human Evaluation: Task-level human evaluation labels are provided in the file.
- Auto-Evaluation: The results of WebJudge are available in the folder.
Note: Online-Mind2Web is derived from the original Mind2Web dataset. We kindly ask that you cite both the original and this work when using or referencing the data.
@article{xue2025illusionprogressassessingcurrent,
  title={An Illusion of Progress? Assessing the Current State of Web Agents},
  author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and Boyu Gou and Dawn Song and Huan Sun and Yu Su},
  year={2025},
  eprint={2504.01382},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2504.01382},
}
@inproceedings{deng2023mind2web,
  author = {Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
  pages = {28091--28114},
  publisher = {Curran Associates, Inc.},
  title = {Mind2Web: Towards a Generalist Agent for the Web},
  url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf},
  volume = {36},
  year = {2023}
}