autohub-benchmark

This project defines common usage scenarios for web-based code, model, and dataset hosting platforms and provides the corresponding prompts and ground truth. These resources can be used to evaluate how accurately vision-language models (VLMs) localize GUI elements in these specialized scenarios.
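
As an illustration of how the prompts and ground truth could be used for scoring, here is a minimal Python sketch. It assumes, hypothetically, that each ground-truth entry is a bounding box in pixels, that the model answers with a click point written as `(x, y)`, and that the three outcome labels correspond to the Accuracy, Error, and Invalid columns. None of these definitions is confirmed by this project, and the names `parse_point`, `score`, and `summarize` are made up for this example.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Assumed ground-truth format: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]

_POINT = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_point(model_output: str) -> Optional[Tuple[float, float]]:
    """Return the first "(x, y)" pair found in the raw model text, or None."""
    match = _POINT.search(model_output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def score(model_output: str, box: Box) -> str:
    """Label one prediction as "correct", "error", or "invalid" (assumed definitions)."""
    point = parse_point(model_output)
    if point is None:
        return "invalid"  # no parseable click point in the output
    x, y = point
    left, top, right, bottom = box
    inside = left <= x <= right and top <= y <= bottom
    return "correct" if inside else "error"


def summarize(outcomes: List[str]) -> dict:
    """Convert a list of outcome labels into percentages rounded to one decimal."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    return {
        label: round(100.0 * counts[label] / len(outcomes), 1)
        for label in ("correct", "error", "invalid")
    }
```

For example, `score("click (50, 20)", (0, 0, 100, 40))` returns `"correct"`, while an answer that contains no coordinate pair is labeled `"invalid"`.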

Model Hosting Platform GUI Inference

Model	Platform	Accuracy (%)	Error (%)	Invalid (%)	Completion Rate (%)
AriaUI	Huggingface	70.8	12.5	6.7	100.0
	ModelScope	57.6	14.2	28.2	100.0
	OpenCSG	81.0	9.5	9.5	100.0
CogAgent	Huggingface	73.3	26.7	0.0	100.0
	ModelScope	57.9	29.1	13.0	96.3
	OpenCSG	57.1	19.0	23.8	100.0
Qwen3B	Huggingface	8.3	15.8	19.2	41.7
	ModelScope	0.0	28.6	20.6	49.2
	OpenCSG	4.8	4.8	9.5	19.0
Qwen7B	Huggingface	73.3	11.7	10.8	95.8
	ModelScope	55.5	30.2	8.5	95.2
	OpenCSG	71.4	14.3	14.3	100.0
SeeClick	Huggingface	39.2	36.7	24.2	100.0
	ModelScope	52.4	29.0	18.6	100.0
	OpenCSG	52.4	14.3	33.3	100.0
ShowUI	Huggingface	30.0	45.0	11.7	86.7
	ModelScope	43.3	26.7	14.3	88.9
	OpenCSG	23.8	52.4	9.5	85.7

Summary:

Model	Accuracy (%)	Error (%)	Invalid (%)	Completion Rate (%)
AriaUI	67.7	19.3	11.4	100.0
CogAgent	63.3	34.8	3.0	98.7
Qwen3B	4.5	10.8	12.9	62.6
Qwen7B	66.9	18.8	10.1	100.0
SeeClick	45.6	26.2	26.8	97.9
ShowUI	32.1	42.8	11.6	85.0

Code Hosting Platform GUI Inference

Model	Platform	Accuracy (%)	Error (%)	Invalid (%)	Completion Rate (%)
AriaUI	GitCode	57.1	28.5	14.3	100.0
	Gitea	71.4	28.5	0.0	100.0
	Gitee	57.1	28.5	14.3	100.0
	Github	71.4	14.3	14.3	100.0
	GitLab	71.4	14.3	14.3	100.0
CogAgent	GitCode	71.4	28.5	0.0	100.0
	Gitea	71.4	28.5	0.0	100.0
	Gitee	100.0	0.0	0.0	100.0
	Github	57.1	42.8	0.0	100.0
	GitLab	85.7	14.3	0.0	100.0
Qwen3B	GitCode	14.2	28.5	42.8	85.7
	Gitea	14.2	57.1	14.2	85.7
	Gitee	14.2	42.8	28.5	100.0
	Github	0.0	28.5	57.1	85.7
	GitLab	14.2	28.5	28.5	71.4
Qwen7B	GitCode	71.4	0.0	28.5	100.0
	Gitea	57.1	28.5	14.2	100.0
	Gitee	28.5	57.1	14.2	100.0
	Github	0.0	14.2	85.7	100.0
	GitLab	85.7	14.2	0.0	100.0
SeeClick	GitCode	28.5	48.5	28.5	100.0
	Gitea	28.5	28.5	48.5	100.0
	Gitee	28.5	57.1	14.2	100.0
	Github	14.2	57.1	28.5	100.0
	GitLab	0.0	71.4	28.5	100.0
ShowUI	GitCode	28.5	48.5	14.2	85.7
	Gitea	57.1	48.5	0.0	100.0
	Gitee	57.1	28.5	0.0	85.7
	Github	48.5	14.2	28.5	85.7
	GitLab	48.5	14.2	14.2	71.4

Summary:

Model	Accuracy (%)	Error (%)	Invalid (%)	Completion Rate (%)
AriaUI	65.7	22.8	11.4	100.0
CogAgent	62.9	22.8	0.0	100.0
Qwen3B	11.4	37.1	37.1	85.7
Qwen7B	48.5	22.9	28.6	100.0
SeeClick	20.0	51.4	28.6	100.0
ShowUI	45.7	28.6	11.4	85.7
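
The summary rows appear to aggregate the per-platform results. As a worked example, and assuming every code hosting platform contributes the same number of tasks (an assumption, not stated in the source), a plain mean of the five per-platform AriaUI accuracy values matches the AriaUI summary accuracy. The helper `mean_percentage` is hypothetical.

```python
def mean_percentage(values):
    """Unweighted mean of percentage values, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


# AriaUI accuracy on GitCode, Gitea, Gitee, Github, GitLab (from the table above).
ariaui_accuracy = [57.1, 71.4, 57.1, 71.4, 71.4]

print(mean_percentage(ariaui_accuracy))  # prints 65.7
```

Other rows may have been aggregated differently, so treat this as a consistency check rather than the project's actual method.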
