VET-Bench simulates classic shell games rendered via Three.js. Humans solve these effortlessly, whereas state-of-the-art VLMs score at random chance (~33%). Our proposed Molmo2-SGCoT, fine-tuned with just 300 synthetic trajectory samples, reaches over 90% accuracy.
The videos are generated using Three.js simulations that run entirely in the browser. You can play the games interactively or batch-export videos as .mp4 files for evaluation.
| Demo | File | Description |
|---|---|---|
| Play | cup.html | Track a ball hidden beneath visually identical opaque cups that undergo positional swaps. |
| Play | card.html | Track a card after being flipped face-down and shuffled. |
To play: open the HTML files in a browser and click Start Game. Watch the shuffles, then click your guess.
To generate video data: use the Batch Export panel on the right side to configure resolution, FPS, number of videos, and randomization, then export a .zip of .mp4 files with accompanying metadata JSON.
Clone this repo (includes videos via Git LFS):
git lfs install
git clone https://github.com/liutiedong/shellgame.git
VET-Bench is also available from Hugging Face:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="tiedong/vetbench", repo_type="dataset", local_dir="vetbench")
Each script under models/ follows the same pattern: load QA.json, send each video and its question to the model API, and save the responses to results/. For example, to evaluate Gemini:
cd models
python gemini.py
After running a model, compute accuracy with:
# Evaluate all result files in results/cup/
python compute_accuracy.py results/cup/*.json
# Evaluate specific files
python compute_accuracy.py results/cup/gemini-3-pro-preview.json results/card/gemini-3-pro-preview.json
# Show per-item details
python compute_accuracy.py -d results/cup/Molmo2-SGCoT.json
To regenerate the QA.json evaluation files from the raw video metadata generated by the HTML files:
# Generate QA for cup game
python generate_qa.py vetbench/cup -o vetbench/cup/QA.json
# Generate QA for card game
python generate_qa.py vetbench/card -o vetbench/card/QA.json

| | Cup Game | Card Game |
|---|---|---|
| Setup | A ball is placed under one of 3 cups | 3 playing cards shown face-up, then flipped |
| Action | Cups are shuffled 5 times | Cards are shuffled 5 times |
| Question | Which cup contains the ball? | Where is the Queen of Hearts? |
| Videos | 50 | 50 |
| Resolution | 640 x 480 | 640 x 480 |
| FPS | 30 | 30 |
| Duration | ~12 seconds | ~12 seconds |
Random-chance baseline for 3-way multiple choice: 33.3%.
| Model | Cup | Card | Overall |
|---|---|---|---|
| Gemini 3 Pro Preview | 34.0 | 40.0 | 37.0 |
| Gemini 2.5 Pro | 38.0 | 30.0 | 34.0 |
| Gemini 3 Flash Preview | 30.0 | 30.0 | 30.0 |
| Gemini 2.5 Flash | 22.0 | 28.0 | 25.0 |
| Qwen3.5-397B-A17B | 38.0 | 32.0 | 35.0 |
| Qwen3-VL-30B-A3B-Thinking | 38.0 | 30.0 | 34.0 |
| Qwen3-VL-8B-Thinking | 34.0 | 30.0 | 32.0 |
| Qwen3-VL-8B-Instruct | 30.0 | 30.0 | 30.0 |
| Qwen3-VL-30B-A3B-Instruct | 24.0 | 32.0 | 28.0 |
| Doubao-Seed-1.8 | 28.0 | 38.0 | 33.0 |
| Doubao-Seed-2.0-Mini | 30.0 | 32.0 | 31.0 |
| Molmo2-8B | 30.0 | 38.0 | 34.0 |
| Perception-LM-8B | 40.0 | 34.0 | 37.0 |
| GLM-4.6V-Flash | 34.0 | 28.0 | 31.0 |
| Kimi-K2.5 | 28.0 | 32.0 | 30.0 |
| ERNIE-4.5-VL-28B-A3B-Thinking | 28.0 | 30.0 | 29.0 |
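With only 50 videos per task, a score a few points above 33.3% is not necessarily better than guessing. The sketch below is an illustrative check that is not part of the repository: it computes a one-sided exact binomial tail probability for a given accuracy under pure 3-way guessing. The function names are hypothetical, and the 40.0 figure is the highest single-task score in the table above.

```python
from math import comb

N_VIDEOS = 50   # videos per task, as stated in the dataset table
CHANCE = 1 / 3  # 3-way multiple choice


def binom_tail_ge(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def p_value_above_chance(accuracy_pct: float, n: int = N_VIDEOS) -> float:
    """Probability of scoring at least accuracy_pct by guessing alone."""
    correct = round(accuracy_pct / 100 * n)
    return binom_tail_ge(correct, n, CHANCE)


# Compare a chance-level score with the reported >90% of Molmo2-SGCoT.
for label, acc in [("best baseline, single task", 40.0), ("90% accuracy", 90.0)]:
    print(f"{label}: p = {p_value_above_chance(acc):.3g}")
```

A large tail probability means the score is compatible with random guessing. A vanishingly small one means it is not.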
The SGCoT/ directory contains the full pipeline:
- Synthesize trajectories from real Molmo2 tracking data (real_data/) (generate_cup_tracks.py, generate_card_tracks.py)
- Build training data in chat-style JSONL format (generate_training_dataset.py)
The output format uses a <tracks> tag:
<tracks coords="0.0 1 772 524;0.5 1 805 310;...;12.0 1 216 517">the cup that contains the ball</tracks> Answer: left.
Each semicolon-separated entry in coords has the form t obj x y:
- t: timestamp (0.0–12.0s, step 0.5s)
- obj: object index (always 1)
- x, y: coordinates in a [0, 1000] normalized space
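The format above can be parsed with a few lines of Python. This is a minimal sketch rather than code from the repository: TrackPoint, parse_tracks, and to_pixels are hypothetical names, the example string is a shortened version of the one above (without the elided middle entries), and the 640 x 480 default assumes the benchmark resolution.

```python
import re
from typing import NamedTuple


class TrackPoint(NamedTuple):
    t: float    # timestamp in seconds
    obj: int    # object index
    x: float    # normalized x in [0, 1000]
    y: float    # normalized y in [0, 1000]


TRACKS_RE = re.compile(
    r'<tracks coords="([^"]*)">(.*?)</tracks>\s*Answer:\s*(\w+)', re.S
)


def parse_tracks(output: str):
    """Split a model output into track points, the tracked label, and the answer."""
    match = TRACKS_RE.search(output)
    if match is None:
        raise ValueError("no <tracks> tag found")
    coords, label, answer = match.groups()
    points = []
    for entry in coords.split(";"):
        t, obj, x, y = entry.split()
        points.append(TrackPoint(float(t), int(obj), float(x), float(y)))
    return points, label, answer.lower()


def to_pixels(point: TrackPoint, width: int = 640, height: int = 480):
    """Map normalized [0, 1000] coordinates to pixel coordinates."""
    return point.x / 1000 * width, point.y / 1000 * height


example = (
    '<tracks coords="0.0 1 772 524;0.5 1 805 310;12.0 1 216 517">'
    "the cup that contains the ball</tracks> Answer: left."
)
points, label, answer = parse_tracks(example)
```

Here the last point has x = 216, which falls in the left third of the frame and agrees with the stated answer.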
The final training set contains 300 samples (200 cup + 100 card). Fine-tuning Molmo2-8B on this data yields Molmo2-SGCoT.
SGCoT/Molmo2-SGCoT.ipynb fine-tunes Molmo2-8B using QLoRA (training on 300 samples takes only ~3 min on a single A100).
cd SGCoT
python generate_cup_tracks.py # → trajectory_cup.json (200 samples)
python generate_card_tracks.py # → trajectory_card.json (100 samples)
python generate_training_dataset.py # → train.jsonl (300 samples, shuffled)
Each entry in the video metadata JSON contains:
- video: filename (e.g., cup_001.mp4)
- fps, total_frames, resolution: video properties
- task: "cup" or "card"
- game_settings: number of objects, swap count, speed
- ground_truth: correct final position(s)
- initial: starting arrangement
- intermediate: arrangement after each swap
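The initial, intermediate, and ground_truth fields are linked by the swap sequence. The sketch below illustrates that relationship under stated assumptions: an arrangement is a list ordered left to right, each swap is a pair of 0-based position indices, and the names simulate_swaps and random_swaps are hypothetical. The actual metadata schema produced by the HTML files may differ.

```python
import random


def simulate_swaps(initial, swaps):
    """Apply each (i, j) swap in order and record the arrangement after it."""
    arrangement = list(initial)
    intermediate = []
    for i, j in swaps:
        arrangement[i], arrangement[j] = arrangement[j], arrangement[i]
        intermediate.append(list(arrangement))
    return intermediate


def random_swaps(n_objects, n_swaps, rng):
    """Draw n_swaps random pairs of distinct positions."""
    return [tuple(rng.sample(range(n_objects), 2)) for _ in range(n_swaps)]


# Hand-written example: the ball starts under the left cup.
initial = ["ball", "empty", "empty"]
swaps = [(0, 1), (1, 2), (0, 2)]
intermediate = simulate_swaps(initial, swaps)

# The final arrangement determines the ground truth position.
final = intermediate[-1] if intermediate else list(initial)
ground_truth = final.index("ball")
```

Tracking the hidden object reduces to composing these swaps, which is why the task is trivial with the swap sequence in hand and hard from pixels alone.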
Each entry in QA.json contains:
- video: video filename
- question: MCQ question with lettered options (A/B/C)
- answer: correct letter (A, B, or C)
Example questions:
Cup: Which cup contains the ball at the end of the video? (A) Left (B) Middle (C) Right
Card: Where is the Queen of Hearts at the end of the video? (A) Left (B) Middle (C) Right
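A QA entry of this shape can be assembled from a task name and a final position. This is a hedged sketch, not the logic of generate_qa.py: build_qa_entry is a hypothetical helper, and it assumes the final position is given as a 0-based index counted from the left.

```python
QUESTIONS = {
    "cup": "Which cup contains the ball at the end of the video?",
    "card": "Where is the Queen of Hearts at the end of the video?",
}
OPTIONS = ["Left", "Middle", "Right"]
LETTERS = "ABC"


def build_qa_entry(video: str, task: str, final_index: int) -> dict:
    """Build one MCQ entry with the fields video, question, and answer."""
    options = " ".join(f"({letter}) {name}" for letter, name in zip(LETTERS, OPTIONS))
    return {
        "video": video,
        "question": f"{QUESTIONS[task]} {options}",
        "answer": LETTERS[final_index],
    }


entry = build_qa_entry("cup_001.mp4", "cup", 2)
```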
Each entry in a results file contains all fields from QA.json plus:
- model: model identifier
- response: raw model response text
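Given entries with answer and response fields, accuracy can be computed by extracting a choice from the raw response and comparing it with the correct letter. The sketch below is one possible heuristic and should not be read as the behavior of compute_accuracy.py: extract_choice and accuracy are hypothetical names, the last standalone A/B/C or left/middle/right token is taken as the model's choice, and the mapping of position words to letters follows the example questions above.

```python
import re

WORD_TO_LETTER = {"left": "A", "middle": "B", "right": "C"}

# Standalone capital A/B/C, or a position word in any letter case.
TOKEN_RE = re.compile(r"\b(?:([ABC])|((?i:left|middle|right)))\b")


def extract_choice(response: str):
    """Return the last choice mentioned in the response, or None if absent."""
    choice = None
    for match in TOKEN_RE.finditer(response):
        choice = match.group(1) or WORD_TO_LETTER[match.group(2).lower()]
    return choice


def accuracy(entries) -> float:
    """Percentage of entries whose extracted choice equals the answer."""
    if not entries:
        return 0.0
    correct = sum(extract_choice(e["response"]) == e["answer"] for e in entries)
    return 100.0 * correct / len(entries)


sample = [
    {"answer": "A", "response": "Answer: left."},
    {"answer": "C", "response": "I think it is (B)."},
]
print(f"{accuracy(sample):.1f}")
```

Taking the last token is a simplification: a response that restates the options after giving its answer would be scored incorrectly, so stricter parsing may be preferable in practice.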
To add a new model:
- Create a new script in models/ following the existing pattern
- Implement analyze_video(video_path, prompt_text) -> str
- Load QA data from vetbench/{cup,card}/QA.json
- Save results to results/{cup,card}/{model_name}.json
- Run python compute_accuracy.py results/cup/{model_name}.json to evaluate
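The steps above can be sketched as a small skeleton. It is illustrative only and not copied from the scripts in models/: evaluate, data_root, and results_root are hypothetical names, QA.json is assumed to be a JSON list of entries, and each video is assumed to sit next to its QA.json. The body of analyze_video is left unimplemented because it depends on the model's API.

```python
import json
from pathlib import Path


def analyze_video(video_path: str, prompt_text: str) -> str:
    """Send one video and question to the model; return its raw response."""
    raise NotImplementedError("call your model's API here")


def evaluate(task, model_name, analyze=analyze_video,
             data_root=Path("vetbench"), results_root=Path("results")):
    """Run a model over one task's QA.json and save the responses."""
    qa_path = Path(data_root) / task / "QA.json"
    entries = json.loads(qa_path.read_text(encoding="utf-8"))

    results = []
    for entry in entries:
        video_path = qa_path.parent / entry["video"]  # assumed layout
        response = analyze(str(video_path), entry["question"])
        # Keep all QA fields and add the model identifier and raw response.
        results.append({**entry, "model": model_name, "response": response})

    out_path = Path(results_root) / task / f"{model_name}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return out_path
```

Passing the analyze callable as a parameter keeps the loop testable with a stub before any API calls are made.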
VET-Bench/
├── cup.html # Interactive cup game & video generator (Three.js)
├── card.html # Interactive card game & video generator (Three.js)
├── vetbench/ # Core benchmark dataset
│ ├── cup/ # Cup game: 50 videos + metadata + QA
│ └── card/ # Card game: 50 videos + metadata + QA
├── object_count/ # Ablation: videos for varying number of objects (2/4/5 cups)
├── swap_count/ # Ablation: videos for varying number of swaps (0–4)
├── perception_test/ # Filtered Subsets for the Perception Test
│ ├── cup_games_distinct_cups/
│ ├── cup_games_transparent_cups/
│ └── cup_games_filtered/
├── models/ # Evaluation scripts for each VLM
│ ├── gemini.py # Google Gemini (2.5/3 Pro/Flash)
│ ├── qwen.py # Alibaba Qwen3-VL (8B–397B)
│ ├── doubao.py # ByteDance Doubao-Seed
│ ├── kimi.py # Moonshot Kimi-K2.5
│ ├── glm.py # Zhipu GLM-4.6V-Flash
│ ├── ernie.py # Baidu ERNIE-4.5-VL
│ ├── molmo2.py # Allen AI Molmo2-8B
│ └── perceptionLM.py # Meta Perception-LM-8B
├── results/ # Experiment Results
│ ├── cup/ # Per-model results on cup game
│ ├── card/ # Per-model results on card game
│ ├── object_count/ # Per-model results on object count ablation
│ └── swap_count/ # Per-model results on swap count ablation
├── SGCoT/ # Spatiotemporal Grounded CoT training pipeline
│ ├── real_data/ # Real tracking data (cup + card)
│ ├── generate_cup_tracks.py # Synthesize cup trajectories
│ ├── generate_card_tracks.py # Synthesize card trajectories
│ ├── generate_training_dataset.py # Build chat-style JSONL from synthesized trajectories
│ ├── trajectory_cup.json # 200 synthetic cup trajectories
│ ├── trajectory_card.json # 100 synthetic card trajectories
│ ├── train.jsonl # Final 300-sample training set used to train Molmo2-SGCoT
│ └── train.ipynb # Colab notebook: QLoRA fine-tuning Molmo2 → Molmo2-SGCoT
├── generate_qa.py # Generate MCQ evaluation pairs from metadata
├── compute_accuracy.py # Compute per-task and overall accuracy
└── README.md
If you find this work useful, please cite:
@misc{liu2026visionlanguagemodelssolveshell,
title={Can Vision-Language Models Solve the Shell Game?},
author={Tiedong Liu and Wee Sun Lee},
year={2026},
eprint={2603.08436},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.08436},
}

This project is released under the MIT License.