Can Vision-Language Models Solve the Shell Game?

VET-Bench simulates classic shell games rendered with Three.js. Humans solve them effortlessly, whereas state-of-the-art VLMs score at random chance (~33%). Our proposed Molmo2-SGCoT, fine-tuned on just 300 synthetic trajectory samples, reaches over 90% accuracy.

Interactive Demo & Video Generation

The videos are generated using Three.js simulations that run entirely in the browser. You can play the games interactively or batch-export videos as .mp4 files for evaluation.

Demo	File	Description
Play	`cup.html`	Track a ball hidden beneath visually identical opaque cups that undergo positional swaps.
Play	`card.html`	Track a card after being flipped face-down and shuffled.

To play: open the HTML files in a browser and click Start Game. Watch the shuffles, then click your guess.

To generate video data: use the Batch Export panel on the right side to configure resolution, FPS, number of videos, and randomization, then export a .zip of .mp4 files with accompanying metadata JSON.

Getting Started

Download VET-Bench Videos

Clone this repo (includes videos via Git LFS):

git lfs install
git clone https://github.com/liutiedong/shellgame.git

VET-Bench is also available from Hugging Face:

from huggingface_hub import snapshot_download

snapshot_download(repo_id="tiedong/vetbench", repo_type="dataset", local_dir="vetbench")

Evaluate a Model

Each script under models/ follows the same pattern: load QA.json, send each video and its question to the model API, and save the responses to results/. For example, to evaluate Gemini:

cd models
python gemini.py

After running a model, compute accuracy with:

# Evaluate all result files in results/cup/
python compute_accuracy.py results/cup/*.json

# Evaluate specific files
python compute_accuracy.py results/cup/gemini-3-pro-preview.json results/card/gemini-3-pro-preview.json

# Show per-item details
python compute_accuracy.py -d results/cup/Molmo2-SGCoT.json

To regenerate the QA.json evaluation files from the raw video metadata exported by the HTML files:

# Generate QA for cup game
python generate_qa.py vetbench/cup -o vetbench/cup/QA.json

# Generate QA for card game
python generate_qa.py vetbench/card -o vetbench/card/QA.json

Project Overview

Tasks

	Cup Game	Card Game
Setup	A ball is placed under one of 3 cups	3 playing cards shown face-up, then flipped
Action	Cups are shuffled 5 times	Cards are shuffled 5 times
Question	Which cup contains the ball?	Where is the Queen of Hearts?
Videos	50	50
Resolution	640 x 480	640 x 480
FPS	30	30
Duration	~12 seconds	~12 seconds
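The ground truth of both tasks follows from tracking one object through a sequence of pairwise swaps. The sketch below illustrates that bookkeeping under stated assumptions: each shuffle exchanges exactly two slots, slots are indexed 0 to 2 from left to right, and the helper names `final_position` and `random_swaps` are hypothetical. It is not the logic used in cup.html or card.html.

```python
import random

POSITIONS = ["Left", "Middle", "Right"]  # slot 0, 1, 2 (assumed ordering)


def final_position(start, swaps):
    """Follow the tracked object through a list of (a, b) slot swaps."""
    position = start
    for a, b in swaps:
        if position == a:
            position = b
        elif position == b:
            position = a
    return position


def random_swaps(num_objects=3, num_swaps=5, seed=None):
    """Draw num_swaps random pairs of distinct slots (hypothetical sampler)."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(num_objects), 2)) for _ in range(num_swaps)]


if __name__ == "__main__":
    swaps = random_swaps(seed=0)
    end = final_position(0, swaps)
    print("swaps:", swaps, "-> ball ends at:", POSITIONS[end])
```

Only the tracked object's slot needs to be updated per swap; the other objects are visually identical, which is what makes the task depend on continuous tracking rather than appearance.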

Results

Random-chance baseline for 3-way multiple choice: 33.3%.

Model	Cup	Card	Overall
Gemini 3 Pro Preview	34.0	40.0	37.0
Gemini 2.5 Pro	38.0	30.0	34.0
Gemini 3 Flash Preview	30.0	30.0	30.0
Gemini 2.5 Flash	22.0	28.0	25.0
Qwen3.5-397B-A17B	38.0	32.0	35.0
Qwen3-VL-30B-A3B-Thinking	38.0	30.0	34.0
Qwen3-VL-8B-Thinking	34.0	30.0	32.0
Qwen3-VL-8B-Instruct	30.0	30.0	30.0
Qwen3-VL-30B-A3B-Instruct	24.0	32.0	28.0
Doubao-Seed-1.8	28.0	38.0	33.0
Doubao-Seed-2.0-Mini	30.0	32.0	31.0
Molmo2-8B	30.0	38.0	34.0
Perception-LM-8B	40.0	34.0	37.0
GLM-4.6V-Flash	34.0	28.0	31.0
Kimi-K2.5	28.0	32.0	30.0
ERNIE-4.5-VL-28B-A3B-Thinking	28.0	30.0	29.0
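As a rough sanity check on how far a score can drift from 33.3% by luck alone, one can compute the spread of a pure guesser. This is a minimal sketch assuming independent, uniform 3-way guesses (a binomial model chosen here for illustration, not an analysis taken from the paper); `chance_std_points` is a hypothetical helper.

```python
import math


def chance_std_points(n_videos, n_choices=3):
    """Standard deviation, in accuracy points, of a uniform random guesser."""
    p = 1.0 / n_choices
    return 100.0 * math.sqrt(p * (1.0 - p) / n_videos)


if __name__ == "__main__":
    print(round(chance_std_points(50), 2))   # per task (50 videos): 6.67
    print(round(chance_std_points(100), 2))  # overall (100 videos): 4.71
```

Under this assumption, every per-task score in the table (22.0 to 40.0) lies within about two standard deviations of the 33.3% baseline.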

Molmo2-SGCoT: Spatiotemporal Grounded Chain-of-Thought

Data Generation

The SGCoT/ directory contains the full pipeline:

Synthesize trajectories from real Molmo2 tracking data (real_data/) (generate_cup_tracks.py, generate_card_tracks.py)
Build training data in chat-style JSONL format (generate_training_dataset.py)

The output format uses a <tracks> tag whose coords attribute holds semicolon-separated "t obj x y" entries:

<tracks coords="0.0 1 772 524;0.5 1 805 310;...;12.0 1 216 517">the cup that contains the ball</tracks> Answer: left.

t: timestamp (0.0–12.0s, step 0.5s)
obj: object index (always 1)
x, y: coordinates in a [0, 1000] normalized space
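A minimal sketch of reading this format back into structured data. The function names `parse_tracks` and `to_pixels` are hypothetical, the sample string is a shortened version of the example above with the elided middle entries dropped, and scaling by the 640 x 480 video size is an assumption about how the normalized space maps to pixels.

```python
import re

TRACKS_RE = re.compile(
    r'<tracks coords="([^"]*)">(.*?)</tracks>\s*Answer:\s*(\w+)', re.S
)


def parse_tracks(text):
    """Return (points, label, answer); points are (t, obj, x, y) tuples."""
    match = TRACKS_RE.search(text)
    if match is None:
        raise ValueError("no <tracks> tag found")
    coords, label, answer = match.groups()
    points = []
    for entry in coords.split(";"):
        t, obj, x, y = entry.split()
        points.append((float(t), int(obj), float(x), float(y)))
    return points, label, answer


def to_pixels(x, y, width=640, height=480):
    """Map [0, 1000] normalized coordinates to pixels (assumed mapping)."""
    return x / 1000 * width, y / 1000 * height


if __name__ == "__main__":
    sample = (
        '<tracks coords="0.0 1 772 524;0.5 1 805 310;12.0 1 216 517">'
        "the cup that contains the ball</tracks> Answer: left."
    )
    points, label, answer = parse_tracks(sample)
    print(len(points), "points;", label, "->", answer)
```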

The final training set contains 300 samples (200 cup + 100 card). Fine-tuning Molmo2-8B on this data yields Molmo2-SGCoT.

Training

SGCoT/Molmo2-SGCoT.ipynb fine-tunes Molmo2-8B using QLoRA. Training on the 300 samples takes only about 3 minutes on a single A100.

Generate Training Data

cd SGCoT
python generate_cup_tracks.py        # → trajectory_cup.json (200 samples)
python generate_card_tracks.py       # → trajectory_card.json (100 samples)
python generate_training_dataset.py  # → train.jsonl (300 samples, shuffled)

Data Format

Video Metadata (`cup.json` / `card.json`)

Each entry contains:

video: filename (e.g., cup_001.mp4)
fps, total_frames, resolution: video properties
task: "cup" or "card"
game_settings: number of objects, swap count, speed
ground_truth: correct final position(s)
initial: starting arrangement
intermediate: arrangement after each swap

QA Format (`QA.json`)

Each entry contains:

video: video filename
question: MCQ question with lettered options (A/B/C)
answer: correct letter (A, B, or C)

Example questions:

Cup: Which cup contains the ball at the end of the video? (A) Left (B) Middle (C) Right

Card: Where is the Queen of Hearts at the end of the video? (A) Left (B) Middle (C) Right

Model Output Format (`results/*.json`)

Each entry contains all fields from QA.json plus:

model: model identifier
response: raw model response text
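Scoring a results file amounts to extracting a choice from each `response` and comparing it with `answer`. The sketch below is an illustrative heuristic, not the repository's compute_accuracy.py: it takes the last standalone A/B/C or left/middle/right in the response, and the word-to-letter mapping assumes the option order shown in the example questions. `extract_choice` and `accuracy` are hypothetical names.

```python
import re

WORD_TO_LETTER = {"left": "A", "middle": "B", "right": "C"}
CHOICE_RE = re.compile(r"\b([ABC]|[Ll]eft|[Mm]iddle|[Rr]ight)\b")


def extract_choice(response):
    """Return the last A/B/C (or mapped position word) found, else None."""
    matches = CHOICE_RE.findall(response)
    if not matches:
        return None
    last = matches[-1]
    return WORD_TO_LETTER.get(last.lower(), last)


def accuracy(entries):
    """Percentage of entries whose extracted choice equals the answer."""
    correct = sum(extract_choice(e["response"]) == e["answer"] for e in entries)
    return 100.0 * correct / len(entries)


if __name__ == "__main__":
    demo = [
        {"answer": "A", "response": "Answer: left."},
        {"answer": "B", "response": "The ball is under cup (C)."},
    ]
    print(accuracy(demo))
```

Taking the last match rather than the first is a design choice aimed at responses that reason before answering; a free-form response can still defeat a heuristic like this, so per-item inspection remains useful.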

Adding a New Model

Create a new script in models/ following the existing pattern
Implement analyze_video(video_path, prompt_text) -> str
Load QA data from vetbench/{cup,card}/QA.json
Save results to results/{cup,card}/{model_name}.json
Run python compute_accuracy.py results/cup/{model_name}.json to evaluate
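The steps above can be sketched as a small script. This is a hedged skeleton rather than a copy of the existing scripts: it assumes QA.json is a JSON list of entries, that the videos sit next to QA.json, and that paths are given relative to the working directory. `MODEL_NAME` and `run_task` are hypothetical names, and `analyze_video` is left as a stub to be filled in with the actual model call.

```python
import json
from pathlib import Path

MODEL_NAME = "my-model"  # hypothetical identifier used for the output filename


def analyze_video(video_path: str, prompt_text: str) -> str:
    """Send one video and prompt to the model; replace with a real call."""
    raise NotImplementedError("plug in the model or API call here")


def run_task(task, data_root="vetbench", out_root="results", analyze=analyze_video):
    """Evaluate one task ("cup" or "card") and write the results file."""
    task_dir = Path(data_root) / task
    qa_items = json.loads((task_dir / "QA.json").read_text())
    results = []
    for item in qa_items:
        response = analyze(str(task_dir / item["video"]), item["question"])
        # Keep all QA fields and add the two result fields.
        results.append({**item, "model": MODEL_NAME, "response": response})
    out_path = Path(out_root) / task / f"{MODEL_NAME}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(results, indent=2))
    return out_path


if __name__ == "__main__":
    for task in ("cup", "card"):
        print("saved", run_task(task))
```

Passing the model call in through the `analyze` parameter keeps the loop testable with a stub before any API credentials are configured.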

Repository Structure

VET-Bench/
├── cup.html                     # Interactive cup game & video generator (Three.js)
├── card.html                    # Interactive card game & video generator (Three.js)
├── vetbench/                    # Core benchmark dataset
│   ├── cup/                     #   Cup game: 50 videos + metadata + QA
│   └── card/                    #   Card game: 50 videos + metadata + QA
├── object_count/                # Ablation: videos for varying number of objects (2/4/5 cups)
├── swap_count/                  # Ablation: videos for varying number of swaps (0–4)
├── perception_test/             # Filtered subsets for the Perception Test
│   ├── cup_games_distinct_cups/
│   ├── cup_games_transparent_cups/
│   └── cup_games_filtered/
├── models/                      # Evaluation scripts for each VLM
│   ├── gemini.py                #   Google Gemini (2.5/3 Pro/Flash)
│   ├── qwen.py                  #   Alibaba Qwen3-VL (8B–397B)
│   ├── doubao.py                #   ByteDance Doubao-Seed
│   ├── kimi.py                  #   Moonshot Kimi-K2.5
│   ├── glm.py                   #   Zhipu GLM-4.6V-Flash
│   ├── ernie.py                 #   Baidu ERNIE-4.5-VL
│   ├── molmo2.py                #   Allen AI Molmo2-8B
│   └── perceptionLM.py          #   Meta Perception-LM-8B
├── results/                     # Experiment Results
│   ├── cup/                     #   Per-model results on cup game
│   ├── card/                    #   Per-model results on card game
│   ├── object_count/            #   Per-model results on object count ablation
│   └── swap_count/              #   Per-model results on swap count ablation
├── SGCoT/                       # Spatiotemporal Grounded CoT training pipeline
│   ├── real_data/               #   Real tracking data (cup + card)
│   ├── generate_cup_tracks.py   #   Synthesize cup trajectories
│   ├── generate_card_tracks.py  #   Synthesize card trajectories
│   ├── generate_training_dataset.py  # Build chat-style JSONL from synthesized trajectories
│   ├── trajectory_cup.json      #   200 synthetic cup trajectories
│   ├── trajectory_card.json     #   100 synthetic card trajectories
│   ├── train.jsonl              #   Final 300-sample training set used to train Molmo2-SGCoT
│   └── train.ipynb              #   Colab notebook: QLoRA fine-tuning Molmo2 → Molmo2-SGCoT
├── generate_qa.py               # Generate MCQ evaluation pairs from metadata
├── compute_accuracy.py          # Compute per-task and overall accuracy
└── README.md

Citation

If you find this work useful, please cite:

@misc{liu2026visionlanguagemodelssolveshell,
      title={Can Vision-Language Models Solve the Shell Game?}, 
      author={Tiedong Liu and Wee Sun Lee},
      year={2026},
      eprint={2603.08436},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.08436}, 
}

License

This project is released under the MIT License.
