Skip to content

liutiedong/shellgame

Repository files navigation

Can Vision-Language Models Solve the Shell Game?

Paper Project Page Dataset SGCoT Data HF Model Open In Colab Demo


VET-Bench simulates classic shell games rendered via Three.js. Humans solve this effortlessly, whereas state-of-the-art VLMs scores at random chance (~33%). Our proposed Molmo2-SGCoT, fine-tuned with just 300 synthetic trajectory samples, reaches over 90% accuracy.

Interactive Demo & Video Generation

The videos are generated using Three.js simulations that run entirely in the browser. You can play the games interactively or batch-export videos as .mp4 files for evaluation.

Demo File Description
Play cup.html Track a ball hidden beneath visually identical opaque cups that undergo positional swaps.
Play card.html Track a card after being flipped face-down and shuffled.

To play: open the HTML files in a browser and click Start Game. Watch the shuffles, then click your guess.

To generate video data: use the Batch Export panel on the right side to configure resolution, FPS, number of videos, and randomization, then export a .zip of .mp4 files with accompanying metadata JSON.

Getting Started

Download VET-Bench Videos

Clone this repo (includes videos via Git LFS)

git lfs install
git clone https://github.com/liutiedong/shellgame.git

VET-Bench also available from Hugging Face

from huggingface_hub import snapshot_download

snapshot_download(repo_id="tiedong/vetbench", repo_type="dataset", local_dir="vetbench")

Evaluate a Model

Each script under models/ follows the same pattern: load QA.json, send each video + question to APIs, and save responses to results/. For example, to evaluate Gemini:

cd models
python gemini.py

After running a model, compute accuracy with:

# Evaluate all result files in results/cup/
python compute_accuracy.py results/cup/*.json

# Evaluate specific files
python compute_accuracy.py results/cup/gemini-3-pro-preview.json results/card/gemini-3-pro-preview.json

# Show per-item details
python compute_accuracy.py -d results/cup/Molmo2-SGCoT.json

To regenerate QA.json evaluation files from the raw video metadata generated by HTML files:

# Generate QA for cup game
python generate_qa.py vetbench/cup -o vetbench/cup/QA.json

# Generate QA for card game
python generate_qa.py vetbench/card -o vetbench/card/QA.json

Project Overview

Tasks

Cup Game Card Game
Setup A ball is placed under one of 3 cups 3 playing cards shown face-up, then flipped
Action Cups are shuffled 5 times Cards are shuffled 5 times
Question Which cup contains the ball? Where is the Queen of Hearts?
Videos 50 50
Resolution 640 x 480 640 x 480
FPS 30 30
Duration ~12 seconds ~12 seconds

Results

Random-chance baseline for 3-way multiple choice: 33.3%.

Model Cup Card Overall
Gemini 3 Pro Preview 34.0 40.0 37.0
Gemini 2.5 Pro 38.0 30.0 34.0
Gemini 3 Flash Preview 30.0 30.0 30.0
Gemini 2.5 Flash 22.0 28.0 25.0
Qwen3.5-397B-A17B 38.0 32.0 35.0
Qwen3-VL-30B-A3B-Thinking 38.0 30.0 34.0
Qwen3-VL-8B-Thinking 34.0 30.0 32.0
Qwen3-VL-8B-Instruct 30.0 30.0 30.0
Qwen3-VL-30B-A3B-Instruct 24.0 32.0 28.0
Doubao-Seed-1.8 28.0 38.0 33.0
Doubao-Seed-2.0-Mini 30.0 32.0 31.0
Molmo2-8B 30.0 38.0 34.0
Perception-LM-8B 40.0 34.0 37.0
GLM-4.6V-Flash 34.0 28.0 31.0
Kimi-K2.5 28.0 32.0 30.0
ERNIE-4.5-VL-28B-A3B-Thinking 28.0 30.0 29.0

Molmo2-SGCoT: Spatiotemporal Grounded Chain-of-Thought

Data Generation

The SGCoT/ directory contains the full pipeline:

  1. Synthesize trajectories from real Molmo2 tracking data (real_data/) (generate_cup_tracks.py, generate_card_tracks.py)
  2. Build training data in chat-style JSONL format (generate_training_dataset.py)

The output format uses a <tracks> tag:

<tracks coords="0.0 1 772 524;0.5 1 805 310;...;12.0 1 216 517">the cup that contains the ball</tracks> Answer: left.
  • t: timestamp (0.0–12.0s, step 0.5s)
  • obj: object index (always 1)
  • x, y: coordinates in a [0, 1000] normalized space

The final training set contains 300 samples (200 cup + 100 card). Fine-tuning Molmo2-8B on this data yields Molmo2-SGCoT.

Training

Open In Colab

SGCoT/Molmo2-SGCoT.ipynb: finetunes Molmo2-8B using QLoRA (training on 300 samples takes only ~3 min on a single A100).

Generate Training Data

cd SGCoT
python generate_cup_tracks.py        # → trajectory_cup.json (200 samples)
python generate_card_tracks.py       # → trajectory_card.json (100 samples)
python generate_training_dataset.py  # → train.jsonl (300 samples, shuffled)

Data Format

Video Metadata (cup.json / card.json)

Each entry contains:

  • video: filename (e.g., cup_001.mp4)
  • fps, total_frames, resolution: video properties
  • task: "cup" or "card"
  • game_settings: number of objects, swap count, speed
  • ground_truth: correct final position(s)
  • initial: starting arrangement
  • intermediate: arrangement after each swap

QA Format (QA.json)

Each entry contains:

  • video: video filename
  • question: MCQ question with lettered options (A/B/C)
  • answer: correct letter (A, B, or C)

Example questions:

Cup: Which cup contains the ball at the end of the video? (A) Left (B) Middle (C) Right

Card: Where is the Queen of Hearts at the end of the video? (A) Left (B) Middle (C) Right

Model Output Format (results/*.json)

Each entry contains all fields from QA.json plus:

  • model: model identifier
  • response: raw model response text

Adding a New Model

  1. Create a new script in models/ following the existing pattern
  2. Implement analyze_video(video_path, prompt_text) -> str
  3. Load QA data from vetbench/{cup,card}/QA.json
  4. Save results to results/{cup,card}/{model_name}.json
  5. Run python compute_accuracy.py results/cup/{model_name}.json to evaluate

Repository Structure

VET-Bench/
├── cup.html                     # Interactive cup game & video generator (Three.js)
├── card.html                    # Interactive card game & video generator (Three.js)
├── vetbench/                    # Core benchmark dataset
│   ├── cup/                     #   Cup game: 50 videos + metadata + QA
│   └── card/                    #   Card game: 50 videos + metadata + QA
├── object_count/                # Ablation: videos for varying number of objects (2/4/5 cups)
├── swap_count/                  # Ablation: videos for varying number of swaps (0–4)
├── perception_test/             # Filtered Subsets for the Perception Test 
│   ├── cup_games_distinct_cups/
│   ├── cup_games_transparent_cups/
│   └── cup_games_filtered/
├── models/                      # Evaluation scripts for each VLM
│   ├── gemini.py                #   Google Gemini (2.5/3 Pro/Flash)
│   ├── qwen.py                  #   Alibaba Qwen3-VL (8B–397B)
│   ├── doubao.py                #   ByteDance Doubao-Seed
│   ├── kimi.py                  #   Moonshot Kimi-K2.5
│   ├── glm.py                   #   Zhipu GLM-4.6V-Flash
│   ├── ernie.py                 #   Baidu ERNIE-4.5-VL
│   ├── molmo2.py                #   Allen AI Molmo2-8B
│   └── perceptionLM.py          #   Meta Perception-LM-8B
├── results/                     # Experiment Results
│   ├── cup/                     #   Per-model results on cup game
│   ├── card/                    #   Per-model results on card game
│   ├── object_count/            #   Per-model results on object count ablation
│   └── swap_count/              #   Per-model results on swap count ablation
├── SGCoT/                       # Spatiotemporal Grounded CoT training pipeline
│   ├── real_data/               #   Real tracking data (cup + card)
│   ├── generate_cup_tracks.py   #   Synthesize cup trajectories
│   ├── generate_card_tracks.py  #   Synthesize card trajectories
│   ├── generate_training_dataset.py  # Build chat-style JSONL from synthesized trajectories
│   ├── trajectory_cup.json      #   200 synthetic cup trajectories
│   ├── trajectory_card.json     #   100 synthetic card trajectories
│   ├── train.jsonl              #   Final 300-sample training set used to train Molmo2-SGCoT
│   └── train.ipynb              #   Colab notebook: QLoRA fine-tuning Molmo2 → Molmo2-SGCoT
├── generate_qa.py               # Generate MCQ evaluation pairs from metadata
├── compute_accuracy.py          # Compute per-task and overall accuracy
└── README.md

Citation

If you find this work useful, please cite:

@misc{liu2026visionlanguagemodelssolveshell,
      title={Can Vision-Language Models Solve the Shell Game?}, 
      author={Tiedong Liu and Wee Sun Lee},
      year={2026},
      eprint={2603.08436},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.08436}, 
}

License

This project is released under the MIT License.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors