Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang
In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
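For intuition about the architecture, the sketch below is a minimal illustration (not the released training code; the function name and block sizes are invented for this example) of the block-causal attention pattern such a transformer can use: tokens attend bidirectionally within each text or image block, while blocks attend only to themselves and to earlier blocks in the interleaved sequence.

import torch

def block_causal_mask(block_sizes: list[int]) -> torch.Tensor:
    """Boolean mask: True marks positions a query token may attend to."""
    total = sum(block_sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for size in block_sizes:
        end = start + size
        # A block attends to itself (bidirectionally) and to all earlier blocks.
        mask[start:end, :end] = True
        start = end
    return mask

# Toy interleaved sequence [text_1, image_1, text_2, image_2]:
print(block_causal_mask([3, 4, 3, 4]))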
- 15 Mar, 2026: Released the Multi-turn Session image Editing Benchmark (MSE-Bench).
- 6 Jan, 2026: Released the VINCIE-7B checkpoint (full attention).
- 6 Sep, 2025: Released the VINCIE-3B checkpoint (full attention).
- 25 Aug, 2025: Released the official website and the inference code.
- 23 Aug, 2025: Released the VINCIE-10M dataset.
- 12 Jun, 2025: Released the VINCIE technical report (arXiv:2506.10941).
1️⃣ Set up environment
git clone https://github.com/ByteDance-Seed/VINCIE
cd VINCIE
conda create -n vincie python=3.10 -y
conda activate vincie
pip install -r requirements.txt
pip install flash_attn==2.6.3 --no-build-isolation

2️⃣ Download pretrained checkpoint
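The snippet below pulls the VINCIE-3B weights from the Hugging Face Hub. To use the larger checkpoint instead, swap in VINCIE-7B (released 6 Jan, 2026; presumably under the repo id ByteDance-Seed/VINCIE-7B) and update save_dir to match.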
from huggingface_hub import snapshot_download

save_dir = "ckpt/VINCIE-3B"
repo_id = "ByteDance-Seed/VINCIE-3B"
cache_dir = save_dir + "/cache"

# Download the full checkpoint into save_dir (HF cache kept under save_dir/cache).
snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
)

3️⃣ Multi-turn image editing

turn1="Lower the pineapple beside her face, and change it to a smaller one."
turn2="Add a crown to the woman's head. "
turn3="Change the woman’s expression so that she is laughing."
turn4="Change the background to a pastel gradient of blue and lavender."
turn5="Add a colorful bird hovering above the crown."
input_img=assets/woman_pineapple.png
output_dir=output/woman_pineapple
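With the five turns defined, a single command runs the whole session; each generated image is fed back into the context for the next instruction: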
python main.py configs/generate.yaml \
generation.positive_prompt.image_path="[\"$input_img\"]" \
generation.positive_prompt.prompts="[\"$turn1\", \"$turn2\", \"$turn3\", \"$turn4\", \"$turn5\"]" \
generation.output.dir=$output_dir

4️⃣ Multi-concept composition

p1="<IMG1>: "; p2="<IMG2>: "; p3="<IMG3>: "; p4="<IMG4>: "; p5="<IMG5>: "
p6="Based on <IMG0>, <IMG1>, <IMG2>, <IMG3>, <IMG4>, and <IMG5>, A smiling multi-generational family including the father in <IMG0>, mother in <IMG1>, son in <IMG2>, daughter in <IMG3>, dog in <IMG4>, and cat in <IMG5>, poses for a portrait amidst the sunlit trees and ferns of a forest. Output <IMG6>: "
img0="./assets/father.png"; img1="./assets/mother.png"; img2="./assets/son.png"; img3="./assets/daughter.png"; img4="./assets/dog1.png"; img5="./assets/cat.png";
output_dir=output/family
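Here the composition prompt references the inputs through explicit <IMG0>–<IMG5> tags and asks the model to output <IMG6>; generation.pad_img_placehoder=False (flag name as it appears in the config) presumably disables automatic placeholder insertion, since the tags are written out by hand: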
python main.py configs/generate.yaml \
generation.pad_img_placehoder=False \
generation.positive_prompt.image_path="[\"$img0\", \"$img1\", \"$img2\", \"$img3\", \"$img4\", \"$img5\"]" \
generation.positive_prompt.prompts="[\"$p1\", \"$p2\", \"$p3\", \"$p4\", \"$p5\", \"$p6\"]" \
generation.output.dir=$output_dir

Citation

@article{qu2025vincie,
title = {VINCIE: Unlocking In-context Image Editing from Video},
author = {Qu, Leigang and Cheng, Feng and Yang, Ziyan and Zhao, Qi and Lin, Shanchuan and Shi, Yichun and Li, Yicong and Wang, Wenjie and Chua, Tat-Seng and Jiang, Lu},
journal = {arXiv preprint arXiv:2506.10941},
year = {2025}
}

License

This project is licensed under the Apache-2.0 License, subject to any intellectual property rights in the model owned by ByteDance. The text encoder of the model is adapted from Qwen-14B, and your use of that model must comply with its license.
