
Docopilot: Improving Multimodal Models for Document-Level Understanding

The official implementation of the CVPR 2025 paper "Docopilot: Improving Multimodal Models for Document-Level Understanding".

📕 Overview

  • We construct Doc-750K, the first large-scale, high-quality dataset for document-level multimodal understanding, with 758K QA pairs covering 9 task types.

  • We propose Docopilot, a native document-level VLM that outperforms existing methods and Gemini-1.5-Pro on MMLongBench-Doc, making it the open-source model closest to GPT-4o.

  • Docopilot achieves much lower inference latency than RAG-based methods. When combined with RAG, its performance improves further, showing that RAG effectively enhances its retrieval and reasoning.

🗓️ Schedule

  • Release Evaluation Code
  • Release Training Code
  • Release Doc-750K
  • Release Docopilot Checkpoints

⚙️ Data Preparation

Download Doc-750K (requires about 1.5 TB of disk space):

mkdir data
cd data
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/Doc-750K --local-dir Doc-750K --repo-type dataset

# unzip the images folder in each subset
for subset in openreview generated arxivqa scihub; do
    (cd "Doc-750K/$subset" && unzip images.zip)
done

Customize your own training data (optional)

Follow this link to prepare your own training data.

Note: put the meta information in a single JSON file, similar to playground/Doc-750K.json; a sketch is shown below.
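
As a rough sketch, a meta file following the InternVL fine-tuning convention (which Docopilot builds on) might look like the following. The dataset name, paths, and length here are illustrative placeholders, not values from this repository, so verify the exact fields against playground/Doc-750K.json:

{
  "my_doc_dataset": {
    "root": "data/my_doc_dataset/images/",
    "annotation": "data/my_doc_dataset/train.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 100000
  }
}

Here root is the image directory, annotation is the conversation annotation file, and length is the number of samples in the dataset. JSON does not allow comments, so replace every placeholder value with your own before training.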

🔥 Supervised Finetuning

Pretrained Model Preparation

Our models are finetuned from InternVL2-2B and InternVL2-8B. Please download these model weights and place them in the pretrained/ folder.

| model name | type | download | size |
|:-----------|:-----|:---------|:-----|
| InternVL2-2B | VLM | [🤗 HF link](https://huggingface.co/OpenGVLab/InternVL2-2B) | 4.4 GB |
| InternVL2-8B | VLM | [🤗 HF link](https://huggingface.co/OpenGVLab/InternVL2-8B) | 16 GB |

mkdir pretrained
cd pretrained/
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-8B --local-dir InternVL2-8B

Training

sh shell/slurm_train_example.sh
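
Note that shell/slurm_train_example.sh is a Slurm launcher. In the InternVL codebase that Docopilot builds on, such scripts usually read the cluster partition and GPU count from environment variables; the variable names below (PARTITION, GPUS) are assumptions, so verify them against the script before launching:

# hypothetical invocation; PARTITION and GPUS are assumed variable names, check the script
PARTITION=your_partition GPUS=8 sh shell/slurm_train_example.sh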

📦 Model Zoo

| model name | type | download | size |
|:-----------|:-----|:---------|:-----|
| Docopilot-2B | VLM | 🤗 HF link | 4.4 GB |
| Docopilot-8B | VLM | 🤗 HF link | 16 GB |
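
Once the checkpoints are released (see the schedule above), downloading them should mirror the pretrained-model step. The repo ids below are assumptions inferred from the model names, so confirm them via the HF links in the table:

# hypothetical repo ids, assuming the checkpoints are published under OpenGVLab
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/Docopilot-2B --local-dir Docopilot-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/Docopilot-8B --local-dir Docopilot-8B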

🖊️ Citation

If you find this work helpful in your research, please consider citing:

@inproceedings{duan2025docopilot,
  title={Docopilot: Improving Multimodal Models for Document-Level Understanding},
  author={Duan, Yuchen and Chen, Zhe and Hu, Yusong and Wang, Weiyun and Ye, Shenglong and Shi, Botian and Lu, Lewei and Hou, Qibin and Lu, Tong and Li, Hongsheng and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={4026--4037},
  year={2025}
}
