The official implementation of the paper "Docopilot: Improving Multimodal Models for Document-Level Understanding".
- We construct Doc-750K, the first large-scale, high-quality dataset for document-level multimodal understanding, with 758K QA pairs covering 9 task types.
- We propose Docopilot, a native document-level VLM that outperforms existing methods and Gemini-1.5-Pro on MMLongBench-Doc, making it the closest open-source model to GPT-4o.
- Docopilot achieves much lower inference latency than RAG-based methods, and when combined with RAG, its performance further improves, showing that RAG effectively enhances its retrieval and reasoning.
- Release Evaluation Code
- Release Training Code
- Release Doc-750K
- Release Docopilot Checkpoints
Download Doc-750K (requires about 1.5 TB of disk space):
```bash
mkdir data
cd data

# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/Doc-750K --local-dir Doc-750K --repo-type dataset

# unzip each images folder
cd Doc-750K/openreview
unzip images.zip
cd ../generated
unzip images.zip
cd ../arxivqa
unzip images.zip
cd ../scihub
unzip images.zip
```
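After unzipping, a quick optional check can confirm that each subset extracted completely. The subset names below are taken from the unzip steps above; run this from the `data/` directory.

```bash
# Optional sanity check: report the on-disk size of each Doc-750K subset
# after extraction (run from the data/ directory).
for subset in openreview generated arxivqa scihub; do
  du -sh "Doc-750K/$subset"
done
```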
Follow this link to prepare your own training data.
Notice: put the meta in a single JSON file, similar to playground/Doc-750K.json (a sketch of a possible entry is shown below).
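As a rough illustration, the meta file maps each dataset name to its image root and annotation file. The file name and field names below are assumptions based on common InternVL-style meta files; treat playground/Doc-750K.json as the authoritative schema.

```bash
# Hypothetical meta file for a custom dataset; the field names ("root",
# "annotation", "repeat_time", "length") are assumed -- compare with
# playground/Doc-750K.json before training.
cat > playground/my_custom_data.json << 'EOF'
{
  "my_custom_data": {
    "root": "data/my_custom_data/images",
    "annotation": "data/my_custom_data/annotations.jsonl",
    "repeat_time": 1,
    "length": 10000
  }
}
EOF
```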
Our models are fine-tuned from InternVL2-2B and InternVL2-8B. Please download these model weights and place them in the `pretrained/` folder.
| model name | type | download | size |
| --- | --- | --- | --- |
| InternVL2-2B | VLM | 🤗 HF link | 4.4 GB |
| InternVL2-8B | VLM | 🤗 HF link | 16 GB |
```bash
mkdir pretrained
cd pretrained/

# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-8B --local-dir InternVL2-8B
```
```bash
sh shell/slurm_train_example.sh
```
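The example script targets a Slurm cluster, so you may need to set cluster-specific variables before launching. The variable names below (PARTITION, GPUS) are assumptions based on common InternVL-style training scripts; check shell/slurm_train_example.sh for the ones it actually reads.

```bash
# Hypothetical Slurm launch; PARTITION and GPUS are assumed variable names.
# Verify them against shell/slurm_train_example.sh before use.
PARTITION=your_partition GPUS=8 sh shell/slurm_train_example.sh
```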
| model name | type | download | size |
| --- | --- | --- | --- |
| Docopilot-2B | VLM | 🤗 HF link | 4.4 GB |
| Docopilot-8B | VLM | 🤗 HF link | 16 GB |
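For reference, the checkpoints can be fetched the same way as the pretrained models above. The Hugging Face repo ids below are assumptions that follow the OpenGVLab naming used earlier; use the HF links in the table if they differ.

```bash
# Repo ids assumed to mirror the OpenGVLab naming above; confirm via the HF links in the table.
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/Docopilot-2B --local-dir Docopilot-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/Docopilot-8B --local-dir Docopilot-8B
```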
If you find this work helpful in your research, please consider citing:
```bibtex
@inproceedings{duan2025docopilot,
  title={Docopilot: Improving Multimodal Models for Document-Level Understanding},
  author={Duan, Yuchen and Chen, Zhe and Hu, Yusong and Wang, Weiyun and Ye, Shenglong and Shi, Botian and Lu, Lewei and Hou, Qibin and Lu, Tong and Li, Hongsheng and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={4026--4037},
  year={2025}
}
```