DocHighlight is a large-scale, high-resolution dataset for document specular highlight removal, captured with a polarization-based acquisition pipeline across diverse real-world scenarios. The dataset is detailed in "Towards Real-World Document Specular Highlight Removal: The DocHighlight Dataset and DocSHRNet Method" (PRCV 2025), and the reference implementation DocSHRNet is available at https://github.com/shallweiwei/DocSHRNet.
The dataset is available via the following links:
- 2,201 rigorously aligned highlight vs. highlight-free image pairs
- Average resolution of 2924 × 3672 (range: 1034 × 737 – 3468 × 4624)
- Covers books, magazines, multilingual text, and graphical content
- Captures real-world variation in document pose and illumination across three camera devices
- Combines polarization imaging with manual quality verification for reliable ground truth
- Non-commercial use only (CC BY-NC-SA 4.0).
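Because the dataset consists of aligned highlight / highlight-free pairs, a loader usually starts by matching the two images of each pair. The snippet below is a minimal sketch of that step, not part of the official release. It assumes a hypothetical layout in which highlight images and highlight-free images sit in two separate folders and share the same file stem; the folder names and the `list_image_pairs` helper are illustrative only, so adjust them to the actual structure of your download.

```python
from pathlib import Path

# Image extensions to consider (an assumption; extend if the download differs).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_image_pairs(highlight_dir, clean_dir):
    """Return (highlight_path, clean_path) tuples matched by file stem.

    Hypothetical helper: assumes both folders use the same stem for the
    two images of a pair, e.g. 0001.jpg in each folder.
    """
    # Index the highlight-free images by stem for constant-time lookup.
    clean_by_stem = {
        p.stem: p
        for p in Path(clean_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    }

    pairs = []
    for p in sorted(Path(highlight_dir).iterdir()):
        if p.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        match = clean_by_stem.get(p.stem)
        if match is not None:  # keep only images that have a counterpart
            pairs.append((p, match))
    return pairs
```

Under these assumptions, `len(list_image_pairs("highlight", "highlight_free"))` on a complete download should equal the 2,201 pairs listed above; a smaller count would suggest a different naming scheme or an incomplete download.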
If you find this dataset useful in your research, please consider citing our paper.