OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

[📜 arXiv] | [Dataset (🤗Hugging Face)] | [Dataset (OpenDataLab)]

This repository contains the official code of OHR-Bench, a benchmark designed to evaluate the cascading impact of OCR on RAG.

News

2025.6.30: Updating the results of MongkeyOCR, Nanonets-OCR-s and Azure Document Intelligence.
2025.6.26: OHR-Bench has been accepted by ICCV 2025!

Overview

PDF, gt structured data and Q&A datasets: [🤗 Hugging Face] pdfs.zip, data/retrieval_base/gt, data/qas_v2.json. It includes 8500+ unstructured PDF pages from 7 domains, including Textbook, Law, Finance, Newspaper, Manual, Academic and Administration and 8498 Q&A datasets sourced from 5 key components for OCR in document parsing, including plain text, table, formula, chart and reading order. Each PDF page is equipped with a human-verified ground truth structured data.
Perturbed data with OCR errors: [🤗 Hugging Face] formatting_noise_[mild/moderate/severe] and semantic_noise_[GOT/MinerU/Qwen2.5-VL-72B]_[mild/moderate/severe]. In order to conduct in-depth analysis of the OCR's impact on RAG, OHR-Bench identifies Semantic Noise and Formatting Noise and introduce them with mild, moderate and severe perturbation based on real-world OCR errors.
Evaluation framework: [Github opendatalab/OHR-Bench]. We provide a RAG evaluation framework to assess the impact of OCR processed structured data and our perturbed data on RAG including retrieval, generation and overall performance.

Evaluation Results

	OCR	Retrieval						Generation						Overall
	E.D.↓	TXT↑	TAB↑	FOR↑	CHA↑	RO↑	ALL↑	TXT↑	TAB↑	FOR↑	CHA↑	RO↑	ALL↑	TXT↑	TAB↑	FOR↑	CHA↑	RO↑	ALL↑
Ground Truth	-	81.6	69.8	75.2	70.3	9.8	70.4	49.3	46.0	34.0	47.0	28.2	43.8	44.9	34.6	28.0	32.9	18.7	36.0
Pipeline-based OCR
MinerU-0.9.3	0.24	68.1	48.6	51.3	16.5	5.9	50.5	45.7	39.3	28.6	9.7	29.5	36.6	41.4	28.5	23.0	9.3	17.8	29.9
Marker-1.2.3	0.28	75.5	58.2	55.5	20.0	5.9	57.0	44.4	37.8	27.8	10.9	26.2	35.9	40.1	28.1	22.3	10.0	16.2	29.4
Azure	-	78.0	59.4	55.2	45.2	5.8	60.6	41.0	37.1	27.1	22.9	27.3	35.0	37.5	28.1	22.6	15.1	17.5	28.9
End-to-end OCR
GOT	0.27	62.5	41.1	49.0	17.4	3.7	45.8	37.5	28.5	24.1	8.5	7.1	27.8	35.3	22.9	20.1	8.2	5.3	24.5
Nougat	0.34	59.5	32.8	44.3	11.3	4.4	41.2	36.6	22.9	22.9	6.4	6.9	25.5	33.5	18.4	19.4	5.8	3.6	14.5
Vision-Language Model for OCR
Qwen2.5-VL-72B	0.18	75.1	60.0	60.0	38.2	5.3	59.6	44.3	42.1	31.8	27.0	11.6	37.5	40.6	31.1	26.1	19.0	8.8	31.1
InternVL2.5-78B	0.28	68.6	57.9	55.6	45.1	2.7	56.2	41.7	41.8	29.0	33.6	3.3	35.8	38.2	31.0	23.3	22.9	3.1	29.6
olmOCR-7B-0225-preview	0.21	72.5	58.4	55.4	24.8	5.0	56.6	44.8	40.5	30.4	19.0	8.4	36.0	40.6	30.3	23.7	12.8	7.1	29.6
MonkeyOCR	-	74.6	56.5	55.5	16.5	5.7	55.9	40.3	36.5	25.9	7.9	25.0	32.8	35.4	27.3	20.7	6.5	16.3	26.7
Nanonets-OCR-s	-	71.8	59.8	57.4	43.7	4.4	58.3	38.2	36.3	28.0	25.7	7.8	32.4	34.9	27.6	22.7	18.6	7.1	27.2

Notes: The subpar performance of Azure Document Intelligence may be related to the use of Llama3.1-8B as the generator. We are currently investigating this issue and plan to update the results using more advanced LLMs as generators.

We evaluate the suitability of current OCR solutions for real-world RAG applications by conducting comprehensive experiments with our OHR-Bench. We report the generalized LCS or F1 of five types of evidence sources, including plain text (TXT), table (TAB), formula (FOR), chart (CHA), and reading order (RO).

We derive conclusions as follows:

VLMs for OCR achieve the best overall performance. Employing Qwen2.5-VL-72B achieves the best performance across all OCR solutions.
All OCR solutions suffer performance degradation. Even the best solutions show a decrease of 14% F1-score in the overall evaluation, with greater losses in the retrieval and generation stages.

Getting Started

Installation

pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121

Dataset preparation

OCR processed structured data

To evaluate your RAG system on our benchmark, follow these steps:

Download Perturbed Data: Get the data with formatting and semantic noise from the zip file in Hugging Face and unzip it. Or use the load_dataset ("opendatalab/OHR-Bench") to get the relevant fields.
Organize the Data: Place the folders retrieval_base/formatting_noise_[mild/moderate/severe] and retrieval_base/semantic_noise_[GOT/MinerU/Qwen2.5-VL-72B]_[mild/moderate/severe] in the data/retrieval_base directory of this project.
Run Evaluation: Follow the instructions in Run Evaluation.

To evaluate your OCR results using this benchmark:

Organize the Data: Do OCR with your OCR models (PDFs available on Hugging Face) and place the OCR processed structured data in the data/retrieval_base directory. Use the ground truth (data/retrieval_base/gt) data as an example. The sub-folder names indicate the domain of the parsed results, and each JSON file, named as the same of corresponding PDF files, should contain the corresponding parsed results.
Run Evaluation: Follow the instructions in Run Evaluation.

Directory Structure

retrieval_base/gt/ # We provide gt and MinerU processed structured data as illustration here
├── finance # Domain
│   ├── 3M_2023Q2_10Q.json # Parsed results
│   ├── ...
├── textbook
...

OCR Processed Data

[
    {
        "page_idx": 0, // Page index
        "text": "...", // OCR processed structured data
    },
    ...
]

QA data

The qa data is placed in data/qas_v2.json. Each JSON file should be structured as follows:

Q&A JSON

[
    {
        "doc_name": "finance/JPMORGAN_2021Q1_10Q", // Document source
        "ID": "00073cc2-c801-467c-9039-fca63c78c6a9", // Unique ID
        "questions": "What was the total amount of nonaccrual loans retained as of March 31, 2021?",
        "answers": "842",
        "doc_type": "finance", // Q&A domain.
        "answer_form": "Numeric", // Answer format.
        "evidence_source": "table", // Evidence source.
        "evidence_context": "Nonaccrual loans retained $^{(\\mathrm{a})}$ & \\$ & 842 & \\$ & 689 & $22 \\%$", // Evidence.
        "evidence_page_no": 24
    },
    ...
]

LLMs preparation

In src/configs, configure your local LLM path or GPT API.

GPT_api_key = 'You KEY Here'  # openai.api_key
...
Qwen2_7B_local_path = 'Qwen/Qwen2-7B-Instruct' # download from Hugging Face or your local path

Run Evaluation

To evaluate your OCR results, follow the instructions in the Dataset Preparation section to organize your OCR data.

# The first argument specifies which OCR results to use for evaluation.
# The second argument specifies the retrievers or LLMs.

# Args: Document source, LLM
# Generation with gt
bash shell/generation.sh gt qwen2_7b
# Generation with mild semantic noise usi (OCR=MinerU)
bash shell/generation.sh semantic_noise_MinerU_mild qwen2_7b

# Args: Document source, retriver
# Retrieval with gt
bash shell/retrieval.sh gt bge-m3
# Retrieval with moderate semantic noise (OCR=MinerU)
bash shell/retrieval.sh semantic_noise_MinerU_moderate bge-m3

# Args: Document source, retriver, LLM
# End-to-end with gt
bash shell/end2end.sh gt bge-m3 qwen2_7b
# End-to-end with severe semantic noise (OCR=MinerU)
bash shell/end2end.sh semantic_noise_MinerU_severe bge-m3 qwen2_7b

You can then use exp_scripts/exp_show.ipynb to view the results grouped by domain or evidence_source.

Acknowledgement

The evaluation framework is based on CRUD, thanks so much for this brilliant project.

Citation

@article{zhang2024ocr,
  title={OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation},
  author={Junyuan Zhang and Qintong Zhang and Bin Wang and Linke Ouyang and Zichen Wen and Ying Li and Ka-Ho Chow and Conghui He and Wentao Zhang},
  journal={arXiv preprint arXiv:2412.02592},
  year={2024}
}

Copyright Statement

The PDFs are collected from public online channels and community user contributions. Content that is not allowed for distribution has been removed. The dataset is for research purposes only and not for commercial use. If there are any copyright concerns, please contact [email protected].

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
data		data
exp_scripts		exp_scripts
figs		figs
shell		shell
src		src
.gitignore		.gitignore
README.md		README.md
evaluator.py		evaluator.py
quick_start.py		quick_start.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

News

Overview

Evaluation Results

Getting Started

Installation

Dataset preparation

OCR processed structured data

QA data

LLMs preparation

Run Evaluation

Acknowledgement

Citation

Copyright Statement

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 3

Uh oh!

Languages

opendatalab/OHR-Bench

Folders and files

Latest commit

History

Repository files navigation

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

News

Overview

Evaluation Results

Getting Started

Installation

Dataset preparation

OCR processed structured data

QA data

LLMs preparation

Run Evaluation

Acknowledgement

Citation

Copyright Statement

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 3

Uh oh!

Languages

Packages