MultiOCR-QA is a large-scale multilingual question answering (QA) dataset designed to evaluate how OCR noise (insertions, deletions, and substitutions) affects Large Language Models (LLMs). Unlike standard QA datasets, MultiOCR-QA provides both RawOCR (noisy OCR text) and CorrectedOCR (ground-truth text) for every sample, enabling direct measurement of robustness and controlled testing of noise-mitigation strategies.
- 50,079 QA pairs across English, French, and German.
- Derived from centuries-old historical documents (via the ICDAR 2019 post-OCR text correction dataset).
- Each sample includes both RawOCR and CorrectedOCR contexts.
✅ Dual OCR Contexts: Direct comparison between noisy and clean text for every QA pair.
✅ Fine-grained Noise Profiling: Error categories (insertions, deletions, substitutions) and low/medium/high noise levels (see the sketch after this list).
✅ Multilingual & Historical: Covers EN/FR/DE historical corpora with diverse OCR challenges.
✅ Robustness Benchmark: Evaluates state-of-the-art LLMs under realistic OCR distortions.
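A rough way to reproduce this kind of noise profiling is to compute a character error rate (CER) between the two context variants and bucket it. The sketch below does exactly that; the pure-Python edit distance and the bucket thresholds are illustrative assumptions, not the exact procedure used to build the dataset.

```python
# Sketch: profile OCR noise via character error rate (CER).
# NOTE: the thresholds below are assumptions for illustration,
# not the exact cut-offs used in MultiOCR-QA.

def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def noise_level(raw_ocr: str, corrected_ocr: str) -> str:
    """Bucket a paragraph into low/medium/high noise by CER."""
    cer = levenshtein(raw_ocr, corrected_ocr) / max(len(corrected_ocr), 1)
    if cer < 0.05:   # assumed "low" cut-off
        return "low"
    if cer < 0.15:   # assumed "medium" cut-off
        return "medium"
    return "high"

print(noise_level("Tbe qnick brown f0x", "The quick brown fox"))  # high (CER ~ 0.16)
```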
Introduction of MultiOCR-QA
- First large-scale multilingual QA dataset for systematic OCR-noise evaluation.
- Features 50K QA pairs with paired noisy/clean contexts.
Comprehensive Model Evaluation
- Benchmarked Qwen, LLaMA, Gemma, and Mixtral across EN/FR/DE.
- Shows consistent performance degradation on RawOCR compared with CorrectedOCR contexts.
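As an illustration of how this gap can be measured, the sketch below scores the same model over both context variants and takes the exact-match difference. The `answer_fn` callable is a stand-in for any of the benchmarked LLMs (an assumption of this sketch); the field names follow the data structure shown further down.

```python
# Sketch: measure the exact-match gap between clean and noisy contexts.
# `answer_fn(context, question) -> str` is a placeholder for any LLM;
# plug in your own model call.

def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return pred.strip().lower() == gold.strip().lower()

def em_score(answer_fn, documents, context_key):
    """Exact-match accuracy over all QA pairs, reading the context from
    either "rawOCR_text" or "correctedOCR_text"."""
    correct = total = 0
    for doc in documents:
        context = doc[context_key]
        for qa in doc["QA_pairs"]:
            correct += exact_match(answer_fn(context, qa["question"]), qa["answer"])
            total += 1
    return correct / max(total, 1)

# degradation = em_score(answer_fn, docs, "correctedOCR_text") \
#             - em_score(answer_fn, docs, "rawOCR_text")
```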
Mitigation Strategies
- Explored context correction (fixing noisy passages before QA).
- Compared it with answer correction (post-processing generated answers).
- Finding: correcting the context up front is more effective than fixing answers afterward.
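Both strategies can be sketched around a generic `llm(prompt) -> str` callable; the callable and the prompt wording below are illustrative assumptions, not the paper's exact setup. The only difference between the two pipelines is whether the correction step runs on the context or on the answer, which keeps the comparison clean.

```python
# Sketch of the two mitigation strategies. `llm(prompt) -> str` is a
# placeholder for any instruction-tuned model.

def answer_question(llm, context: str, question: str) -> str:
    return llm(f"Context: {context}\n\nQuestion: {question}\nAnswer:")

def context_correction(llm, raw_context: str, question: str) -> str:
    """Strategy 1: repair the noisy passage first, then answer."""
    fixed = llm("Correct the OCR errors in the following text, changing "
                f"nothing else:\n\n{raw_context}")
    return answer_question(llm, fixed, question)

def answer_correction(llm, raw_context: str, question: str) -> str:
    """Strategy 2: answer over the noisy passage, then repair the answer."""
    noisy_answer = answer_question(llm, raw_context, question)
    return llm("The following short answer may contain OCR-induced "
               f"errors; correct them:\n\n{noisy_answer}")
```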
|  | English | French | German |
|---|---|---|---|
| #QA pairs | 875 | 10,004 | 39,200 |
| #Paragraphs | 123 | 1,670 | 9,075 |
| Average CorrectedOCR paragraph length (words) | 271.73 | 297.53 | 212.86 |
| Average RawOCR paragraph length (words) | 263.46 | 335.73 | 193.23 |
| Average question length (words) | 8.60 | 8.73 | 8.08 |
| Average answer length (words) | 2.05 | 3.12 | 5.63 |
| Average questions per paragraph | 7.11 | 5.99 | 4.32 |
Data Structure:

```json
{
  "document_id": "",
  "rawOCR_text": "",
  "correctedOCR_text": "",
  "QA_pairs": [
    {
      "q_id": "",
      "question": "",
      "answer": ""
    }
  ]
}
```
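Records in this schema are straightforward to flatten into per-question examples. The sketch below assumes the records sit in a local `multiocr_qa.json` file (a placeholder filename):

```python
import json

# "multiocr_qa.json" is a placeholder path for a file of records in
# the schema above.
with open("multiocr_qa.json", encoding="utf-8") as f:
    documents = json.load(f)

# Flatten each paragraph into one example per question, keeping both
# context variants side by side.
examples = [
    {
        "document_id": doc["document_id"],
        "raw_context": doc["rawOCR_text"],
        "clean_context": doc["correctedOCR_text"],
        "q_id": qa["q_id"],
        "question": qa["question"],
        "answer": qa["answer"],
    }
    for doc in documents
    for qa in doc["QA_pairs"]
]
print(f"{len(examples)} QA examples")
```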
The dataset is available on the HuggingFace Hub.
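Loading it then follows the standard `datasets` pattern. The repository id below is a placeholder, not the dataset's actual id:

```python
from datasets import load_dataset

# "ORG/MultiOCR-QA" is a placeholder repository id; substitute the
# dataset's actual id on the HuggingFace Hub.
ds = load_dataset("ORG/MultiOCR-QA")
print(ds)
```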
Training noise-resilient LLMs
- Improve robustness against OCR inaccuracies by exposing models to paired RawOCR and CorrectedOCR contexts (a minimal sketch follows).
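A minimal sketch of such exposure, assuming a simple prompt/completion fine-tuning format (the field names follow the dataset schema; the format itself is an assumption):

```python
# Sketch: build training pairs that show the model each question with
# both its noisy and its clean context, so it learns to answer robustly.
def build_training_pairs(documents):
    pairs = []
    for doc in documents:
        for qa in doc["QA_pairs"]:
            for context in (doc["rawOCR_text"], doc["correctedOCR_text"]):
                pairs.append({
                    "prompt": f"Context: {context}\n\nQuestion: {qa['question']}\nAnswer:",
                    "completion": qa["answer"],
                })
    return pairs
```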
Error correction research
- Develop and evaluate correction pipelines that fix OCR errors while preserving the archaic language structure of historical documents.
Multilingual robustness
- Expand LLMs’ capabilities beyond English by training and evaluating on English, French, and German OCR text.
Digital humanities & archives
- Enhance accessibility of centuries-old documents by enabling robust QA over noisy digitized collections.
Generalizable NLP research
- Use OCR noise as a case study for broader robustness, perturbation, and domain shift evaluations.
This project is licensed under the MIT License - see the LICENSE file for details.
If you find this work useful, please cite 📜 our paper:
Piryani, B., Mozafari, J., Abdallah, A., Doucet, A., & Jatowt, A. (2025). Evaluating Robustness of LLMs in Question Answering on Multilingual Noisy OCR Data. arXiv preprint arXiv:2502.16781
```bibtex
@article{piryani2025multiocr,
  title={Evaluating Robustness of LLMs in Question Answering on Multilingual Noisy OCR Data},
  author={Piryani, Bhawna and Mozafari, Jamshid and Abdallah, Abdelrahman and Doucet, Antoine and Jatowt, Adam},
  journal={arXiv preprint arXiv:2502.16781},
  year={2025}
}
```
Thanks to our contributors and the University of Innsbruck for supporting this project.