fix: prevent KeyError in reading order prediction and preserve PageElement type #139
+212
−6
Issue resolved by this Pull Request:
Resolves the unprecedented KeyError occurring during PDF conversion in reading order prediction.
Description
The reading order predictor raised `KeyError: 22` (or other indices) when processing certain PDFs. Two bugs were identified:
- The `l2r_map` chain could produce indices outside the valid range for `state.dn_map` and `state.up_map`.
- The coordinate conversion did not preserve `PageElement` objects: it converted them to `BoundingBox`, losing `cid`, `label`, and other critical attributes.

Changes:
- Added a bounds check before dictionary access in `_init_ud_maps`.
- Fixed the coordinate conversion to preserve type.
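The two changes can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: `Box`, `Element`, `resolve_and_link`, and `flip_vertically` are hypothetical stand-ins, and the maps are plain dictionaries rather than the real predictor state. It only shows the general idea of guarding a dictionary access after following an index chain, and of converting coordinates without dropping the element's type.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a plain bounding box."""
    l: float
    t: float
    r: float
    b: float


@dataclass(frozen=True)
class Element(Box):
    """Hypothetical stand-in for a page element: a box plus metadata."""
    cid: int = 0
    label: str = "text"


def resolve_and_link(
    l2r_map: Dict[int, int],
    dn_map: Dict[int, List[int]],
    up_map: Dict[int, List[int]],
    i: int,
    j: int,
) -> bool:
    """Follow the l2r chain from i, then link i -> j only if both keys exist."""
    seen = set()
    # Walk the chain; the 'seen' set is an extra safeguard against cycles.
    while i in l2r_map and i not in seen:
        seen.add(i)
        i = l2r_map[i]
    # Bounds check: the chain may end on an index the maps do not contain.
    if i not in dn_map or j not in up_map:
        return False
    dn_map[i].append(j)
    up_map[j].append(i)
    return True


def flip_vertically(elem: Element, page_height: float) -> Element:
    """Convert coordinates while keeping the element's own type and metadata."""
    # dataclasses.replace builds a new object of the same class as elem,
    # so cid and label are carried over instead of being lost.
    return replace(elem, t=page_height - elem.t, b=page_height - elem.b)
```

With this shape, a chain that ends on an unknown index (for example `22`) is skipped rather than raising `KeyError`, and the converted element still exposes `cid` and `label`.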
Checklist:
Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
- astral.sh
  - curl -LsSf REDACTED (dns block)
- huggingface.co
  - /home/REDACTED/work/docling-ibm-models/docling-ibm-models/.venv/bin/python /home/REDACTED/work/docling-ibm-models/docling-ibm-models/.venv/bin/pytest tests/test_reading_order.py -v --tb=short (dns block)
  - /home/REDACTED/work/docling-ibm-models/docling-ibm-models/.venv/bin/python /home/REDACTED/work/docling-ibm-models/docling-ibm-models/.venv/bin/pytest tests/ -k not test_readingorder -x --tb=short (dns block)

If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
This section details the original issue you should resolve
<issue_title>Unprecedented KeyError occuring</issue_title>
<issue_description>### Bug
I was using docling to convert a PDF file and noticed a KeyError originating from docling at the line
doc = converter.convert(pdf_path)
where `pdf_path` is the path to the PDF file.
The error occurs for the same PDF each time, but the number reported in the KeyError differs between runs.
...
Steps to reproduce
Unfortunately, I can't share the PDF.
...
Docling version
Docling version: 2.58.0
Docling Core version: 2.49.0
Docling IBM Models version: 3.10.1
Docling Parse version: 4.7.0
Python: cpython-312 (3.12.3)
Platform: Linux-6.14.0-1012-aws-x86_64-with-glibc2.39
...
Python version
File "home/src/extractors/pdf_markdown.py", line 79, in process_single_page
doc = converter.convert(pdf_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 39, in wrapper_function
return wrapper(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 136, in __call__
res = self.pydantic_validator.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/document_converter.py", line 237, in convert
return next(all_res)
^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/document_converter.py", line 260, in convert_all
for conv_res in conv_res_iter:
File "home/path/venv./lib/python3.12/site-packages/docling/document_converter.py", line 332, in _convert
for item in map(
File "home/path/venv./lib/python3.12/site-packages/docling/document_converter.py", line 379, in _process_document
conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/document_converter.py", line 402, in _execute_pipeline
conv_res = pipeline.execute(in_doc, raises_on_error=raises_on_error)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/pipeline/base_pipeline.py", line 80, in execute
raise e
File "home/path/venv./lib/python3.12/site-packages/docling/pipeline/base_pipeline.py", line 73, in execute
conv_res = self._assemble_document(conv_res)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/pipeline/standard_pdf_pipeline.py", line 153, in _assemble_document
conv_res.document = self.reading_order_model(conv_res)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/models/readingorder_model.py", line 410, in __call__
sorted_elements = self.ro_model.predict_reading_order(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling_ibm_models/reading_order/reading_order_rb.py", line 108, in predict_reading_order
page_to_elems[page_no] = self._predict_page(elems)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling_ibm_models/reading_order/reading_order_rb.py", line 239, in _predict_page
self._init_ud_maps(page_elements)
File "home/path/venv./lib/python3.12/site-packages/docling_ibm_models/reading_order/reading_order_rb.py", line 366, in _init_ud_maps
self.dn_map[i].append(j)
~~~~~~~~~~~^^^
KeyError: 22
...
</issue_description>
<agent_instructions># Repository Guidelines
## Project Structure & Module Organization
Docling evaluation toolkit lives in `docling_eval/`. Typer CLI entry points sit in `cli/`; dataset ingestion under `dataset_builders/`; metric logic in `evaluators/`; helpers come via `aggregations/` and `utils/`. Provider adapters (AWS, Azure, Google) live in `prediction_providers/`. Benchmark configs and sample outputs are under `benchmarks/`, while reference docs live in `docs/`. Tests mirror runtime modules in `tests/`, reusing `tests/data/` fixtures and the git-ignored `tests/scratch/` for generated artifacts.

Important tools to handle CVAT XML annotations are under `cvat_tools/`, and more tools to manage a full annotation campaign and evaluation under `campaign_tools` (relying heavily on `cvat_tools`).

## Build, Test & Development Commands
Install dependencies with uv:
`uv sync` seeds the virt...