Bug
A challenging PDF has a very complicated object layer structure. On page 7, several paragraphs like the one shown below are all rendered in the wrong order.

Steps to reproduce
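from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("https://avys.omu.edu.tr/storage/app/public/seden.dogan/129943/tour%20book%20India.pdf")
doc = result.document
# Export only page 7, where the misordered paragraphs appear
print(doc.export_to_markdown(page_no=7))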
Note this text:
, the open air art gallery of Rajasthan, is a quaint village in the heart of Shekhawati region. Mandawa has preserved fine examples of frescoes and painted havelis, as masterpieces of art from bygone era. MANDAWA
But it should be:
MANDAWA, the open air art gallery of Rajasthan, is a quaint village in the heart of Shekhawati region. Mandawa has preserved fine examples of frescoes and painted havelis, as masterpieces of art from bygone era.
The word "MANDAWA" is placed at the end of the paragraph instead of at the start.
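To check for this misordering without reading the output by eye, a small test like the following could be run against the exported Markdown. This is only a sketch: the helper name `leading_token_misplaced` is hypothetical and not part of Docling, and the two strings are copied from the output quoted above.

```python
# Hypothetical helper (not part of Docling): flags a paragraph whose
# expected opening word is present but does not open the paragraph.
def leading_token_misplaced(paragraph: str, token: str) -> bool:
    text = paragraph.strip()
    return token in text and not text.startswith(token)


# Paragraph as currently exported (copied from the report above).
ACTUAL = (
    ", the open air art gallery of Rajasthan, is a quaint village in the "
    "heart of Shekhawati region. Mandawa has preserved fine examples of "
    "frescoes and painted havelis, as masterpieces of art from bygone era. "
    "MANDAWA"
)

# Paragraph as it should be exported.
EXPECTED = (
    "MANDAWA, the open air art gallery of Rajasthan, is a quaint village in "
    "the heart of Shekhawati region. Mandawa has preserved fine examples of "
    "frescoes and painted havelis, as masterpieces of art from bygone era."
)

if __name__ == "__main__":
    print("actual misordered:", leading_token_misplaced(ACTUAL, "MANDAWA"))
    print("expected misordered:", leading_token_misplaced(EXPECTED, "MANDAWA"))
```

The match is case-sensitive on purpose, so the later mixed-case "Mandawa" in the same paragraph does not count as the opening word.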
Using (default) OCR does not resolve the problem:
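from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=PdfPipelineOptions(
                do_ocr=True,
                force_full_page_ocr=True,
            )
        )
    }
)
result = converter.convert("https://avys.omu.edu.tr/storage/app/public/seden.dogan/129943/tour%20book%20India.pdf")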
Docling version
phil@Phils-MacBook-Pro docling % uv run docling --version
Docling version: 2.43.0
Docling Core version: 2.43.1
Docling IBM Models version: 3.9.0
Docling Parse version: 4.1.0
Python: cpython-312 (3.12.10)
Platform: macOS-15.5-arm64-arm-64bit
Python version
phil@Phils-MacBook-Pro docling % uv run python --version
Python 3.12.10