wrong text order in paragraph #2016

@mieslep

Bug

A challenging PDF with a very complicated object layer/construction has a number of paragraphs like this one on page 7, all of which are rendered in the wrong order:

Image

Steps to reproduce

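from docling.document_converter import DocumentConverter

# Convert the PDF with the default pipeline
converter = DocumentConverter()
result = converter.convert("https://avys.omu.edu.tr/storage/app/public/seden.dogan/129943/tour%20book%20India.pdf")
doc = result.document

# Export only page 7, where the misordered paragraphs appear
print(doc.export_to_markdown(page_no=7))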

Note this text in the output:

, the open air art gallery of Rajasthan, is a quaint village in the heart of Shekhawati region. Mandawa has preserved fine examples of frescoes and painted havelis, as masterpieces of art from bygone era. MANDAWA

But it should be:

MANDAWA, the open air art gallery of Rajasthan, is a quaint village in the heart of Shekhawati region. Mandawa has preserved fine examples of frescoes and painted havelis, as masterpieces of art from bygone era.

The word "MANDAWA" ends up at the end of the paragraph rather than at the front.
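To make the symptom easy to check in a script, here is a minimal sketch. `lead_word_misplaced` is a hypothetical helper, not part of Docling, and it assumes the paragraph has already been pulled out of the exported markdown as a plain string:

```python
def lead_word_misplaced(paragraph: str, lead_word: str) -> bool:
    """Return True if lead_word ends the paragraph instead of starting it.

    Hypothetical helper for illustration only; not a Docling API.
    """
    text = paragraph.strip()
    return text.endswith(lead_word) and not text.startswith(lead_word)


# Paragraph as currently exported (lead word pushed to the end)
observed = (
    ", the open air art gallery of Rajasthan, is a quaint village in the "
    "heart of Shekhawati region. Mandawa has preserved fine examples of "
    "frescoes and painted havelis, as masterpieces of art from bygone era. "
    "MANDAWA"
)

# Paragraph as it reads in the PDF
expected = (
    "MANDAWA, the open air art gallery of Rajasthan, is a quaint village in "
    "the heart of Shekhawati region. Mandawa has preserved fine examples of "
    "frescoes and painted havelis, as masterpieces of art from bygone era."
)

print(lead_word_misplaced(observed, "MANDAWA"))
print(lead_word_misplaced(expected, "MANDAWA"))
```

The check is deliberately narrow: it only flags the specific pattern seen here (the leading word moved to the very end) and says nothing about other reading-order problems on the page.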

Using (default) OCR does not resolve the problem:

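from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Same conversion, with OCR enabled in the PDF pipeline options
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=PdfPipelineOptions(
                do_ocr=True,
                force_full_page_ocr=True,
            )
        )
    }
)
result = converter.convert("https://avys.omu.edu.tr/storage/app/public/seden.dogan/129943/tour%20book%20India.pdf")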

Docling version

phil@Phils-MacBook-Pro docling % uv run docling --version
Docling version: 2.43.0
Docling Core version: 2.43.1
Docling IBM Models version: 3.9.0
Docling Parse version: 4.1.0
Python: cpython-312 (3.12.10)
Platform: macOS-15.5-arm64-arm-64bit

Python version

phil@Phils-MacBook-Pro docling % uv run python --version
Python 3.12.10
