SmolDocling has layout problems on bilingual PDF #2014

@mieslep

Bug

On META-NET PDFs (example: Bulgarian), SmolDocling correctly finds the text in the document but fails to detect the title and section headers, fails to follow the bilingual reading order (where columns span pages), and fails to pick up a paragraph that continues onto the next page.

The PDFs are authored in LaTeX (example: Bulgarian).

Note that there are 30 languages available, all with the same format, though built with different LaTeX approaches. https://european-language-equality.eu/meta-net-white-paper-series/

Steps to reproduce

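import platform

from IPython.display import display  # only needed outside a Jupyter notebook

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (  # Additional imports for VLM
    VlmPipelineOptions,
    smoldocling_vlm_conversion_options,
    smoldocling_vlm_mlx_conversion_options,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline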

if (
    "darwin" in platform.system().lower()
    and "arm64" in platform.machine().lower()
):  # optimized for Apple Silicon (MLX)
    vlm_options = smoldocling_vlm_mlx_conversion_options
else:
    vlm_options = smoldocling_vlm_conversion_options

vlm_pipeline_options = VlmPipelineOptions(
    force_backend_text=False,
    vlm_options=vlm_options,
)

converter_vlm = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=vlm_pipeline_options,
            pipeline_cls=VlmPipeline,
        )
    }
)

vlm_result = converter_vlm.convert("https://european-language-equality.eu/wp-content/uploads/2024/12/bulgarian.pdf")
vlm_doc = vlm_result.document

from docling_core.transforms.visualizer.layout_visualizer import LayoutVisualizer

layout_visualizer = LayoutVisualizer()
page_images = layout_visualizer.get_visualization(doc=vlm_doc)

display(page_images[1])  # title is in two languages, split down the middle
display(page_images[4])  # two-column bilingual layout
display(page_images[5])  # two-column bilingual layout continues
display(page_images[8])  # very long paragraph continues to next page
display(page_images[9])  # page the paragraph continues onto
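
The missing title/section-header detection can also be checked without looking at the page images by tallying the labels of the converted items. The following is a minimal sketch, not part of the original report: `summarize_labels` and `missing_labels` are hypothetical helpers, and the commented usage assumes that `vlm_doc.iterate_items()` yields `(item, level)` pairs whose items expose a `label`, with `"title"` and `"section_header"` as the label values of interest.

```python
from collections import Counter
from typing import Iterable


def summarize_labels(labels: Iterable[object]) -> Counter:
    """Count how often each layout label occurs (hypothetical helper)."""
    # Enum-style labels carry the raw string in `.value`; plain strings are used as-is.
    return Counter(str(getattr(label, "value", label)) for label in labels)


def missing_labels(counts: Counter, expected=("title", "section_header")) -> list:
    """Return the expected labels that never occur in the document."""
    return [name for name in expected if counts[name] == 0]


# Usage with the document converted above (assumes `vlm_doc` exists):
# counts = summarize_labels(item.label for item, _level in vlm_doc.iterate_items())
# print(counts.most_common())
# print("never detected:", missing_labels(counts))
```

If the behaviour described above holds, `missing_labels` would be expected to report both `title` and `section_header` for the Bulgarian PDF.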

Docling version

phil@Phils-MacBook-Pro docling % uv run docling --version
Docling version: 2.43.0
Docling Core version: 2.43.1
Docling IBM Models version: 3.9.0
Docling Parse version: 4.1.0
Python: cpython-312 (3.12.10)
Platform: macOS-15.5-arm64-arm-64bit

Python version

phil@Phils-MacBook-Pro docling % uv run python --version
Python 3.12.10
