Open
Labels
bug (Something isn't working)
Description
Bug
On META-NET PDFs (example: Bulgarian), Docling correctly finds the text in the document but fails to detect the title and section headers, fails to follow the bilingual reading order (where each column continues across pages), and fails to recognize a paragraph that continues onto the next page.
The PDFs are authored in LaTeX (example: Bulgarian).
Note that 30 languages are available, all in the same format, though built with different LaTeX approaches: https://european-language-equality.eu/meta-net-white-paper-series/
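To make the expected reading order concrete, here is a minimal sketch of the ordering I would expect for the bilingual pages. It is only an illustration, not Docling code: the `Block` type, the `bilingual_reading_order` helper, and the "left"/"right" column names are hypothetical, and it assumes each language occupies the same column on every page.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical stand-in for one text block on a two-column page."""
    page: int
    column: str  # "left" or "right" (assumed: one language per column)
    text: str


def bilingual_reading_order(blocks):
    """Order blocks column-first across pages.

    The whole left-column stream is read before the right-column stream,
    instead of finishing each page before moving to the next.
    """
    column_rank = {"left": 0, "right": 1}
    return sorted(blocks, key=lambda b: (column_rank[b.column], b.page))


# Blocks in naive page-by-page order.
blocks = [
    Block(4, "left", "language A, page 4"),
    Block(4, "right", "language B, page 4"),
    Block(5, "left", "language A, page 5"),
    Block(5, "right", "language B, page 5"),
]
ordered = [b.text for b in bilingual_reading_order(blocks)]
print(ordered)
```

With this ordering, the language A text on page 5 directly follows the language A text on page 4, which is the continuation the converter currently misses.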
Steps to reproduce
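import platform

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (  # Additional imports for VLM
    VlmPipelineOptions,
    smoldocling_vlm_conversion_options,
    smoldocling_vlm_mlx_conversion_options,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline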
if (
    "darwin" in platform.system().lower()
    and "arm64" in platform.machine().lower()
):  # optimized for Apple Silicon (MLX)
    vlm_options = smoldocling_vlm_mlx_conversion_options
else:
    vlm_options = smoldocling_vlm_conversion_options
vlm_pipeline_options = VlmPipelineOptions(
    force_backend_text=False,
    vlm_options=vlm_options,
)
converter_vlm = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=vlm_pipeline_options,
            pipeline_cls=VlmPipeline,
        )
    }
)
vlm_result = converter_vlm.convert("https://european-language-equality.eu/wp-content/uploads/2024/12/bulgarian.pdf")
vlm_doc = vlm_result.document
from docling_core.transforms.visualizer.layout_visualizer import LayoutVisualizer
layout_visualizer = LayoutVisualizer()
page_images = layout_visualizer.get_visualization(doc=vlm_doc)
display(page_images[1]) # title is two languages, split down the middle
display(page_images[4]) # two-column bilingual layout
display(page_images[5]) # two-column bilingual layout continues
display(page_images[8]) # very long paragraph continues to next page
display(page_images[9])
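Besides the page images, the missing title and section headers can be checked by counting the labels of the converted items. This is a rough sketch: `count_labels` is a hypothetical helper, and it assumes `vlm_doc.iterate_items()` yields `(item, level)` pairs whose items carry a `label`. The stand-in data below only lets the snippet run without Docling; it is not real output from the Bulgarian PDF.

```python
from collections import Counter
from types import SimpleNamespace


def count_labels(items):
    """Count layout labels over (item, level) pairs."""
    counts = Counter()
    for item, _level in items:
        label = getattr(item, "label", None)
        # Enum labels expose .value; plain strings are used as-is.
        counts[str(getattr(label, "value", label))] += 1
    return counts


# Stand-in items (made up for illustration); with Docling installed,
# pass vlm_doc.iterate_items() instead.
sample_items = [
    (SimpleNamespace(label="text"), 1),
    (SimpleNamespace(label="text"), 1),
    (SimpleNamespace(label="picture"), 1),
]
label_counts = count_labels(sample_items)
print(label_counts)
```

If the counts for the title and section-header labels come back as zero on a document that visibly has headings, that matches the behavior reported above.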
Docling version
phil@Phils-MacBook-Pro docling % uv run docling --version
Docling version: 2.43.0
Docling Core version: 2.43.1
Docling IBM Models version: 3.9.0
Docling Parse version: 4.1.0
Python: cpython-312 (3.12.10)
Platform: macOS-15.5-arm64-arm-64bit
Python version
phil@Phils-MacBook-Pro docling % uv run python --version
Python 3.12.10