Open
Labels: bug (Something isn't working)
Description
Bug
Docling finds no text in META-NET PDFs (example: Bulgarian).
The PDFs are authored in LaTeX.
Steps to reproduce
from docling.document_converter import DocumentConverter

# Create a converter instance
converter = DocumentConverter()
# Convert a document
result = converter.convert("https://european-language-equality.eu/wp-content/uploads/2024/12/bulgarian.pdf")
doc = result.document
# Print document statistics
print(f"Document title: {doc.name}")
print(f"Number of pages: {len(doc.pages)}")
print(f"Number of tables: {len(doc.tables)}")
print(f"Number of pictures: {len(doc.pictures)}")
print(f"Number of texts: {len(doc.texts)}")
gives:
Document title: META-NET
Number of pages: 89
Number of tables: 14
Number of pictures: 39
Number of texts: 0
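The symptom (many pages, zero text items) can also be checked programmatically. The sketch below is only an illustration: `summarize` and `looks_textless` are hypothetical helper names, not part of docling, and the stand-in object just mirrors the counts reported above. It assumes only that the document exposes `pages`, `tables`, `pictures` and `texts` collections, as used in the reproduction script.

```python
from types import SimpleNamespace


def summarize(doc):
    """Count the main item types of a converted document."""
    return {
        "pages": len(doc.pages),
        "tables": len(doc.tables),
        "pictures": len(doc.pictures),
        "texts": len(doc.texts),
    }


def looks_textless(stats):
    """Flag the symptom in this report: pages were parsed but no text items."""
    return stats["pages"] > 0 and stats["texts"] == 0


# Stand-in object carrying the counts reported above (not a real docling document).
reported = SimpleNamespace(
    pages=range(89), tables=range(14), pictures=range(39), texts=[]
)
stats = summarize(reported)
print(stats, looks_textless(stats))
```

With a real conversion, `summarize(result.document)` should work the same way, since it only calls `len()` on those four attributes.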
But with SmolDocling VLM:
Document title: Document
Number of pages: 89
Number of tables: 13
Number of pictures: 17
Number of texts: 946
(The VLM run is much slower, though: 9.7 sec/page vs 0.67 sec/page on an M4 MacBook Pro. It also has other detection problems that are separate from this issue.)
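For anyone trying to reproduce the comparison: the exact invocation used for the VLM run is not shown here. One way it might be set up, assuming docling's `VlmPipeline` and `PdfFormatOption(pipeline_cls=...)` as shown in the docling examples, is sketched below. `build_vlm_converter` is a hypothetical wrapper name, and which VLM the pipeline uses by default may depend on the docling version.

```python
def build_vlm_converter():
    """Return a DocumentConverter that routes PDFs through the VLM pipeline.

    Imports are kept inside the function so that defining it does not
    load docling or any model dependencies until it is actually called.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.pipeline.vlm_pipeline import VlmPipeline

    # Assumption: the default VlmPipeline options select the intended model
    # (SmolDocling in this report); check this for your docling version.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline),
        }
    )
```

Calling `build_vlm_converter().convert(url)` with the same URL as above and printing the same statistics would then give a like-for-like comparison with the default pipeline.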
Docling version
phil@Phils-MacBook-Pro docling % uv run docling --version
Docling version: 2.43.0
Docling Core version: 2.43.1
Docling IBM Models version: 3.9.0
Docling Parse version: 4.1.0
Python: cpython-312 (3.12.10)
Platform: macOS-15.5-arm64-arm-64bit
Python version
phil@Phils-MacBook-Pro docling % uv run python --version
Python 3.12.10
Tekki