-
We welcome contributions from our users. A big thank you for your interest and initiative. You have chosen a subject that is one of the most challenging in all of document understanding. Reliably identifying a ToC from document content actually requires understanding the whole document, a task that almost nobody in the market has solved yet. The approximate solutions I have seen are all AI-based in one way or another. Even then, such tools do not dare to claim that they can correctly identify chapter / section header levels. All they can usually determine is that a line is at least some kind of "title".
-
Title: Feature Proposal: Heuristic-based Outline Extractor
The Problem
Many PDF documents, especially academic papers or scanned books, lack an embedded Table of Contents (ToC). This makes programmatic navigation and content analysis difficult, as doc.get_toc() returns an empty list. Users currently have no built-in way to quickly understand the structure of such documents.
Proposed Solution
I've developed a Python tool, PDFOutlineExtractor, that heuristically generates a hierarchical outline from a PDF by analyzing its content and structure. This would allow users to get a high-quality ToC even when one isn't formally embedded in the file.
The core of the solution is a set of heuristics that identify potential headings based on common document formatting conventions. These include:
Font Size & Style: Identifying text that is significantly larger or bolder than the estimated body text.
Text Patterns: Recognizing numbered headings (e.g., "1. Introduction", "2.1 Methods", "IV. Conclusion").
Positional Analysis: Excluding text in common header/footer regions.
Document Metadata: Using the document's metadata as a starting point for the main title.
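A minimal sketch of how the font, pattern, and position heuristics above could be combined. This is not the PDFOutlineExtractor implementation. It assumes span dictionaries shaped like those returned by PyMuPDF's page.get_text("dict") (keys "text", "size", "flags", "bbox", with bit 16 of "flags" marking bold). The names body_size and is_heading and the thresholds margin and ratio are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Numbered headings such as "1. Introduction", "2.1 Methods", "IV. Conclusion".
NUMBERED = re.compile(r"^(?:\d+(?:\.\d+)*\.?|[IVXLCDM]+\.)\s+\S")


def body_size(spans):
    """Estimate the body font size: the size that covers the most characters."""
    counts = Counter()
    for span in spans:
        counts[round(span["size"], 1)] += len(span["text"])
    return counts.most_common(1)[0][0]


def is_heading(span, body, page_height, margin=0.08, ratio=1.15):
    """Guess whether a span is a heading (illustrative thresholds only)."""
    text = span["text"].strip()
    if not text:
        return False
    # Positional analysis: skip the assumed header/footer bands.
    top = span["bbox"][1]
    if top < margin * page_height or top > (1 - margin) * page_height:
        return False
    larger = span["size"] >= ratio * body   # clearly bigger than body text
    bold = bool(span["flags"] & 16)         # bold bit in PyMuPDF span flags
    numbered = bool(NUMBERED.match(text))   # looks like "2.1 Methods"
    # A number alone is not enough, since list items are numbered too.
    return larger or (bold and numbered)
```

Requiring bold in addition to a numbering pattern is one way to keep ordinary numbered list items out of the outline. It will still miss unnumbered headings set in body-size bold, which is the kind of case where pure heuristics tend to struggle.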
The output is a hierarchical list of nodes, structured identically to the output of doc.get_toc(), ensuring a consistent and familiar API for existing users.
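To illustrate the output shape, here is a hedged sketch of turning detected headings into [level, title, page] entries like those in the simple form of doc.get_toc(). The function to_toc and the rule "larger font means higher level" are assumptions made for this example, not a description of the actual tool. The level is capped so it never jumps by more than one step, since a ToC that skips levels is, as far as I know, rejected by doc.set_toc().

```python
def to_toc(headings):
    """Map (title, font_size, page) tuples, in reading order, to ToC entries.

    Assumption: the largest heading font is level 1, the next largest
    level 2, and so on.
    """
    sizes = sorted({round(size, 1) for _, size, _ in headings}, reverse=True)
    rank = {size: i + 1 for i, size in enumerate(sizes)}

    toc = []
    prev_level = 0
    for title, size, page in headings:
        # Never go deeper than one level below the previous entry.
        level = min(rank[round(size, 1)], prev_level + 1)
        toc.append([level, title, page])
        prev_level = level
    return toc
```

Font size alone cannot tell a chapter from a section when both use the same size, so levels produced this way are only a first guess.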
Proposed API
I envision this being added as a new method to the fitz.Document class, making it incredibly easy to use. For example:
Python
import fitz
doc = fitz.open("document_without_toc.pdf")
# The new method would generate the outline.
# Note: extract_outline() is the proposed method; it does not exist in PyMuPDF yet.
heuristic_toc = doc.extract_outline()
# The output format would match the existing get_toc():
# [[1, 'Introduction', 1], [2, 'Chapter 1', 5], [3, 'Section 1.1', 6]]
print(heuristic_toc)
Next Steps
This is my first time contributing to a project of this scale, and I'm eager to help. I have a working, documented implementation ready.
If this is a feature the maintainers would be interested in, I would be happy to open a pull request. I'm fully open to feedback on the approach, the proposed API, and any necessary changes to align with the project's standards.
Thank you for your time and for maintaining this fantastic library!
I have attached a sample input and output for my working implementation:
2401.00368v3.pdf
2401.00368v3.json