Feature Proposal : heuristic-based Outline-extractor #4652

Indecre · 2025-08-16T05:17:42Z


Title: Feature Proposal: Heuristic-based Outline Extractor

The Problem

Many PDF documents, especially academic papers or scanned books, lack an embedded Table of Contents (ToC). This makes programmatic navigation and content analysis difficult, as doc.get_toc() returns an empty list. Users currently have no built-in way to quickly understand the structure of such documents.

Proposed Solution

I've developed a Python tool, PDFOutlineExtractor, that heuristically generates a hierarchical outline from a PDF by analyzing its content and structure. This would allow users to get a high-quality ToC even when one isn't formally embedded in the file.

The core of the solution is a set of heuristics that identify potential headings based on common document formatting conventions. These include:

Font Size & Style: Identifying text that is significantly larger or bolder than the estimated body text.

Text Patterns: Recognizing numbered headings (e.g., "1. Introduction", "2.1 Methods", "IV. Conclusion").

Positional Analysis: Excluding text in common header/footer regions.

Document Metadata: Using the document's metadata as a starting point for the main title.
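As a rough illustration of the font-size and positional heuristics listed above, here is a minimal sketch that works on plain dictionaries rather than on PyMuPDF objects. The function names, the `margin` and `min_ratio` thresholds, and the line layout (`text`, `size`, `y`) are assumptions made for this example and are not taken from PDFOutlineExtractor.

```python
from collections import Counter

def estimate_body_size(lines):
    """Return the font size that carries the most characters (assumed to be body text)."""
    weight = Counter()
    for ln in lines:
        weight[round(ln["size"], 1)] += len(ln["text"])
    return weight.most_common(1)[0][0]

def heading_candidates(lines, page_height, margin=0.08, min_ratio=1.15):
    """Keep lines that are clearly larger than body text and outside the top/bottom margins.

    `margin` and `min_ratio` are illustrative thresholds, not values from the real tool.
    """
    if not lines:
        return []
    body = estimate_body_size(lines)
    top, bottom = page_height * margin, page_height * (1 - margin)
    out = []
    for ln in lines:
        in_margin = ln["y"] < top or ln["y"] > bottom   # likely running header/footer
        is_large = ln["size"] >= body * min_ratio
        if is_large and not in_margin:
            out.append(ln)
    return out

sample = [
    {"text": "Journal of Examples", "size": 14.0, "y": 30.0},   # running header
    {"text": "1. Introduction", "size": 14.0, "y": 120.0},
    {"text": "Body text " * 20, "size": 10.0, "y": 150.0},
    {"text": "More body text " * 20, "size": 10.0, "y": 400.0},
    {"text": "12", "size": 10.0, "y": 770.0},                   # page number
]
print([ln["text"] for ln in heading_candidates(sample, page_height=792.0)])
```

Weighting sizes by character count rather than by line count keeps a page full of short headings from being mistaken for body text.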

The output is a hierarchical list of nodes, structured identically to the output of doc.get_toc(), ensuring a consistent and familiar API for existing users.
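Entries in the get_toc() format are [level, title, page] lists. A heuristic extractor can easily produce level sequences that skip a level, so a clean-up pass along the following lines may be useful. This is only a sketch: `normalize_levels` is a hypothetical helper, and it assumes that the first entry should be level 1 and that no entry should sit more than one level below its predecessor.

```python
def normalize_levels(entries):
    """Clamp levels so the first entry is level 1 and no entry jumps more than one level deeper."""
    fixed = []
    prev = 0
    for level, title, page in entries:
        level = max(1, min(level, prev + 1))
        fixed.append([level, title, page])
        prev = level
    return fixed

raw = [[2, "Introduction", 1], [4, "Background", 2], [1, "Methods", 5]]
print(normalize_levels(raw))
# [[1, 'Introduction', 1], [2, 'Background', 2], [1, 'Methods', 5]]
```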

Proposed API

I envision this being added as a new method to the fitz.Document class, making it incredibly easy to use. For example:

Python

import fitz

doc = fitz.open("document_without_toc.pdf")

# The new method would generate the outline
heuristic_toc = doc.extract_outline()

# The output format would match the existing get_toc():
# [[1, 'Introduction', 1], [2, 'Chapter 1', 5], [3, 'Section 1.1', 6]]
print(heuristic_toc)

Next Steps

This is my first time contributing to a project of this scale, and I'm eager to help. I have a working, documented implementation ready.

If this is a feature the maintainers would be interested in, I would be happy to open a pull request. I'm fully open to feedback on the approach, the proposed API, and any necessary changes to align with the project's standards.

Thank you for your time and for maintaining this fantastic library!

I have attached a sample input and output for my working implementation:
2401.00368v3.pdf
2401.00368v3.json

JorjMcKie · 2025-08-18T07:53:25Z

JorjMcKie
Aug 18, 2025
Maintainer

We welcome contributions from our users. A big thank you for your interest and initiative.

You have chosen a subject that is one of the most challenging in all of document understanding. Identifying a ToC from document content with sufficient reliability actually requires understanding the whole document, a task that almost nobody in the market has solved yet. The approximate solutions that I have seen are all AI-based in one way or another. Even then, such tools do not dare to claim that they can correctly identify chapter / section header levels. All they can usually find out is that a line is at least some kind of "title".
I must confess that I haven't looked at your code yet. But your example PDF is quite close to an ideal candidate - that is to say: easy, without overly challenging details.

What about headings
Can you demonstrate that you can correctly recreate the ToC of Adobe manuals?
What about PDFs that contain a text-based TOC - usually found among the first pages? Or last?
What about PDFs with table or image cross-references?
etc.

Indecre Aug 18, 2025
Author

Thank you so much for your insightful feedback and for taking the time to review the project. I truly appreciate your perspective.

You are absolutely right—this is one of the most challenging problems in document understanding. My goal with this project isn't to achieve full semantic comprehension, but to provide a robust heuristic-based tool that can generate a useful and accurate outline for a wide range of common document layouts, especially those lacking embedded bookmarks.

You've raised some excellent points about the limitations and challenges. Let me address them based on how the current implementation works.

On Handling Complex Headings (e.g., Adobe Manuals)
This is a great stress test. While the example PDF was straightforward, the logic is built to handle more complex, consistently styled documents like technical manuals. The strategy is multi-layered:

Numbered Headings: The extractor first looks for explicit numbering patterns like 2., 2.1., or IV. using the NUM_PAT regex. In _level_for_line, these patterns are given the highest priority to determine the heading's level (e.g., 2.1.3 is correctly assigned level 3).

Font Size Hierarchy: For unnumbered headings, the script identifies all unique font sizes used by heading candidates and sorts them. The _compute_size_levels function creates a mapping where the largest font size corresponds to level 1, the second largest to level 2, and so on. Adobe manuals are typically well-structured with consistent font sizes for chapters, sections, and subsections, making this method very effective.

Stylistic Cues: As a final fallback, the code uses boldness (_is_bold) and relative size (ln.size >= body_size - 0.5) to identify lower-level headings that might not have a distinct font size but are clearly styled to stand out.

So, while it doesn't "understand" the content, it decodes the document's visual hierarchy, which should allow it to recreate the ToC of a well-formatted manual with high accuracy.
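The three layers described above could be combined roughly as follows. This is a sketch under stated assumptions, not the actual implementation: the real NUM_PAT, _compute_size_levels, _is_bold and _level_for_line are not shown in this thread, so the patterns and helpers below are hypothetical stand-ins. The bold check uses bit 4 (value 16) of a span's "flags", which is how PyMuPDF text dictionaries mark bold text, and placing styled-only headings one level below the deepest size-based level is an assumption made for the example.

```python
import re

# Hypothetical stand-ins for the patterns and helpers named in the post.
DECIMAL_PAT = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")   # "2.", "2.1", "2.1.3"
ROMAN_PAT = re.compile(r"^[IVXLCDM]+\.\s+\S")            # "IV."
BOLD_FLAG = 16                                           # bit 4 of span "flags" = bold

def numbering_level(text):
    """Level implied by explicit numbering, or None if the line is not numbered."""
    m = DECIMAL_PAT.match(text)
    if m:
        return m.group(1).count(".") + 1
    if ROMAN_PAT.match(text):
        return 1
    return None

def size_levels(sizes):
    """Map distinct font sizes to levels: largest size is level 1, next is level 2, ..."""
    distinct = sorted({round(s, 1) for s in sizes}, reverse=True)
    return {size: rank for rank, size in enumerate(distinct, start=1)}

def is_bold(line):
    return bool(line.get("flags", 0) & BOLD_FLAG)

def level_for_line(line, levels_by_size, body_size):
    """Apply the layers in priority order: numbering, font size, stylistic fallback."""
    lvl = numbering_level(line["text"])
    if lvl is not None:
        return lvl
    size = round(line["size"], 1)
    if size > body_size and size in levels_by_size:
        return levels_by_size[size]
    if is_bold(line) and line["size"] >= body_size - 0.5:
        return len(levels_by_size) + 1   # assumption: one level below all size-based levels
    return None

lines = [
    {"text": "User Guide", "size": 18.0, "flags": 0},
    {"text": "2.1.3 Export options", "size": 10.0, "flags": 16},
    {"text": "Getting started", "size": 14.0, "flags": 0},
    {"text": "Note on fonts", "size": 10.0, "flags": 16},
    {"text": "Plain paragraph text.", "size": 10.0, "flags": 0},
]
body_size = 10.0
levels_by_size = size_levels(ln["size"] for ln in lines if ln["size"] > body_size)
print([level_for_line(ln, levels_by_size, body_size) for ln in lines])
# [1, 3, 2, 3, None]
```

Note that the numbering patterns on their own over-match: a body line such as "2024 was a good year" also starts with digits followed by a space, so in practice they would need to be combined with the size and style checks before a line is accepted as a heading.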

On Text-Based Table of Contents Pages
You've identified a key area for future improvement. The current version does not have a specific feature to detect and parse a text-based ToC page (i.e., a page at the beginning with entries and page numbers).

The extractor would process that page like any other, and its lines might be individually identified as heading candidates. However, it wouldn't recognize the page itself as a ToC to be used as the primary source.

Implementing this would be a fantastic addition. It would likely involve a pre-processing step that scans the first few pages for patterns indicative of a ToC (e.g., many lines with dot leaders, text aligned on the left, and page numbers aligned on the right). Thank you for this excellent suggestion! In the meantime, users can extract a text-based ToC using regular expressions alone.
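A regular-expression pass of the kind mentioned here might look like the sketch below. It only covers the dot-leader case, and `TOC_LINE`, `parse_toc_lines`, `looks_like_toc_page` and the `min_hits` threshold are all hypothetical names and values chosen for illustration; nothing like this exists in the current extractor.

```python
import re

# A title, then a run of at least three dots (optionally spaced), then a page number.
TOC_LINE = re.compile(r"^(?P<title>.+?)\s*(?:\.\s?){3,}\s*(?P<page>\d+)\s*$")

def parse_toc_lines(lines):
    """Return (title, page) pairs for lines that look like dot-leader ToC entries."""
    entries = []
    for raw in lines:
        m = TOC_LINE.match(raw.strip())
        if m:
            entries.append((m.group("title").strip(), int(m.group("page"))))
    return entries

def looks_like_toc_page(lines, min_hits=5):
    """Heuristic: a page is treated as a ToC page if enough of its lines parse as entries."""
    return len(parse_toc_lines(lines)) >= min_hits

page_lines = [
    "Contents",
    "1 Introduction ........ 3",
    "2.1 Methods . . . . 12",
    "See page 4 for details",
]
print(parse_toc_lines(page_lines))
# [('1 Introduction', 3), ('2.1 Methods', 12)]
```

ToC pages that align page numbers with whitespace instead of dot leaders would need positional information (x-coordinates) rather than a pattern on the text alone.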

On Cross-References to Tables and Images
This is a crucial detail, and I'm happy to say the extractor already accounts for this. I noticed early on that captions like "Figure 1.2" or "Table 3" were often mistaken for section headers due to their styling.

To prevent this, the _is_heading_candidate function uses a regular expression:

Python

import re

# The dots are escaped and optional so that "Fig. 1" and "Fig 1" match but "Figs 1" does not.
CAPTION_PAT = re.compile(r"^(figure|fig\.?|table|tab\.?)\s*\d+", re.I)

Any line matching this pattern is explicitly excluded from being considered a heading, which significantly cleans up the output and avoids polluting the outline with figure and table references.
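To show how such an exclusion might be applied to a list of candidate lines, here is a short self-contained sketch. `CAPTION_RE` and `drop_captions` are hypothetical names, and the pattern is a variant with the abbreviation dots escaped and made optional; the body of the real _is_heading_candidate is not shown in this thread.

```python
import re

# Illustrative caption pattern: "Figure 1", "Fig. 2", "Table 3", "Tab 4", case-insensitive.
CAPTION_RE = re.compile(r"^(figure|fig\.?|table|tab\.?)\s*\d+", re.I)

def drop_captions(candidates):
    """Remove candidate lines that start like a figure or table caption."""
    return [text for text in candidates if not CAPTION_RE.match(text.strip())]

candidates = [
    "3 Results",
    "Figure 1.2: Training loss",
    "Table 3 Hyperparameters",
    "fig. 4 Overview",
    "Tabular data handling",
]
print(drop_captions(candidates))
# ['3 Results', 'Tabular data handling']
```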

Thank you again for your thoughtful and constructive feedback. It's incredibly valuable and gives me clear directions for strengthening the tool. I appreciate your interest and initiative.

Lastly, I want to bring this to open source so that a solution can start to develop, and I thought fitz would be a great place to start.
