feat: heuristic LaTeX recovery for formula regions (0.7.5) by abimaelmartell · Pull Request #42 · firecrawl/pdf-inspector

abimaelmartell · 2026-04-15T23:02:30Z

Summary

Adds extractFormulasInRegionsAsLatex (NAPI) and extract_formulas_in_regions_as_latex (Rust) — converts the linearized text from formula bboxes into LaTeX using positioned text data, with a calibrated confidence score so callers can gate on quality.

Builds on top of extractFormulasInRegions from #40. Bumps NAPI package to 0.7.5.

Pipeline (per formula bbox)

1. Pull positioned text items inside the bbox via the existing positioned-text extractor
2. Apply Unicode → LaTeX char mapping (160+ symbols: Greek, operators, relations) via a dedicated unicode_map module
3. Detect simple structure from item geometry: sub/superscripts via font-size + y-baseline, basic two-row fractions
4. Score the result with positive points for clean conversion, plus penalties for failure modes that produce broken output
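The mapping and two-row fraction steps can be sketched as follows. This is a minimal illustration, not the crate's code: the `Item` struct, `map_char`, `to_latex`, `group_rows`, and `reconstruct` are hypothetical names, the symbol table is a four-entry stand-in for the real unicode_map module, the row tolerance of half a font size is an assumption, and sub/superscript detection is left out.

```rust
/// Hypothetical positioned text item (the real extractor's type is not shown in this PR).
#[derive(Debug, Clone)]
struct Item {
    text: String,
    x: f32,
    y: f32, // baseline; assumes PDF coordinates where larger y is higher on the page
    font_size: f32,
}

/// Tiny stand-in for the unicode_map module.
fn map_char(c: char) -> Option<&'static str> {
    match c {
        'ω' => Some("\\omega"),
        'α' => Some("\\alpha"),
        '≪' => Some("\\ll"),
        '≤' => Some("\\leq"),
        _ => None,
    }
}

/// Replace mapped Unicode symbols with LaTeX commands, leaving other chars as-is.
fn to_latex(text: &str) -> String {
    let mut out = String::new();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        match map_char(c) {
            Some(cmd) => {
                out.push_str(cmd);
                // A command directly followed by a letter would read as a
                // different command name, so insert a separating space.
                if chars.peek().map_or(false, |n| n.is_ascii_alphabetic()) {
                    out.push(' ');
                }
            }
            None => out.push(c),
        }
    }
    out
}

/// Cluster items into rows by baseline (assumed tolerance: half a font size),
/// top row first, each row sorted left to right.
fn group_rows(items: &[Item]) -> Vec<Vec<Item>> {
    let mut sorted = items.to_vec();
    sorted.sort_by(|a, b| b.y.total_cmp(&a.y));
    let mut rows: Vec<Vec<Item>> = Vec::new();
    for item in sorted {
        match rows.last_mut() {
            Some(row) if (row[0].y - item.y).abs() < 0.5 * item.font_size => row.push(item),
            _ => rows.push(vec![item]),
        }
    }
    for row in &mut rows {
        row.sort_by(|a, b| a.x.total_cmp(&b.x));
    }
    rows
}

fn row_latex(row: &[Item]) -> String {
    row.iter().map(|i| to_latex(&i.text)).collect()
}

/// Exactly two rows are treated as numerator over denominator;
/// anything else is emitted row by row, joined with spaces.
fn reconstruct(items: &[Item]) -> String {
    let rows = group_rows(items);
    match rows.as_slice() {
        [top, bottom] => format!("\\frac{{{}}}{{{}}}", row_latex(top), row_latex(bottom)),
        _ => rows.iter().map(|r| row_latex(r)).collect::<Vec<_>>().join(" "),
    }
}
```

With two stacked items `dW1` and `dω`, `reconstruct` yields `\frac{dW1}{d\omega}`. Treating every two-row layout as a fraction is exactly the kind of shortcut the penalties in the next section are meant to catch.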

Calibrated confidence

The calibration subtracts additive penalties from the positive score and clamps the result to [0, 1]. Each penalty fires on a failure mode that manual inspection surfaced:

Penalty	Trigger	Value
MANY_ITEMS	item_count > 15	-0.20
VERY_MANY_ITEMS	item_count > 25 (additional)	-0.10
MULTI_BAND	3+ distinct y-clusters	-0.30
FRAC_MISMATCH (width)	fraction fired AND denominator x-range > 1.5× numerator x-range	-0.30
FRAC_LEFT_SHIFT	fraction fired AND denominator starts > 0.5× numerator width to the left of the numerator	-0.30
HUGE_OPERATOR	item with font_size > 1.3× median AND char in {∫ ∑ ∏ √ ∮ ∯ ∰}	-0.30
FONT_SIZE_VARIANCE	items in same y-band have > 2 distinct font sizes	-0.15

FormulaResult.confidence_breakdown: Vec<String> lists which penalties fired for each formula — useful for inspection and tuning.
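As a rough sketch of how the penalty table could be applied: the `Features` struct, its field names, the `calibrate` function, and the breakdown string format below are all hypothetical, and the base score is taken as an input because the positive-points rules are not spelled out here. Only the triggers and values come from the table.

```rust
/// Hypothetical per-formula measurements; field names are illustrative only.
struct Features {
    item_count: usize,
    y_bands: usize,
    fraction_fired: bool,
    numer_width: f32,
    denom_width: f32,
    /// How far the denominator starts to the left of the numerator.
    denom_left_shift: f32,
    /// Some item is > 1.3× median font size and is one of ∫ ∑ ∏ √ ∮ ∯ ∰.
    huge_operator: bool,
    /// Largest number of distinct font sizes seen within one y-band.
    font_sizes_in_band: usize,
}

/// Subtract every penalty that fires from `base`, clamp to [0, 1],
/// and report which penalties fired.
fn calibrate(base: f32, f: &Features) -> (f32, Vec<String>) {
    let mut fired: Vec<(&str, f32)> = Vec::new();
    if f.item_count > 15 {
        fired.push(("MANY_ITEMS", 0.20));
    }
    if f.item_count > 25 {
        fired.push(("VERY_MANY_ITEMS", 0.10)); // stacks on top of MANY_ITEMS
    }
    if f.y_bands >= 3 {
        fired.push(("MULTI_BAND", 0.30));
    }
    if f.fraction_fired && f.denom_width > 1.5 * f.numer_width {
        fired.push(("FRAC_MISMATCH", 0.30));
    }
    if f.fraction_fired && f.denom_left_shift > 0.5 * f.numer_width {
        fired.push(("FRAC_LEFT_SHIFT", 0.30));
    }
    if f.huge_operator {
        fired.push(("HUGE_OPERATOR", 0.30));
    }
    if f.font_sizes_in_band > 2 {
        fired.push(("FONT_SIZE_VARIANCE", 0.15));
    }
    let total: f32 = fired.iter().map(|(_, v)| v).sum();
    let confidence = (base - total).clamp(0.0, 1.0);
    let breakdown: Vec<String> = fired.iter().map(|(n, v)| format!("{n} -{v:.2}")).collect();
    (confidence, breakdown)
}
```

For example, a 20-item formula spread over three y-bands with a base score of 0.9 loses 0.20 + 0.30 and lands at about 0.4, which moves it from the HIGH band into LOW.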

Validation

Tested against PP-DocLayoutV3's actual formula bboxes from a formula-heavy academic test PDF (74 formulas).

Confidence band	Phase 1 (positive points only)	After calibration
HIGH (> 0.85)	60 (81%)	28 (37.8%)
MID (0.5 - 0.85)	16 (22%)	12 (16.2%)
LOW (< 0.5)	~24	34 (45.9%)
Mean confidence	~0.82	0.536

Manual inspection of the recalibrated bands:

HIGH: all spot-checked entries produce structurally correct LaTeX. Examples: `dW1\ndω` → `\frac{dW1}{d\omega}`, and `region z ≪ 1.` → `region z \ll 1.`
MID: Greek/operator mapping is correct, but structural reconstruction is uncertain. Handling is caller-dependent: a strict orchestrator can fall back to OCR; a permissive one can use the LaTeX as-is.
LOW: multi-row display equations where linearization fails. These should fall back to OCR. Examples include 8-band mega-formulas where even the raw text is jumbled.
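A caller-side gate over these bands might look like the sketch below. The `Policy` and `Route` enums and the `route` function are hypothetical and not part of the crate; only the band thresholds (0.85 and 0.5) are taken from the validation table.

```rust
/// Hypothetical caller policy for MID-band results.
#[derive(Debug, Clone, Copy)]
enum Policy {
    Strict,
    Permissive,
}

/// Hypothetical routing decision made by the orchestrator.
#[derive(Debug, PartialEq)]
enum Route {
    UseLatex,
    FallBackToOcr,
}

/// Route a formula by confidence band: HIGH (> 0.85) is used directly,
/// LOW (< 0.5) always goes to OCR, and MID depends on the caller's policy.
fn route(confidence: f32, policy: Policy) -> Route {
    if confidence > 0.85 {
        Route::UseLatex
    } else if confidence >= 0.5 {
        match policy {
            Policy::Strict => Route::FallBackToOcr,
            Policy::Permissive => Route::UseLatex,
        }
    } else {
        Route::FallBackToOcr
    }
}
```

In practice the `confidence` argument would come from `FormulaResult.confidence`; how the crate's own `needs_ocr` flag relates to these bands is not stated here, so the sketch does not use it.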

Quality note

This is intended as a Phase 1 baseline. The current heuristic does not handle:

Matrices (\begin{matrix}...\end{matrix})
Multi-line \begin{align} equations
Bounded operators (\int_a^b, \sum_{i=0}^n)
Nested fractions
Delimiter pairs (\left( ... \right))

These are deliberately Phase 2 territory. They produce low-confidence output today so callers can route them to OCR.

API surface

pub fn extract_formulas_in_regions_as_latex(
    buffer: &[u8],
    page_regions: &[(u32, Vec<[f32; 4]>)],
) -> Result<Vec<PageFormulaResult>, PdfError>;

pub struct FormulaResult {
    pub latex: String,
    pub raw_text: String,
    pub confidence: f32,
    pub needs_ocr: bool,
    pub confidence_breakdown: Vec<String>,
}

NAPI mirror: extractFormulasInRegionsAsLatex(buffer, pageRegions): PageFormulaLatexResults[].

Test plan

All previously-passing tests still pass — 393 lib + 104 integration + 2 doc
33 new unit tests — 8 penalty tests + 15 reconstruction tests + 10 unicode-map tests
cargo clippy --lib --bin detect-pdf --bin pdf2md --bin dump_ops --bin probe-formulas --bin probe-formulas-latex -- -D warnings: clean
probe-formulas-latex binary added for manual quality inspection (compares raw-text vs LaTeX side-by-side with penalty breakdown per formula)

…ntary fonts (0.7.4) (#41)

* fix(detector): correct false flags for CID-encoded text and supplementary fonts (0.7.4)

The page classifier was over-aggressively flagging Mixed-PDF pages as needing OCR in three distinct cases. Each is fixed at the root in analyze_page_content / page_has_identity_h_no_tounicode / the looks_like_scan check.

1. has_vector_text false positives on dense layouts

path_ops > text_ops*200 fired on pages with decorative paths (column borders, dividers) alongside real selectable text. Added a unique_alphanum_chars < 30 guard: real outlined-text pages have very few unique alphanum chars (each glyph is a path), while pages with real text + decorations have many.

2. Identity-H without ToUnicode flagged whole pages on supplementary fonts

page_has_identity_h_no_tounicode would flag a page if any single Type0 font lacked ToUnicode and had no fallback CMap, even when the page's actual text came from other decodable fonts (Type1 with ToUnicode, etc.). Rewrote to track both undecodable Identity-H fonts AND other decodable fonts, only flagging when no decodable text font is present.

3. CID-encoded text with ToUnicode misclassified as scan

looks_like_scan checked unique_alphanum_chars < 10 on raw string operand bytes. CID-encoded fonts (Type0 with ToUnicode) emit 2-byte CID values that aren't ASCII alphanum, so the metric is blind to them even when the text is fully decodable. Added a has_decodable_text_fonts signal: when a page has decodable fonts AND >= 10 text ops, the low alphanum count is treated as a CID encoding artifact rather than evidence of a scan.

Validated against a broad PDF corpus:
- 6 known false-positive pages now correctly classified as text
- 22 previously-missed scan pages (cover/blank/photo) now correctly flagged for OCR
- 0 regressions on truly-scanned PDFs (61/61 pages stay flagged)
- All 437 existing tests pass; clippy clean

Bumps NAPI package to 0.7.4.
* test(detector): add unit tests for the three classifier fixes

Adds 10 unit tests covering the heuristic changes:
- has_vector_text alphanum guard
  - real text + decorative paths → not flagged
  - true outlined glyphs (low alphanum) → still flagged
- page_has_identity_h_no_tounicode supplementary-font handling
  - undecodable Identity-H + decodable Type1 → not flagged (new)
  - undecodable Identity-H alone → still flagged (regression)
- page_has_decodable_text_fonts (new helper)
  - Type1 → true
  - Type0 with ToUnicode → true
  - undecodable Identity-H only → false
- looks_like_scan with has_decodable_text_fonts override
  - CID-encoded decodable text → not flagged as scan
  - same metrics with no decodable fonts → still flagged
  - decodable fonts but text_ops < 10 (page-number overlay) → still flagged

* fix(detector): make decodable-font checks usage-based and XObject-aware

Addresses two reviewer concerns on the previous heuristic fix:

P1: resource-based check could create an inverse bug

page_has_identity_h_no_tounicode and page_has_decodable_text_fonts iterated all fonts in the page Resources dict, including unused fonts. A page whose actual text was rendered exclusively in an undecodable Identity-H font but whose Resources also listed an unused decodable Type1 would be wrongly unflagged.

Fix: parse Tf operator operands during content stream scanning to collect the set of font names actually referenced. The font checks now filter to only USED fonts via a new used_fonts_have_* family of functions operating on (used_font_names, font_map).

P2: checks didn't follow text into Form XObjects

analyze_page_content correctly recurses through Form XObjects via scan_xobjects_in_resources, but the font checks only looked at the page's top-level Resources/Font. Pages that render text through Form XObjects (corporate templates, header/footer overlays) had their XObject font resources missed entirely.
Fix: scan_xobjects_in_resources now propagates the used_font_names set AND collects fonts from each Form XObject's own Resources into the shared font_map. The usage-based check sees the full picture: page-level fonts + every nested XObject's fonts, intersected with fonts actually referenced by Tf operators anywhere in the content.

Implementation:
- New extract_font_name_before_tf helper (parses /Name immediately preceding Tf).
- New FontInfo struct caches font properties per-name.
- New collect_fonts_from_resource_dict + new used_fonts_have_* functions are pure filters over (used_names, font_map).
- analyze_page_content threads used_font_names + font_map through page content scan and XObject recursion, then runs the new checks.
- Old resource-based functions kept as #[cfg(test)] for the existing unit-test interface.
- Phase 3 uncached-page loop now goes through analyze_page_content so it also gets the usage-based + XObject-aware behavior.

Tests added (8):
- extract_font_name_before_tf basic + long-name parsing
- scan_content_for_text_operators collects used font names
- P1: unused decodable font in Resources doesn't save a page whose used font is undecodable
- P1: both fonts used → decodable font correctly prevents flag
- P2: decodable font inside Form XObject correctly unflags
- P2: undecodable font only in XObject still flags even with unused decodable font at page level
- P2: has_decodable_text_fonts populated from XObject fonts

Validation:
- 349 lib + 104 integration + 2 doc tests pass (was 341)
- cargo clippy --lib --bin detect-pdf -- -D warnings: clean
- External eval: 9/9 PDFs pass, 6/6 false positives resolved, 0 regressions, 61/61 scanned pages still correctly flagged
- No eval delta: confirms previous fix wasn't relying on the resource-based bug for any of the eval PDFs

* fix(detector): scope font lookups by ObjectId + handle indirect Form Resources

Addresses two more reviewer findings on the previous decodable-font commit.
P1: Resource-name scoping bug

The previous fix keyed used_font_names and font_map by raw resource names like b"F1". PDF resource names are scoped to each resource dictionary: a Form XObject can legally define its own /F1 that points to a completely different font from the page's /F1. Because collect_fonts_from_resource_dict skipped duplicates with `if font_map.contains_key(name)`, the first definition won and later Tf /F1 usages in different scopes resolved against the wrong font. This could reintroduce both the undecodable-Identity-H false flag and the decodable-CID false unflag depending on which side of the collision happened to be inserted first.

Fix: switch the lookup mechanism from font names to font ObjectIds.
- font_map: HashMap<ObjectId, FontInfo> (was Vec<u8> keys)
- used_font_ids: HashSet<ObjectId> (was Vec<u8> names)
- new resolve_font_names_to_ids() runs immediately after each content scan, against the resource dict in scope, to translate the per-scope name set into ObjectIds.

Each Form XObject's content stream now resolves /F1 against THAT XObject's own Resources, so name collisions are impossible by design. Inline (no-ID) font dicts are skipped: they are extremely rare in practice and have no stable key.

P2: Indirect Form /Resources skipped

scan_xobjects_in_resources used `.as_dict()` on the Form's /Resources entry, which returns None for indirect references. PDFs frequently store /Resources as `X 0 R`, in which case font collection and recursion were both skipped, even though the Tf usages inside the XObject content had already been recorded.

Fix: handle Object::Reference(r) in addition to Object::Dictionary(d) by resolving via doc.get_dictionary. Audited the rest of the file: the other /Resources access points (analyze_page_images, collect_images_from_resources) already handled both cases.

Tests added (4):
- P1 same-name-different-font (page undecodable, XObject decodable): must NOT flag, because the XObject's text is decodable in its own scope.
- P1 inverse (page decodable, XObject undecodable, content uses XObject /F1): MUST flag, because undecodable text exists in real scope.
- P2 indirect Form /Resources: font discovery must still work when /Resources is a `X 0 R` reference rather than inline.
- Combined regression: indirect Resources + name collision.

Validation:
- cargo test --release: 459 tests pass (353 lib + 104 integration + 2 doc)
- cargo clippy --lib --bin detect-pdf -- -D warnings: clean
- external eval (9 PDFs): 9/9 pass, 6/6 false positives resolved, 0 regressions, 61/61 truly-scanned pages still flagged

The behavior on the eval set is identical, which confirms the correctness fix isn't masking any change in classifier outcomes.

* fix(detector): respect resource shadowing when resolving page-content fonts

The previous ObjectId-based fix correctly scoped Form XObject fonts but still violated PDF resource inheritance for page content. When a page overrides /F1 from a parent /Pages node (different font dict for the same name), get_page_resources returns the page's own /Resources plus all ancestor /Resources dicts. The old code called resolve_font_names_to_ids on each one and added every match to used_font_ids, so both font ObjectIds ended up in the used set even though only the page's /F1 is actually visible to that page's content.

Per ISO 32000-1 §7.7.3.4, resource names are inherited with shadowing semantics: the most-specific (deepest, closest to the page) definition wins.

Fix:
- New lookup_font_id helper resolves a single name in a single dict.
- New resolve_with_shadowing iterates names, checking the page's own /Resources first, then walking ancestors in most-specific-first order (which is the order lopdf's get_page_resources returns). First hit wins via a labeled `continue 'name`; subsequent ancestors are skipped for that name.
- analyze_page_content's flat resolution loop replaced with one call to resolve_with_shadowing.
Audit:
- XObject path is correct: each Form XObject already resolves names against its OWN /Resources (XObjects don't inherit from page tree).
- font_map population is correct: keyed by ObjectId, so collecting from all dicts builds the full available-fonts catalog. The bug was only in the used-set resolution.
- Confirmed lopdf returns ancestors in most-specific-first order (page → parent → grandparent → root), matching the shadowing direction used here.

Tests added (3):
- page /F1 undecodable shadows parent's decodable /F1 → MUST flag
- page /F1 decodable shadows parent's undecodable /F1 → MUST NOT flag
- no override: page inherits parent's decodable /F1 → MUST NOT flag

Validation:
- cargo test --release: 462 tests pass (356 lib + 104 integration + 2 doc)
- cargo clippy --lib --bin detect-pdf -- -D warnings: clean
- external eval: 9/9, 0 regressions, 6/6 false positives resolved, 61/61 scanned pages still correctly flagged
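
The shadowing rule described in the last commit above can be illustrated with a small standalone sketch. It is an assumption-laden simplification, not the detector's code: `FontDict` is a hypothetical stand-in for a /Font resource dictionary (name to object ID), `ObjectId` only mirrors the shape of lopdf's type, and the function signature is invented for the example. The dicts are assumed to arrive in most-specific-first order.

```rust
use std::collections::{HashMap, HashSet};

/// Same shape as lopdf's ObjectId (object number, generation); redefined here
/// so the sketch has no external dependencies.
type ObjectId = (u32, u16);

/// Hypothetical simplified /Font dictionary: resource name -> font object ID.
type FontDict = HashMap<Vec<u8>, ObjectId>;

/// Resolve each used font name against `dicts`, which must be ordered
/// page first, then parent, grandparent, and so on. The first dict that
/// defines a name wins; later (less specific) definitions are ignored.
fn resolve_with_shadowing(used_names: &[Vec<u8>], dicts: &[FontDict]) -> HashSet<ObjectId> {
    let mut used_ids = HashSet::new();
    'name: for name in used_names {
        for dict in dicts {
            if let Some(id) = dict.get(name) {
                used_ids.insert(*id);
                // Shadowing: stop at the most specific definition.
                continue 'name;
            }
        }
        // Names defined nowhere are simply left unresolved.
    }
    used_ids
}
```

If the page defines /F1 as object 10 and its parent defines /F1 as object 20 and /F2 as object 21, then resolving F1 and F2 gives objects 10 and 21 only. Object 20 stays out of the used set, which is the behavior the three new tests check from the classifier's side.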

Adds extract_formulas_in_regions_as_latex (and NAPI export extractFormulasInRegionsAsLatex), which converts the linearized text from formula bboxes into LaTeX using positioned text data, with a calibrated confidence score so callers can gate on quality.

Pipeline (per formula bbox):
1. Pull positioned text items inside the bbox via the existing positioned-text extractor
2. Apply unicode → LaTeX char mapping (160+ symbols: Greek, operators, relations) from a dedicated unicode_map module
3. Detect simple structure from item geometry: sub/superscripts by font-size + y-baseline, basic two-row fractions
4. Score the result with positive points for clean conversion, plus penalties for failure modes that produce broken output

Confidence is honest, not optimistic. Penalties applied:
- many items (>15) and very many (>25)
- 3+ distinct y-bands (multi-row display equations)
- fraction fired but denominator x-range much wider than numerator (cross-equation false positive)
- fraction fired but denominator starts well to the left (likely separate expression below)
- large operator (∫ ∑ ∏ √ etc) bigger than 1.3× median; these need bounded-operator structure detection (Phase 2)
- mixed font sizes within a single y-band

The high-confidence band (>0.85) on a formula-heavy academic test PDF dropped from 81% → 38% after recalibration. Manual inspection confirms the new high-confidence band contains structurally-correct LaTeX only, with no false positives. Mid (0.5-0.85) and low (<0.5) bands hold the cases where structural reconstruction is uncertain or broken; callers should fall back to OCR for those.

Also includes:
- probe-formulas-latex eval binary (compares raw text vs LaTeX side-by-side with confidence breakdown for quality inspection)
- probe-formulas eval binary (validates raw extract API)
- 8 new unit tests for penalty calculation
- 15 + 10 unit tests for reconstruction and unicode mapping

Bumps NAPI package to 0.7.5.
Builds on extractFormulasInRegions from the previous formula-extraction feature.

The probe-formulas and probe-formulas-latex binaries were used to validate the LaTeX recovery during development but are not needed by consumers of the library. Removing them also drops the serde runtime dependency, which was only required by the probes. The probes can live as standalone scripts outside the published crate. No production behavior change. All 393 lib + 104 integration + 2 doc tests still pass; clippy clean.

abimaelmartell added 4 commits April 15, 2026 13:47

Merge remote-tracking branch 'origin/abimaelmartell/formula-extraction' into feat/formula-latex-recovery (0b3b037)

abimaelmartell wants to merge 4 commits into abimaelmartell/formula-extraction from feat/formula-latex-recovery
