feat: heuristic LaTeX recovery for formula regions (0.7.5) #42
Open
abimaelmartell wants to merge 4 commits into
…ntary fonts (0.7.4) (#41)

* fix(detector): correct false flags for CID-encoded text and supplementary fonts (0.7.4)

The page classifier was over-aggressively flagging Mixed-PDF pages as needing OCR in three distinct cases. Each is fixed at the root in analyze_page_content / page_has_identity_h_no_tounicode / the looks_like_scan check.

1. has_vector_text false positives on dense layouts
   path_ops > text_ops*200 fired on pages with decorative paths (column borders, dividers) alongside real selectable text. Added a unique_alphanum_chars < 30 guard: real outlined-text pages have very few unique alphanum chars (each glyph is a path), while pages with real text + decorations have many.

2. Identity-H without ToUnicode flagged whole pages on supplementary fonts
   page_has_identity_h_no_tounicode would flag a page if any single Type0 font lacked ToUnicode and had no fallback CMap, even when the page's actual text came from other decodable fonts (Type1 with ToUnicode, etc.). Rewrote to track both undecodable Identity-H fonts AND other decodable fonts, only flagging when no decodable text font is present.

3. CID-encoded text with ToUnicode misclassified as scan
   looks_like_scan checked unique_alphanum_chars < 10 on raw string operand bytes. CID-encoded fonts (Type0 with ToUnicode) emit 2-byte CID values that aren't ASCII alphanum, so the metric is blind to them even when the text is fully decodable. Added a has_decodable_text_fonts signal: when a page has decodable fonts AND >= 10 text ops, the low alphanum count is treated as a CID encoding artifact rather than evidence of a scan.

Validated against a broad PDF corpus:
- 6 known false-positive pages now correctly classified as text
- 22 previously-missed scan pages (cover/blank/photo) now correctly flagged for OCR
- 0 regressions on truly-scanned PDFs (61/61 pages stay flagged)
- All 437 existing tests pass; clippy clean

Bumps NAPI package to 0.7.4.
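The scan override from the third fix can be sketched as a pure predicate. This is a minimal illustration, not the crate's code: the PageMetrics struct and its field layout are assumptions made for the example, and the real looks_like_scan presumably weighs more signals than the three named in the commit message.

```rust
/// Hypothetical per-page metrics for this sketch; only the three signals
/// named in the commit message are modelled.
#[derive(Debug, Clone, Copy)]
pub struct PageMetrics {
    /// Distinct ASCII alphanumeric bytes seen in raw string operands.
    pub unique_alphanum_chars: usize,
    /// Number of text-showing operators on the page.
    pub text_ops: usize,
    /// True when at least one font used on the page is decodable.
    pub has_decodable_text_fonts: bool,
}

/// Returns true when the page should be treated as a scan.
pub fn looks_like_scan(m: &PageMetrics) -> bool {
    // Raw-byte metric: blind to 2-byte CID values.
    let low_alphanum = m.unique_alphanum_chars < 10;
    // Decodable fonts plus enough text ops: the low count is an encoding
    // artifact, not evidence of a scan.
    let cid_artifact = m.has_decodable_text_fonts && m.text_ops >= 10;
    low_alphanum && !cid_artifact
}
```

A page with 3 unique alphanum bytes, 40 text ops and a decodable font is not flagged; the same page with only 4 text ops (a page-number overlay, say) still is.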
* test(detector): add unit tests for the three classifier fixes

Adds 10 unit tests covering the heuristic changes:
- has_vector_text alphanum guard
  - real text + decorative paths → not flagged
  - true outlined glyphs (low alphanum) → still flagged
- page_has_identity_h_no_tounicode supplementary-font handling
  - undecodable Identity-H + decodable Type1 → not flagged (new)
  - undecodable Identity-H alone → still flagged (regression)
- page_has_decodable_text_fonts (new helper)
  - Type1 → true
  - Type0 with ToUnicode → true
  - undecodable Identity-H only → false
- looks_like_scan with has_decodable_text_fonts override
  - CID-encoded decodable text → not flagged as scan
  - same metrics with no decodable fonts → still flagged
  - decodable fonts but text_ops < 10 (page-number overlay) → still flagged

* fix(detector): make decodable-font checks usage-based and XObject-aware

Addresses two reviewer concerns on the previous heuristic fix:

P1: resource-based check could create an inverse bug
page_has_identity_h_no_tounicode and page_has_decodable_text_fonts iterated all fonts in the page Resources dict, including unused fonts. A page whose actual text was rendered exclusively in an undecodable Identity-H font but whose Resources also listed an unused decodable Type1 would be wrongly unflagged.

Fix: parse Tf operator operands during content stream scanning to collect the set of font names actually referenced. The font checks now filter to only USED fonts via a new used_fonts_have_* family of functions operating on (used_font_names, font_map).

P2: checks didn't follow text into Form XObjects
analyze_page_content correctly recurses through Form XObjects via scan_xobjects_in_resources, but the font checks only looked at the page's top-level Resources/Font. Pages that render text through Form XObjects (corporate templates, header/footer overlays) had their XObject font resources missed entirely.
Fix: scan_xobjects_in_resources now propagates the used_font_names set AND collects fonts from each Form XObject's own Resources into the shared font_map. The usage-based check sees the full picture: page-level fonts + every nested XObject's fonts, intersected with fonts actually referenced by Tf operators anywhere in the content.

Implementation:
- New extract_font_name_before_tf helper (parses /Name immediately preceding Tf).
- New FontInfo struct caches font properties per-name.
- New collect_fonts_from_resource_dict + new used_fonts_have_* functions are pure filters over (used_names, font_map).
- analyze_page_content threads used_font_names + font_map through page content scan and XObject recursion, then runs the new checks.
- Old resource-based functions kept as #[cfg(test)] for the existing unit-test interface.
- Phase 3 uncached-page loop now goes through analyze_page_content so it also gets the usage-based + XObject-aware behavior.

Tests added (8):
- extract_font_name_before_tf basic + long-name parsing
- scan_content_for_text_operators collects used font names
- P1: unused decodable font in Resources doesn't save a page whose used font is undecodable
- P1: both fonts used → decodable font correctly prevents flag
- P2: decodable font inside Form XObject correctly unflags
- P2: undecodable font only in XObject still flags even with unused decodable font at page level
- P2: has_decodable_text_fonts populated from XObject fonts

Validation:
- 349 lib + 104 integration + 2 doc tests pass (was 341)
- cargo clippy --lib --bin detect-pdf -- -D warnings: clean
- External eval: 9/9 PDFs pass, 6/6 false positives resolved, 0 regressions, 61/61 scanned pages still correctly flagged
- No eval delta, which confirms the previous fix wasn't relying on the resource-based bug for any of the eval PDFs

* fix(detector): scope font lookups by ObjectId + handle indirect Form Resources

Addresses two more reviewer findings on the previous decodable-font commit.
P1: Resource-name scoping bug
The previous fix keyed used_font_names and font_map by raw resource names like b"F1". PDF resource names are scoped to each resource dictionary: a Form XObject can legally define its own /F1 that points to a completely different font from the page's /F1. Because collect_fonts_from_resource_dict skipped duplicates with `if font_map.contains_key(name)`, the first definition won and later Tf /F1 usages in different scopes resolved against the wrong font. This could reintroduce both the undecodable-Identity-H false flag and the decodable-CID false unflag depending on which side of the collision happened to be inserted first.

Fix: switch the lookup mechanism from font names to font ObjectIds.
- font_map: HashMap<ObjectId, FontInfo> (was Vec<u8> keys)
- used_font_ids: HashSet<ObjectId> (was Vec<u8> names)
- new resolve_font_names_to_ids() runs immediately after each content scan, against the resource dict in scope, to translate the per-scope name set into ObjectIds.

Each Form XObject's content stream now resolves /F1 against THAT XObject's own Resources, so name collisions are impossible by design. Inline (no-ID) font dicts are skipped: they are extremely rare in practice and have no stable key.

P2: Indirect Form /Resources skipped
scan_xobjects_in_resources used `.as_dict()` on the Form's /Resources entry, which returns None for indirect references. PDFs frequently store /Resources as `X 0 R`, in which case font collection and recursion were both skipped, even though the Tf usages inside the XObject content had already been recorded.

Fix: handle Object::Reference(r) in addition to Object::Dictionary(d) by resolving via doc.get_dictionary. Audited the rest of the file: the other /Resources access points (analyze_page_images, collect_images_from_resources) already handled both cases.

Tests added (4):
- P1 same-name-different-font (page undecodable, XObject decodable): must NOT flag, because the XObject's text is decodable in its own scope.
- P1 inverse (page decodable, XObject undecodable, content uses XObject /F1): MUST flag, because undecodable text exists in real scope.
- P2 indirect Form /Resources: font discovery must still work when /Resources is a `X 0 R` reference rather than inline.
- Combined regression: indirect Resources + name collision.

Validation:
- cargo test --release: 459 tests pass (353 lib + 104 integration + 2 doc)
- cargo clippy --lib --bin detect-pdf -- -D warnings: clean
- external eval (9 PDFs): 9/9 pass, 6/6 false positives resolved, 0 regressions, 61/61 truly-scanned pages still flagged

The behavior on the eval set is identical, which confirms the correctness fix isn't masking any change in classifier outcomes.

* fix(detector): respect resource shadowing when resolving page-content fonts

The previous ObjectId-based fix correctly scoped Form XObject fonts but still violated PDF resource inheritance for page content. When a page overrides /F1 from a parent /Pages node (different font dict for the same name), get_page_resources returns the page's own /Resources plus all ancestor /Resources dicts. The old code called resolve_font_names_to_ids on each one and added every match to used_font_ids; both font ObjectIds ended up in the used set even though only the page's /F1 is actually visible to that page's content.

Per ISO 32000-1 §7.7.3.4, resource names are inherited with shadowing semantics: the most-specific (deepest, closest to the page) definition wins.

Fix:
- New lookup_font_id helper resolves a single name in a single dict.
- New resolve_with_shadowing iterates names, checking the page's own /Resources first, then walking ancestors in most-specific-first order (which is the order lopdf's get_page_resources returns). First hit wins via a labeled `continue 'name`; subsequent ancestors are skipped for that name.
- analyze_page_content's flat resolution loop replaced with one call to resolve_with_shadowing.
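The shadowing rule can be sketched over plain maps. This is an illustration under stated assumptions, not the crate's implementation: FontDict is a hypothetical stand-in for a /Font resource dictionary, ObjectId is a plain (object number, generation) tuple, and the signatures of lookup_font_id and resolve_with_shadowing are guesses based only on the description above.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for a PDF object id: (object number, generation).
pub type ObjectId = (u32, u16);

/// Hypothetical model of one /Font dict: resource name -> font object id.
pub type FontDict = HashMap<String, ObjectId>;

/// Resolve a single name in a single dict.
pub fn lookup_font_id(dict: &FontDict, name: &str) -> Option<ObjectId> {
    dict.get(name).copied()
}

/// `scopes` must be ordered most-specific first: page, parent, ..., root.
pub fn resolve_with_shadowing(used_names: &[&str], scopes: &[FontDict]) -> HashSet<ObjectId> {
    let mut used_ids = HashSet::new();
    'name: for name in used_names {
        for dict in scopes {
            if let Some(id) = lookup_font_id(dict, name) {
                used_ids.insert(id);
                // First hit wins; ancestor definitions of this name are shadowed.
                continue 'name;
            }
        }
        // A name defined in no scope is simply left unresolved.
    }
    used_ids
}
```

With a page dict mapping F1 to (10, 0) and a parent dict mapping F1 to (20, 0) and F2 to (21, 0), resolving F1 and F2 yields (10, 0) and (21, 0): the parent's F1 never enters the used set.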
Audit:
- XObject path is correct: each Form XObject already resolves names against its OWN /Resources (XObjects don't inherit from page tree).
- font_map population is correct: keyed by ObjectId, so collecting from all dicts builds the full available-fonts catalog. The bug was only in the used-set resolution.
- Confirmed lopdf returns ancestors in most-specific-first order (page → parent → grandparent → root), matching the shadowing direction used here.

Tests added (3):
- page /F1 undecodable shadows parent's decodable /F1 → MUST flag
- page /F1 decodable shadows parent's undecodable /F1 → MUST NOT flag
- no override: page inherits parent's decodable /F1 → MUST NOT flag

Validation:
- cargo test --release: 462 tests pass (356 lib + 104 integration + 2 doc)
- cargo clippy --lib --bin detect-pdf -- -D warnings: clean
- external eval: 9/9, 0 regressions, 6/6 false positives resolved, 61/61 scanned pages still correctly flagged
…n' into feat/formula-latex-recovery
Adds extract_formulas_in_regions_as_latex (and NAPI export
extractFormulasInRegionsAsLatex) — converts the linearized text from
formula bboxes into LaTeX using positioned text data, with a
calibrated confidence score so callers can gate on quality.
Pipeline (per formula bbox):
1. Pull positioned text items inside the bbox via the existing
positioned-text extractor
2. Apply unicode → LaTeX char mapping (160+ symbols: Greek,
operators, relations) from a dedicated unicode_map module
3. Detect simple structure from item geometry — sub/superscripts
by font-size + y-baseline, basic two-row fractions
4. Score the result with positive points for clean conversion,
plus penalties for failure modes that produce broken output
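Step 2 of the pipeline can be sketched as a per-character lookup. The table below is a five-symbol illustrative subset, not the contents of the unicode_map module, and the names latex_for and map_to_latex are hypothetical. The space inserted after a control word is a design choice of this sketch so that output such as \alpha x stays valid LaTeX.

```rust
/// Tiny illustrative subset of a unicode -> LaTeX table (assumed entries).
pub fn latex_for(c: char) -> Option<&'static str> {
    match c {
        'α' => Some("\\alpha"),
        'ω' => Some("\\omega"),
        '≪' => Some("\\ll"),
        '≤' => Some("\\leq"),
        '∑' => Some("\\sum"),
        _ => None,
    }
}

/// Replace every mapped character with its LaTeX command, leaving the rest as-is.
pub fn map_to_latex(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        match latex_for(c) {
            Some(cmd) => {
                out.push_str(cmd);
                // A control word must not run into a following ASCII letter.
                if chars.peek().map_or(false, |n| n.is_ascii_alphabetic()) {
                    out.push(' ');
                }
            }
            None => out.push(c),
        }
    }
    out
}
```

For example, map_to_latex("region z ≪ 1.") gives `region z \ll 1.` and map_to_latex("dω") gives `d\omega`.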
Confidence is honest, not optimistic. Penalties applied:
- many items (>15) and very many (>25)
- 3+ distinct y-bands (multi-row display equations)
- fraction fired but denominator x-range much wider than
numerator (cross-equation false positive)
- fraction fired but denominator starts well to the left
(likely separate expression below)
- large operator (∫ ∑ ∏ √ etc) bigger than 1.3× median —
these need bounded-operator structure detection (Phase 2)
- mixed font sizes within a single y-band
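An additive-penalty score of this kind might look like the sketch below. Everything numeric here is a placeholder: the commit message does not state the weights or the base score. FormulaStats and score are hypothetical names, and the sketch assumes the >15 and >25 item penalties stack.

```rust
/// Hypothetical summary of one reconstructed formula.
#[derive(Debug, Default, Clone)]
pub struct FormulaStats {
    pub item_count: usize,
    pub y_bands: usize,
    pub denominator_much_wider: bool,
    pub denominator_starts_left: bool,
    pub oversized_operator: bool,
    pub mixed_sizes_in_band: bool,
}

/// Returns the confidence clamped to [0, 1] and the list of penalties that fired.
pub fn score(base: f64, s: &FormulaStats) -> (f64, Vec<String>) {
    // (condition, weight, label); the weights are made-up placeholders.
    let rules = [
        (s.item_count > 15, 0.15, "many items (>15)"),
        (s.item_count > 25, 0.15, "very many items (>25)"),
        (s.y_bands >= 3, 0.30, "3+ distinct y-bands"),
        (s.denominator_much_wider, 0.30, "denominator much wider than numerator"),
        (s.denominator_starts_left, 0.30, "denominator starts well to the left"),
        (s.oversized_operator, 0.25, "large operator above 1.3x median"),
        (s.mixed_sizes_in_band, 0.10, "mixed font sizes within a y-band"),
    ];
    let mut confidence = base;
    let mut breakdown = Vec::new();
    for (fired, weight, label) in rules {
        if fired {
            confidence -= weight;
            breakdown.push(format!("-{:.2} {}", weight, label));
        }
    }
    (confidence.clamp(0.0, 1.0), breakdown)
}
```

Returning the breakdown alongside the number keeps every deduction visible, so each penalty can be inspected and tuned independently.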
The high-confidence band (>0.85) on a formula-heavy academic test
PDF dropped from 81% → 38% after recalibration. Manual inspection
confirms the new high-confidence band contains structurally-correct
LaTeX only — no false positives. Mid (0.5-0.85) and low (<0.5)
bands hold the cases where structural reconstruction is uncertain
or broken; callers should fall back to OCR for those.
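A caller-side gate over these bands could be as small as the following sketch. Band, band and should_fall_back_to_ocr are hypothetical names; only the thresholds (>0.85 high, 0.5-0.85 mid, <0.5 low) come from the text above.

```rust
/// Confidence bands as described in the commit message.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Band {
    High,
    Mid,
    Low,
}

/// Classify a confidence score; 0.85 itself falls in the mid band.
pub fn band(confidence: f64) -> Band {
    if confidence > 0.85 {
        Band::High
    } else if confidence >= 0.5 {
        Band::Mid
    } else {
        Band::Low
    }
}

/// Only the high band is trusted; mid and low results go to OCR.
pub fn should_fall_back_to_ocr(confidence: f64) -> bool {
    band(confidence) != Band::High
}
```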
Also includes:
- probe-formulas-latex eval binary (compares raw text vs LaTeX
side-by-side with confidence breakdown for quality inspection)
- probe-formulas eval binary (validates raw extract API)
- 8 new unit tests for penalty calculation
- 15 + 10 unit tests for reconstruction and unicode mapping
Bumps NAPI package to 0.7.5. Builds on extractFormulasInRegions
from the previous formula-extraction feature.
The probe-formulas and probe-formulas-latex binaries were used to validate the LaTeX recovery during development but are not needed by consumers of the library. Removing them also drops the serde runtime dependency, which was only required by the probes. The probes can live as standalone scripts outside the published crate. No production behavior change. All 393 lib + 104 integration + 2 doc tests still pass; clippy clean.
Summary
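Adds `extractFormulasInRegionsAsLatex` (NAPI) and `extract_formulas_in_regions_as_latex` (Rust), which convert the linearized text from formula bboxes into LaTeX using positioned text data, with a calibrated confidence score so callers can gate on quality.

Builds on top of `extractFormulasInRegions` from #40. Bumps NAPI package to `0.7.5`.

Pipeline (per formula bbox)
1. Pull positioned text items inside the bbox via the existing positioned-text extractor
2. Apply unicode → LaTeX char mapping (160+ symbols: Greek, operators, relations) from a dedicated `unicode_map` module
3. Detect simple structure from item geometry: sub/superscripts by font-size + y-baseline, basic two-row fractions
4. Score the result with positive points for clean conversion, plus penalties for failure modes that produce broken output

Calibrated confidence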
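The calibration uses additive penalties (clamped to `[0, 1]`) that fire on the failure modes manual inspection surfaced:
- many items (>15) and very many (>25)
- 3+ distinct y-bands (multi-row display equations)
- fraction fired but denominator x-range much wider than numerator (cross-equation false positive)
- fraction fired but denominator starts well to the left (likely separate expression below)
- large operator (∫ ∑ ∏ √ etc) bigger than 1.3× median
- mixed font sizes within a single y-band

`FormulaResult.confidence_breakdown: Vec<String>` lists which penalties fired for each formula, which is useful for inspection and tuning.

Validation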
Tested against PP-DocLayoutV3's actual formula bboxes from a formula-heavy academic test PDF (74 formulas).

Manual inspection of the recalibrated bands:
- `dW1\ndω` → `\frac{dW1}{d\omega}`
- `region z ≪ 1.` → `region z \ll 1.`

Quality note
This is intended as a Phase 1 baseline. The current heuristic does not handle:
- matrices (`\begin{matrix}...\end{matrix}`)
- `\begin{align}` equations
- bounded operators (`\int_a^b`, `\sum_{i=0}^n`)
- sized delimiters (`\left( ... \right)`)

These are deliberately Phase 2 territory. They produce low-confidence output today so callers can route them to OCR.
API surface
NAPI mirror: `extractFormulasInRegionsAsLatex(buffer, pageRegions): PageFormulaLatexResults[]`.

Test plan
- `cargo clippy --lib --bin detect-pdf --bin pdf2md --bin dump_ops --bin probe-formulas --bin probe-formulas-latex -- -D warnings` clean
- `probe-formulas-latex` binary added for manual quality inspection (compares raw-text vs LaTeX side-by-side with penalty breakdown per formula)