Skip to content

feat: heuristic LaTeX recovery for formula regions (0.7.5)#42

Open
abimaelmartell wants to merge 4 commits into
abimaelmartell/formula-extractionfrom
feat/formula-latex-recovery
Open

feat: heuristic LaTeX recovery for formula regions (0.7.5)#42
abimaelmartell wants to merge 4 commits into
abimaelmartell/formula-extractionfrom
feat/formula-latex-recovery

Conversation

@abimaelmartell

Copy link
Copy Markdown
Member

Summary

Adds extractFormulasInRegionsAsLatex (NAPI) and extract_formulas_in_regions_as_latex (Rust) — converts the linearized text from formula bboxes into LaTeX using positioned text data, with a calibrated confidence score so callers can gate on quality.

Builds on top of extractFormulasInRegions from #40. Bumps NAPI package to 0.7.5.

Pipeline (per formula bbox)

  1. Pull positioned text items inside the bbox via the existing positioned-text extractor
  2. Apply Unicode → LaTeX char mapping (160+ symbols: Greek, operators, relations) via a dedicated unicode_map module
  3. Detect simple structure from item geometry — sub/superscripts via font-size + y-baseline, basic two-row fractions
  4. Score the result with positive points for clean conversion, plus penalties for failure modes that produce broken output

Calibrated confidence

The calibration uses additive penalties (clamped to [0, 1]) that fire on the failure modes manual inspection surfaced:

Penalty Trigger Value
MANY_ITEMS item_count > 15 -0.20
VERY_MANY_ITEMS item_count > 25 (additional) -0.10
MULTI_BAND 3+ distinct y-clusters -0.30
FRAC_MISMATCH (width) fraction fired AND denominator x-range > 1.5× numerator -0.30
FRAC_LEFT_SHIFT fraction fired AND denominator starts > 0.5× numer-width to the left -0.30
HUGE_OPERATOR item with font_size > 1.3× median AND char in {∫ ∑ ∏ √ ∮ ∯ ∰} -0.30
FONT_SIZE_VARIANCE items in same y-band have > 2 distinct font sizes -0.15

FormulaResult.confidence_breakdown: Vec<String> lists which penalties fired for each formula — useful for inspection and tuning.

Validation

Tested against PP-DocLayoutV3's actual formula bboxes from a formula-heavy academic test PDF (74 formulas).

Confidence band Phase 1 (positive points only) After calibration
HIGH (> 0.85) 60 (81%) 28 (37.8%)
MID (0.5 - 0.85) 16 (22%) 12 (16.2%)
LOW (< 0.5) ~24 34 (45.9%)
Mean confidence ~0.82 0.536

Manual inspection of the recalibrated bands:

  • HIGH — all spot-checked entries produce structurally-correct LaTeX. Example: dW1\ndω\frac{dW1}{d\omega}, region z ≪ 1.region z \ll 1.
  • MID — Greek/operator mapping correct, structural reconstruction uncertain. Caller-dependent: a strict orchestrator can fall back to OCR; a permissive one can use the LaTeX as-is.
  • LOW — multi-row display equations where linearization fails. Should fall back to OCR. Examples include 8-band mega-formulas where even raw text is jumbled.

Quality note

This is intended as a Phase 1 baseline. The current heuristic does not handle:

  • Matrices (\begin{matrix}...\end{matrix})
  • Multi-line \begin{align} equations
  • Bounded operators (\int_a^b, \sum_{i=0}^n)
  • Nested fractions
  • Delimiter pairs (\left( ... \right))

These are deliberately Phase 2 territory. They produce low-confidence output today so callers can route them to OCR.

API surface

pub fn extract_formulas_in_regions_as_latex(
    buffer: &[u8],
    page_regions: &[(u32, Vec<[f32; 4]>)],
) -> Result<Vec<PageFormulaResult>, PdfError>;

pub struct FormulaResult {
    pub latex: String,
    pub raw_text: String,
    pub confidence: f32,
    pub needs_ocr: bool,
    pub confidence_breakdown: Vec<String>,
}

NAPI mirror: extractFormulasInRegionsAsLatex(buffer, pageRegions): PageFormulaLatexResults[].

Test plan

  • All previously-passing tests still pass — 393 lib + 104 integration + 2 doc
  • 33 new unit tests — 8 penalty tests + 15 reconstruction tests + 10 unicode-map tests
  • cargo clippy --lib --bin detect-pdf --bin pdf2md --bin dump_ops --bin probe-formulas --bin probe-formulas-latex -- -D warnings clean
  • probe-formulas-latex binary added for manual quality inspection (compares raw-text vs LaTeX side-by-side with penalty breakdown per formula)

…ntary fonts (0.7.4) (#41)

* fix(detector): correct false flags for CID-encoded text and supplementary fonts (0.7.4)

The page classifier was over-aggressively flagging Mixed-PDF pages as
needing OCR in three distinct cases. Each is fixed at the root in
analyze_page_content / page_has_identity_h_no_tounicode / the
looks_like_scan check.

1. has_vector_text false positives on dense layouts
   path_ops > text_ops*200 fired on pages with decorative paths
   (column borders, dividers) alongside real selectable text. Added
   a unique_alphanum_chars < 30 guard: real outlined-text pages have
   very few unique alphanum chars (each glyph is a path), while
   pages with real text + decorations have many.

2. Identity-H without ToUnicode flagged whole pages on supplementary fonts
   page_has_identity_h_no_tounicode would flag a page if any single
   Type0 font lacked ToUnicode and had no fallback CMap, even when
   the page's actual text came from other decodable fonts (Type1
   with ToUnicode, etc.). Rewrote to track both undecodable
   Identity-H fonts AND other decodable fonts, only flagging when
   no decodable text font is present.

3. CID-encoded text with ToUnicode misclassified as scan
   looks_like_scan checked unique_alphanum_chars < 10 on raw string
   operand bytes. CID-encoded fonts (Type0 with ToUnicode) emit
   2-byte CID values that aren't ASCII alphanum, so the metric is
   blind to them even when the text is fully decodable. Added a
   has_decodable_text_fonts signal: when a page has decodable fonts
   AND >= 10 text ops, the low alphanum count is treated as a CID
   encoding artifact rather than evidence of a scan.

Validated against a broad PDF corpus:
- 6 known false-positive pages now correctly classified as text
- 22 previously-missed scan pages (cover/blank/photo) now correctly
  flagged for OCR
- 0 regressions on truly-scanned PDFs (61/61 pages stay flagged)
- All 437 existing tests pass; clippy clean

Bumps NAPI package to 0.7.4.

* test(detector): add unit tests for the three classifier fixes

Adds 10 unit tests covering the heuristic changes:

- has_vector_text alphanum guard
  - real text + decorative paths → not flagged
  - true outlined glyphs (low alphanum) → still flagged

- page_has_identity_h_no_tounicode supplementary-font handling
  - undecodable Identity-H + decodable Type1 → not flagged (new)
  - undecodable Identity-H alone → still flagged (regression)

- page_has_decodable_text_fonts (new helper)
  - Type1 → true
  - Type0 with ToUnicode → true
  - undecodable Identity-H only → false

- looks_like_scan with has_decodable_text_fonts override
  - CID-encoded decodable text → not flagged as scan
  - same metrics with no decodable fonts → still flagged
  - decodable fonts but text_ops < 10 (page-number overlay) → still flagged

* fix(detector): make decodable-font checks usage-based and XObject-aware

Addresses two reviewer concerns on the previous heuristic fix:

P1 — resource-based check could create an inverse bug
  page_has_identity_h_no_tounicode and page_has_decodable_text_fonts
  iterated all fonts in the page Resources dict, including unused fonts.
  A page whose actual text was rendered exclusively in an undecodable
  Identity-H font but whose Resources also listed an unused decodable
  Type1 would be wrongly unflagged.

  Fix: parse Tf operator operands during content stream scanning to
  collect the set of font names actually referenced. The font checks
  now filter to only USED fonts via a new used_fonts_have_*
  family of functions operating on (used_font_names, font_map).

P2 — checks didn't follow text into Form XObjects
  analyze_page_content correctly recurses through Form XObjects via
  scan_xobjects_in_resources, but the font checks only looked at the
  page's top-level Resources/Font. Pages that render text through Form
  XObjects (corporate templates, header/footer overlays) had their
  XObject font resources missed entirely.

  Fix: scan_xobjects_in_resources now propagates the used_font_names
  set AND collects fonts from each Form XObject's own Resources into
  the shared font_map. The usage-based check sees the full picture:
  page-level fonts + every nested XObject's fonts, intersected with
  fonts actually referenced by Tf operators anywhere in the content.

Implementation:
- New extract_font_name_before_tf helper (parses /Name immediately
  preceding Tf).
- New FontInfo struct caches font properties per-name.
- New collect_fonts_from_resource_dict + new used_fonts_have_*
  functions are pure filters over (used_names, font_map).
- analyze_page_content threads used_font_names + font_map through
  page content scan and XObject recursion, then runs the new checks.
- Old resource-based functions kept as #[cfg(test)] for the existing
  unit-test interface.
- Phase 3 uncached-page loop now goes through analyze_page_content
  so it also gets the usage-based + XObject-aware behavior.

Tests added (8):
  - extract_font_name_before_tf basic + long-name parsing
  - scan_content_for_text_operators collects used font names
  - P1 — unused decodable font in Resources doesn't save a page
    whose used font is undecodable
  - P1 — both fonts used → decodable font correctly prevents flag
  - P2 — decodable font inside Form XObject correctly unflags
  - P2 — undecodable font only in XObject still flags even with
    unused decodable font at page level
  - P2 — has_decodable_text_fonts populated from XObject fonts

Validation:
- 349 lib + 104 integration + 2 doc tests pass (was 341)
- cargo clippy --lib --bin detect-pdf -- -D warnings: clean
- External eval: 9/9 PDFs pass, 6/6 false positives resolved,
  0 regressions, 61/61 scanned pages still correctly flagged
- No eval delta — confirms previous fix wasn't relying on the
  resource-based bug for any of the eval PDFs

* fix(detector): scope font lookups by ObjectId + handle indirect Form Resources

Addresses two more reviewer findings on the previous decodable-font commit.

P1 — Resource-name scoping bug
  The previous fix keyed used_font_names and font_map by raw resource
  names like b"F1". PDF resource names are scoped to each resource
  dictionary: a Form XObject can legally define its own /F1 that points
  to a completely different font from the page's /F1. Because
  collect_fonts_from_resource_dict skipped duplicates with
  `if font_map.contains_key(name)`, the first definition won and later
  Tf /F1 usages in different scopes resolved against the wrong font.
  This could reintroduce both the undecodable-Identity-H false flag
  and the decodable-CID false unflag depending on which side of the
  collision happened to be inserted first.

  Fix: switch the lookup mechanism from font names to font ObjectIds.
    - font_map: HashMap<ObjectId, FontInfo>  (was Vec<u8> keys)
    - used_font_ids: HashSet<ObjectId>       (was Vec<u8> names)
    - new resolve_font_names_to_ids() runs immediately after each
      content scan, against the resource dict in scope, to translate
      the per-scope name set into ObjectIds.
  Each Form XObject's content stream now resolves /F1 against THAT
  XObject's own Resources, so name collisions are impossible by design.
  Inline (no-ID) font dicts are skipped — extremely rare in practice
  and have no stable key.

P2 — Indirect Form /Resources skipped
  scan_xobjects_in_resources used `.as_dict()` on the Form's /Resources
  entry, which returns None for indirect references. PDFs frequently
  store /Resources as `X 0 R`, in which case font collection and
  recursion were both skipped — even though the Tf usages inside the
  XObject content had already been recorded.

  Fix: handle Object::Reference(r) in addition to Object::Dictionary(d)
  by resolving via doc.get_dictionary. Audited the rest of the file —
  the other /Resources access points (analyze_page_images,
  collect_images_from_resources) already handled both cases.

Tests added (4):
  - P1 same-name-different-font (page undecodable, XObject decodable):
    must NOT flag — XObject's text is decodable in its own scope.
  - P1 inverse (page decodable, XObject undecodable, content uses
    XObject /F1): MUST flag — undecodable text exists in real scope.
  - P2 indirect Form /Resources: font discovery must still work when
    /Resources is a `X 0 R` reference rather than inline.
  - Combined regression: indirect Resources + name collision.

Validation:
  - cargo test --release: 459 tests pass (353 lib + 104 integration + 2 doc)
  - cargo clippy --lib --bin detect-pdf -- -D warnings: clean
  - external eval (9 PDFs): 9/9 pass, 6/6 false positives resolved,
    0 regressions, 61/61 truly-scanned pages still flagged

The behavior on the eval set is identical — confirms the correctness
fix isn't masking any change in classifier outcomes.

* fix(detector): respect resource shadowing when resolving page-content fonts

The previous ObjectId-based fix correctly scoped Form XObject fonts
but still violated PDF resource inheritance for page content. When a
page overrides /F1 from a parent /Pages node (different font dict for
the same name), get_page_resources returns the page's own /Resources
plus all ancestor /Resources dicts. The old code called
resolve_font_names_to_ids on each one and added every match to
used_font_ids — both font ObjectIds ended up in the used set even
though only the page's /F1 is actually visible to that page's content.

Per ISO 32000-1 §7.7.3.4, resource names are inherited with
shadowing semantics: the most-specific (deepest, closest to the page)
definition wins.

Fix:
- New lookup_font_id helper resolves a single name in a single dict.
- New resolve_with_shadowing iterates names, checking the page's own
  /Resources first, then walking ancestors in most-specific-first
  order (which is the order lopdf's get_page_resources returns).
  First hit wins via a labeled `continue 'name` — subsequent
  ancestors are skipped for that name.
- analyze_page_content's flat resolution loop replaced with one call
  to resolve_with_shadowing.

Audit:
- XObject path is correct: each Form XObject already resolves names
  against its OWN /Resources (XObjects don't inherit from page tree).
- font_map population is correct: keyed by ObjectId, so collecting
  from all dicts builds the full available-fonts catalog. The bug
  was only in the used-set resolution.
- Confirmed lopdf returns ancestors in most-specific-first order
  (page → parent → grandparent → root), matching the shadowing
  direction used here.

Tests added (3):
  - page /F1 undecodable shadows parent's decodable /F1 → MUST flag
  - page /F1 decodable shadows parent's undecodable /F1 → MUST NOT flag
  - no override: page inherits parent's decodable /F1 → MUST NOT flag

Validation:
  - cargo test --release: 462 tests pass (356 lib + 104 integration + 2 doc)
  - cargo clippy --lib --bin detect-pdf -- -D warnings: clean
  - external eval: 9/9, 0 regressions, 6/6 false positives resolved,
    61/61 scanned pages still correctly flagged
Adds extract_formulas_in_regions_as_latex (and NAPI export
extractFormulasInRegionsAsLatex) — converts the linearized text from
formula bboxes into LaTeX using positioned text data, with a
calibrated confidence score so callers can gate on quality.

Pipeline (per formula bbox):
  1. Pull positioned text items inside the bbox via the existing
     positioned-text extractor
  2. Apply unicode → LaTeX char mapping (160+ symbols: Greek,
     operators, relations) from a dedicated unicode_map module
  3. Detect simple structure from item geometry — sub/superscripts
     by font-size + y-baseline, basic two-row fractions
  4. Score the result with positive points for clean conversion,
     plus penalties for failure modes that produce broken output

Confidence is honest, not optimistic. Penalties applied:
  - many items (>15) and very many (>25)
  - 3+ distinct y-bands (multi-row display equations)
  - fraction fired but denominator x-range much wider than
    numerator (cross-equation false positive)
  - fraction fired but denominator starts well to the left
    (likely separate expression below)
  - large operator (∫ ∑ ∏ √ etc) bigger than 1.3× median —
    these need bounded-operator structure detection (Phase 2)
  - mixed font sizes within a single y-band

The high-confidence band (>0.85) on a formula-heavy academic test
PDF dropped from 81% → 38% after recalibration. Manual inspection
confirms the new high-confidence band contains structurally-correct
LaTeX only — no false positives. Mid (0.5-0.85) and low (<0.5)
bands hold the cases where structural reconstruction is uncertain
or broken; callers should fall back to OCR for those.

Also includes:
  - probe-formulas-latex eval binary (compares raw text vs LaTeX
    side-by-side with confidence breakdown for quality inspection)
  - probe-formulas eval binary (validates raw extract API)
  - 8 new unit tests for penalty calculation
  - 15 + 10 unit tests for reconstruction and unicode mapping

Bumps NAPI package to 0.7.5. Builds on extractFormulasInRegions
from the previous formula-extraction feature.
The probe-formulas and probe-formulas-latex binaries were used to
validate the LaTeX recovery during development but are not needed
by consumers of the library. Removing them also drops the serde
runtime dependency, which was only required by the probes.

The probes can live as standalone scripts outside the published
crate.

No production behavior change. All 393 lib + 104 integration + 2 doc
tests still pass; clippy clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant