
perf: replace difflib with RapidFuzz for fuzzy alignment + interval tree merge #387

Open
IgnatG wants to merge 1 commit into google:main from IgnatG:feat/rapidfuzz

Conversation

@IgnatG IgnatG commented Feb 20, 2026

Description

Replace difflib.SequenceMatcher with RapidFuzz (C++ backend) for fuzzy string alignment in resolver.py, and add a hybrid interval-tree merge path in annotation.py for large extraction sets.
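To make the hybrid merge idea concrete, here is a minimal sketch rather than the PR's actual code. It assumes extractions reduce to half-open `(start, end)` character spans, that earlier spans win over later overlapping ones, and that the 200 threshold from the change list applies to the total span count. All function names are hypothetical. To stay dependency-free, the large-set path uses a sorted list with `bisect` as a stand-in for `IntervalTree`; that shortcut is only valid because the kept spans never overlap each other.

```python
from bisect import bisect_right

FLAT_SCAN_LIMIT = 200  # threshold quoted in the PR description

Span = tuple[int, int]  # assumed shape: half-open (start, end) character span


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def merge_flat(passes: list[list[Span]]) -> list[Span]:
    """Flat path: compare every candidate against every span kept so far."""
    kept: list[Span] = []
    for spans in passes:
        for span in spans:
            if not any(overlaps(span, k) for k in kept):
                kept.append(span)
    return kept


def merge_indexed(passes: list[list[Span]]) -> list[Span]:
    """Indexed path (stdlib stand-in for IntervalTree, not the PR's code).

    Kept spans are pairwise disjoint, so a candidate can only collide with
    its immediate neighbours in start order; bisect finds them in O(log n).
    """
    starts: list[int] = []
    ends: list[int] = []
    kept: list[Span] = []
    for spans in passes:
        for start, end in spans:
            i = bisect_right(starts, start)
            hits_prev = i > 0 and ends[i - 1] > start
            hits_next = i < len(starts) and starts[i] < end
            if not (hits_prev or hits_next):
                starts.insert(i, start)
                ends.insert(i, end)
                kept.append((start, end))
    return kept


def merge_spans(passes: list[list[Span]]) -> list[Span]:
    """Pick the flat scan for small inputs and the indexed path otherwise."""
    total = sum(len(p) for p in passes)
    if total < FLAT_SCAN_LIMIT:
        return merge_flat(passes)
    return merge_indexed(passes)
```

Both paths are meant to return the same spans in the same order, so the threshold only affects speed, not results. A real interval tree would also handle sets where stored intervals overlap each other, which this stand-in does not.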

Changes:

  • resolver.py: Two-phase RapidFuzz alignment — Phase 1 uses fuzz.partial_ratio_alignment() for character-level matching with ±3 token refinement via Indel.distance(); Phase 2 falls back to a sliding-window approach with Counter pre-checks and Indel.distance() scoring.
  • annotation.py: Hybrid merge strategy — flat O(n) scan for <200 extractions, IntervalTree with O(log n) overlap queries for ≥200.
  • pyproject.toml: Added rapidfuzz>=3.14.3 and intervaltree>=3.2.1 as dependencies.
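The Phase 2 fallback in the resolver.py bullet above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: it slides a fixed-size window of tokens, uses a `Counter` intersection as the cheap pre-check, and scores survivors with a normalized Indel similarity. The pure-Python `indel_distance` stands in for RapidFuzz's `Indel.distance()` so the example runs without extra dependencies; `best_window`, `min_overlap`, and `threshold` are hypothetical names and values.

```python
from collections import Counter


def indel_distance(a: str, b: str) -> int:
    """Insertions plus deletions to turn a into b: len(a) + len(b) - 2 * LCS.

    Pure-Python stand-in for rapidfuzz's Indel.distance (assumption: same
    definition, far slower).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]


def best_window(source_tokens, query_tokens, min_overlap=0.5, threshold=0.75):
    """Return (start, end, score) for the best token window, or None.

    min_overlap and threshold are illustrative values, not the PR's.
    """
    n = len(query_tokens)
    if n == 0 or n > len(source_tokens):
        return None
    query = " ".join(query_tokens)
    query_counts = Counter(query_tokens)
    best = None
    for start in range(len(source_tokens) - n + 1):
        window = source_tokens[start:start + n]
        # Cheap pre-check: skip windows that share too few tokens with the query.
        shared = sum((Counter(window) & query_counts).values())
        if shared / n < min_overlap:
            continue
        text = " ".join(window)
        # Normalized Indel similarity in [0, 1]; 1.0 means identical strings.
        score = 1 - indel_distance(query, text) / (len(query) + len(text))
        if score >= threshold and (best is None or score > best[2]):
            best = (start, start + n, score)
    return best


source = "the patient takes aspirin daily for pain".split()
match = best_window(source, "takes asprin daily".split())
print(match)  # a (start, end, score) tuple covering the misspelled phrase
```

The pre-check matters because the character-level score is the expensive step: windows that share almost no tokens with the query are discarded before any distance is computed.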

Benchmark results (OpenAI gpt-4o-mini, same text/prompt, Windows):

| Branch | Runs | Avg Wall-Time | Entities Found |
|---|---|---|---|
| main (difflib) | 2 | 10.54 s | 7 |
| feat/rapidfuzz | 4 | 6.83 s | 7 |

~35% faster (~3.7 s reduction), entirely in CPU-side alignment — extraction quality unchanged.
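The headline figures follow directly from the two averages in the table. A small check, using only the reported numbers (the helper name is hypothetical):

```python
def wall_time_change(baseline_s: float, candidate_s: float):
    """Return (seconds saved, percent reduction, speed ratio) for two timings."""
    saved = baseline_s - candidate_s
    return saved, 100 * saved / baseline_s, baseline_s / candidate_s


saved, pct, ratio = wall_time_change(10.54, 6.83)
print(f"{saved:.2f} s saved, {pct:.1f}% less wall time, {ratio:.2f}x speed ratio")
```

Note that "35% faster" here means a 35% reduction in wall time, which corresponds to a speed ratio of roughly 1.5x.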

Fixes #386

Feature

How Has This Been Tested?

All 79 existing resolver tests pass without modification.

End-to-end benchmarks were also run on both branches with OpenAI gpt-4o-mini to confirm no regression in extraction quality and to measure the wall-time improvement:

| Branch | Runs | Avg Wall-Time | Entities Found |
|---|---|---|---|
| main (difflib) | 2 | 10.54 s | 7 |
| feat/rapidfuzz | 4 | 6.83 s | 7 |

Entity count and extraction quality are identical across branches; the ~35% wall-time reduction is entirely in CPU-side alignment.

Checklist:

  • I have read and acknowledged Google's Open Source Code of Conduct.
  • I have read the Contributing page, and I either signed the Google Individual CLA or am covered by my company's Corporate CLA.
  • I have discussed my proposed solution with code owners in the linked issue(s) and we have agreed upon the general approach.
  • I have made any needed documentation changes, or noted in the linked issue(s) that documentation elsewhere needs updating.
  • I have added tests, or I have ensured existing tests cover the changes.
  • I have followed Google's Python Style Guide and ran pylint over the affected code.

@github-actions github-actions bot added the size/XL Pull request with over 1000 lines changed - too large label Feb 20, 2026
@github-actions

Infrastructure File Protection

This PR modifies protected infrastructure files:

  • pyproject.toml (2 changes)

Only repository maintainers are allowed to modify infrastructure files (including .github/, build configuration, and repository documentation).

Note: If these are only formatting changes, please:

  1. Revert changes to .github/ files
  2. Use ./autoformat.sh to format only source code directories
  3. Avoid running formatters on infrastructure files

If structural changes are necessary:

  1. Open an issue describing the needed infrastructure changes
  2. A maintainer will review and implement the changes if approved

For more information, see our Contributing Guidelines.

@google-cla

google-cla bot commented Feb 20, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@github-actions github-actions bot added size/M Pull request with 150-600 lines changed and removed size/XL Pull request with over 1000 lines changed - too large labels Feb 20, 2026
@IgnatG IgnatG changed the title perf: Replace difflib with RapidFuzz for fuzzy alignment + interval tree merge perf: replace difflib with RapidFuzz for fuzzy alignment + interval tree merge Feb 20, 2026
@github-actions

github-actions bot commented Mar 5, 2026

⚠️ Branch Update Required

Your branch is 1 commit behind main. Please update your branch to ensure CI checks run with the latest code:

git fetch origin main
git merge origin/main
git push

Note: Enable "Allow edits by maintainers" to allow automatic updates.


Labels

size/M Pull request with 150-600 lines changed

Development

Successfully merging this pull request may close these issues.

Performance: Replace difflib with RapidFuzz for fuzzy alignment and add interval tree for merge
