perf: replace difflib with RapidFuzz for fuzzy alignment + interval tree merge #387
IgnatG wants to merge 1 commit into google:main from
❌ Infrastructure File Protection
This PR modifies protected infrastructure files:
Only repository maintainers are allowed to modify infrastructure files (including
Note: If these are only formatting changes, please:
If structural changes are necessary:
For more information, see our Contributing Guidelines.
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
acab577 to b51ef80
Your branch is 1 commit behind:
git fetch origin main
git merge origin/main
git push
Note: Enable "Allow edits by maintainers" to allow automatic updates.
Description
Replace `difflib.SequenceMatcher` with RapidFuzz (C++ backend) for fuzzy string alignment in `resolver.py`, and add a hybrid interval-tree merge path in `annotation.py` for large extraction sets.

Changes:
- `resolver.py`: Two-phase RapidFuzz alignment. Phase 1 uses `fuzz.partial_ratio_alignment()` for character-level matching with ±3 token refinement via `Indel.distance()`; Phase 2 falls back to a sliding-window approach with Counter pre-checks and `Indel.distance()` scoring.
- `annotation.py`: Hybrid merge strategy: flat O(n) scan for <200 extractions, `IntervalTree` with O(log n) overlap queries for ≥200.
- `pyproject.toml`: Added `rapidfuzz>=3.14.3` and `intervaltree>=3.2.1` as dependencies.

Benchmark results (OpenAI `gpt-4o-mini`, same text/prompt, Windows), `main` (difflib) vs. `feat/rapidfuzz`: ~35% faster (~3.7 s reduction), entirely in CPU-side alignment; extraction quality unchanged.
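For reviewers unfamiliar with the Phase 2 fallback, here is a minimal standard-library sketch of a sliding window with a Counter pre-check and Indel-based scoring. It is not the code in this PR: `indel_distance`, `best_window`, and the `threshold` default are hypothetical names and values chosen for illustration, and the pure-Python LCS stands in for RapidFuzz's `Indel.distance()`.

```python
from collections import Counter


def indel_distance(a, b):
    """Insertions plus deletions needed to turn a into b.

    Computed as len(a) + len(b) - 2 * LCS(a, b); a pure-Python stand-in
    for RapidFuzz's Indel.distance (assumption: same definition).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return len(a) + len(b) - 2 * prev[-1]


def best_window(source_tokens, query_tokens, threshold=0.75):
    """Return (start, end, score) for the best window, or None if none passes.

    Hypothetical helper: slides a window of len(query_tokens) over the source.
    """
    n = len(query_tokens)
    if n == 0 or len(source_tokens) < n:
        return None
    query_counts = Counter(query_tokens)
    best = None
    for start in range(len(source_tokens) - n + 1):
        window = source_tokens[start:start + n]
        # Cheap pre-check: the LCS cannot exceed the multiset overlap, so a
        # window whose overlap ratio is below the threshold cannot score
        # above it and the quadratic distance computation is skipped.
        overlap = sum((Counter(window) & query_counts).values())
        if overlap / n < threshold:
            continue
        # Normalized Indel similarity for two sequences of length n.
        score = 1 - indel_distance(window, query_tokens) / (2 * n)
        if score >= threshold and (best is None or score > best[2]):
            best = (start, start + n, score)
    return best
```

The sketch rebuilds the window Counter at every position for clarity; an incremental update as the window slides would avoid that cost.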
Fixes #386
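The hybrid merge strategy listed under Changes can be illustrated with a rough standard-library sketch. It assumes the merge keeps a candidate span only when it does not overlap an already accepted span, which is a guess at the semantics rather than something this description states. `merge_flat`, `merge_indexed`, and `merge` are hypothetical names, and a sorted list with `bisect` stands in for `IntervalTree`.

```python
import bisect

THRESHOLD = 200  # cut-over point named in the PR description


def overlaps(a, b):
    """Half-open (start, end) spans overlap if each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def merge_flat(accepted, candidates):
    """Linear scan per candidate; fine for small sets."""
    merged = list(accepted)
    for cand in candidates:
        if not any(overlaps(cand, kept) for kept in merged):
            merged.append(cand)
    return merged


def merge_indexed(accepted, candidates):
    """O(log n) overlap lookup per candidate on a sorted list.

    Assumes accepted spans are non-empty and mutually non-overlapping, so
    only the two neighbours of the insertion point can overlap a candidate.
    """
    merged = sorted(accepted)
    for cand in candidates:
        i = bisect.bisect_left(merged, cand)
        left_hit = i > 0 and overlaps(merged[i - 1], cand)
        right_hit = i < len(merged) and overlaps(merged[i], cand)
        if not (left_hit or right_hit):
            merged.insert(i, cand)  # list insert is O(n); a tree avoids this
    return merged


def merge(accepted, candidates, threshold=THRESHOLD):
    """Pick the flat path below the threshold and the indexed path otherwise."""
    total = len(accepted) + len(candidates)
    impl = merge_flat if total < threshold else merge_indexed
    return sorted(impl(accepted, candidates))
```

Both paths should return the same spans for the same input; only the lookup cost differs, which is why a size threshold can switch between them.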
Feature
How Has This Been Tested?
All 79 existing resolver tests pass without modification.
End-to-end benchmarks were also run on both branches (`main` with difflib and `feat/rapidfuzz`) with OpenAI `gpt-4o-mini` to confirm no regression in extraction quality and to measure the wall-time improvement. Entity count and extraction quality are identical across branches; the ~35% wall-time reduction is entirely in CPU-side alignment.
Checklist:
Contributing page, and I either signed the Google Individual CLA or am covered by my company's Corporate CLA.
`pylint` over the affected code.