fix(analyzer): restore Features fast path and add ComputeDistanceAndSimilarity (PR A, fixes #315) #453

Open
CatfishGG wants to merge 7 commits into ludo-technologies:main from CatfishGG:fix/tfidf-clone-detection-v2

@CatfishGG

This is PR A of the split proposed in the original PR #452 (comment 4487566576).

Yoda findings addressed (#3 and #4)

Finding #4: Features fast path lost
SyntacticSimilarityAnalyzer.ComputeSimilarity always re-extracted features via tree traversal, even when both fragments already had pre-computed Features populated in prepareFragments. Restored the short-circuit that uses jaccardSimilarity(f1.Features, f2.Features) directly when features are available.
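
For readers following along, here is a minimal sketch of what such a fast path might look like. It is not the actual pyscn code: the `CodeFragment` shape, the `Tokens` field, `extractFeatures`, the `extractCalls` counter, and the choice to treat two empty feature sets as identical are all assumptions made for illustration. Only the names `Features`, `ComputeSimilarity`, and `jaccardSimilarity` come from the description above.

```go
package main

// CodeFragment is a hypothetical stand-in for the analyzer's fragment type.
type CodeFragment struct {
	Tokens   []string // stand-in for the AST that would normally be traversed
	Features []string // pre-computed features (e.g. filled by prepareFragments)
}

// extractCalls counts slow-path extractions so the short-circuit is observable.
var extractCalls int

// extractFeatures stands in for the expensive tree traversal (assumption).
func extractFeatures(f *CodeFragment) []string {
	extractCalls++
	return f.Tokens
}

// jaccardSimilarity returns |A ∩ B| / |A ∪ B| over distinct features.
// Two empty sets are treated as identical here, which is an assumption.
func jaccardSimilarity(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, f := range a {
		setA[f] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, f := range b {
		setB[f] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1.0
	}
	inter := 0
	for f := range setA {
		if _, ok := setB[f]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

// ComputeSimilarity uses pre-computed features when both fragments have them
// and only falls back to extraction otherwise.
func ComputeSimilarity(f1, f2 *CodeFragment) float64 {
	if f1 == nil || f2 == nil {
		return 0
	}
	if len(f1.Features) > 0 && len(f2.Features) > 0 {
		return jaccardSimilarity(f1.Features, f2.Features) // fast path
	}
	return jaccardSimilarity(extractFeatures(f1), extractFeatures(f2))
}
```

In this sketch, comparing two fragments that both carry `Features` never touches `extractFeatures`, which is the behavior the restored short-circuit is meant to guarantee.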

Finding #3: APTED runs twice per pair
compareFragmentsWithClassifier and compareWithAPTED called ComputeDistance and ComputeSimilarity separately. Since ComputeSimilarityTrees internally calls ComputeDistanceTrees, every fragment pair paid 2x APTED cost. Added ComputeDistanceAndSimilarity to the SimilarityAnalyzer interface and updated both methods to use a single combined call.
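
The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the real interface: the method signatures, the `toyAnalyzer` type, its positional-mismatch "edit distance" (a cheap stand-in for APTED), and the `1 - d/max(len)` normalization are all hypothetical. The point is only that the combined method runs the expensive distance once and derives similarity from the result.

```go
package main

// CodeFragment is a hypothetical stand-in; Tokens replaces the real tree.
type CodeFragment struct {
	Tokens []string
}

// SimilarityAnalyzer mirrors the shape described above; signatures are assumed.
type SimilarityAnalyzer interface {
	ComputeDistance(f1, f2 *CodeFragment) float64
	ComputeSimilarity(f1, f2 *CodeFragment) float64
	ComputeDistanceAndSimilarity(f1, f2 *CodeFragment) (distance, similarity float64)
}

// toyAnalyzer counts how often the expensive distance routine runs.
type toyAnalyzer struct {
	editRuns int
}

// editDistance is a cheap stand-in for APTED: positional mismatches plus the
// length difference. It is not a real tree edit distance.
func (a *toyAnalyzer) editDistance(f1, f2 *CodeFragment) float64 {
	a.editRuns++
	short, long := f1.Tokens, f2.Tokens
	if len(short) > len(long) {
		short, long = long, short
	}
	d := len(long) - len(short)
	for i := range short {
		if short[i] != long[i] {
			d++
		}
	}
	return float64(d)
}

// similarityFromDistance normalizes by the larger fragment (assumed formula).
func similarityFromDistance(d float64, f1, f2 *CodeFragment) float64 {
	m := len(f1.Tokens)
	if len(f2.Tokens) > m {
		m = len(f2.Tokens)
	}
	if m == 0 {
		return 1
	}
	return 1 - d/float64(m)
}

func (a *toyAnalyzer) ComputeDistance(f1, f2 *CodeFragment) float64 {
	return a.editDistance(f1, f2)
}

// ComputeSimilarity recomputes the distance internally, which is why calling
// it after ComputeDistance pays the cost twice.
func (a *toyAnalyzer) ComputeSimilarity(f1, f2 *CodeFragment) float64 {
	return similarityFromDistance(a.editDistance(f1, f2), f1, f2)
}

// ComputeDistanceAndSimilarity runs the distance once and reuses it.
func (a *toyAnalyzer) ComputeDistanceAndSimilarity(f1, f2 *CodeFragment) (float64, float64) {
	d := a.editDistance(f1, f2)
	return d, similarityFromDistance(d, f1, f2)
}
```

With this shape, a caller that needs both values makes one call per fragment pair instead of two, and both paths return the same numbers.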

Changes

  • Restore Features fast path in SyntacticSimilarityAnalyzer.ComputeSimilarity
  • Add ComputeDistanceAndSimilarity(CodeFragment) to SimilarityAnalyzer interface
  • Implement for all analyzers: APTED (single-traversal Trees version), Structural, Syntactic, Textual, Semantic
  • Update clone_detector.go compareFragmentsWithClassifier and compareWithAPTED to use the combined method

NOT addressed (deferred to PR B)

  • Findings #1 and #2 (TF-IDF wiring)

Testing

All tests pass. go vet and gofmt clean.

cc @DaisukeYoda

@CatfishGG

Author

Hey @DaisukeYoda, this is PR A addressing findings #3 and #4 from your review comment. Finding #4 (Features fast path) is restored with a targeted commit. Finding #3 (double APTED run) is fixed by adding ComputeDistanceAndSimilarity to the SimilarityAnalyzer interface and using it in the two comparison methods. Findings #1 and #2 (TF-IDF wiring) are deferred to PR B.

@DaisukeYoda DaisukeYoda self-requested a review May 20, 2026 07:55

@DaisukeYoda DaisukeYoda left a comment

Member

@CatfishGG

This PR accidentally reverts the source extraction perf optimization from PR #438 (commit 8655810). Bench impact is roughly 200x slowdown (30µs → 6ms/op).

You can re-apply just the affected portion with:

git show 8655810 -- internal/analyzer/clone_detector.go | git apply

The rest of the PR looks good, thanks!

@CatfishGG CatfishGG force-pushed the fix/tfidf-clone-detection-v2 branch from 93f40c4 to 0137a8b Compare May 21, 2026 10:06
@CatfishGG

Author

@DaisukeYoda I forgot to ping you, but the revert has been reverted!

@DaisukeYoda

Member

@CatfishGG Thanks! Please resolve the conflict and then I can approve this PR.

CatfishGG pushed a commit to CatfishGG/pyscn that referenced this pull request May 21, 2026
… add ComputeDistanceAndSimilarity, fix apted.go conflict with ComputeDistanceTrees/SimilarityTrees naming
@CatfishGG CatfishGG force-pushed the fix/tfidf-clone-detection-v2 branch from 60821e8 to 3b239c3 Compare May 28, 2026 12:00
minitester added 7 commits May 28, 2026 17:33
Covers: NewTFIDFCalculator, IDF, ComputeIDF (cache hit/miss), ToWeightedVector,
CosineSimilarity (empty, identical, orthogonal, partial overlap, zero norm)
…ble APTED runs

- Add ComputeDistanceAndSimilarityTrees to APTEDAnalyzer for single-traversal
  distance+similarity on TreeNodes
- Add ComputeDistanceAndSimilarity(CodeFragment) to SimilarityAnalyzer interface
- Implement for all analyzers: APTED, Semantic, Structural, Syntactic, Textual
- Update compareFragmentsWithClassifier and compareWithAPTED to use the
  combined method instead of separate ComputeDistance + ComputeSimilarity calls
- Eliminates 2x APTED tree-edit-distance computation per fragment pair

Fixes regression introduced in PR ludo-technologies#452 where every clone pair paid double
APTED cost due to separate ComputeDistance and ComputeSimilarity calls.
@CatfishGG CatfishGG force-pushed the fix/tfidf-clone-detection-v2 branch from 3b239c3 to 4eb7344 Compare May 28, 2026 12:07
@CatfishGG

CatfishGG commented May 28, 2026

Author

Hey Yoda, the conflict is resolved. Rebased onto main and fixed the duplicate method declarations. All tests green.

Also sorry for the new PR #491 I created earlier, that was a mishap on my end. My bad 😭.

Ready for your review whenever you get a chance.

@CatfishGG CatfishGG requested a review from DaisukeYoda May 28, 2026 13:57
@DaisukeYoda

Member

@CatfishGG

Thanks for splitting this up! The main fix here looks good.
But this PR still has all the TF-IDF work in it. We agreed that goes in PR B, not this one. So this PR isn't fully split yet.
Could you take out everything related to TF-IDF and move it to PR B? Once this PR has only the fix we talked about (and none of the TF-IDF stuff), it'll be ready.
Let me know if you want help figuring out which parts are TF-IDF.

P.S. No worries at all about PR #491. I don't mind.

@CatfishGG

Author

Hey @DaisukeYoda, sorry about the mess. PR A is now clean - everything TF-IDF is stripped out and lives in PR B instead: #492

Thanks for your patience through all that confusion.

@DaisukeYoda

Member

@CatfishGG
Thanks, the TF-IDF removal looks great now and the split is clean.

But there's a problem: this branch is also deleting a bunch of unrelated APTED code (the large-tree optimization and its tests). It looks like your branch was started from an older version of main, so rebasing brought back the old code and wiped out a perf fix we merged recently.

Could you rebase your branch onto the latest main and double-check that the APTED large-tree code is still there?

@CatfishGG CatfishGG force-pushed the fix/tfidf-clone-detection-v2 branch from b7fb58f to 4eb7344 Compare May 30, 2026 13:57
2 participants