fix(analyzer): restore Features fast path and add ComputeDistanceAndSimilarity (PR A, fixes #315)#453
Conversation
|
Hey @DaisukeYoda, this is PR A addressing findings #3 and #4 from your review comment. Finding #4 (Features fast path) is restored with a targeted commit. Finding #3 (double APTED run) is fixed by adding ComputeDistanceAndSimilarity to the SimilarityAnalyzer interface and using it in the two comparison methods. Findings #1 and #2 (TF-IDF wiring) are deferred to PR B. |
DaisukeYoda
left a comment
There was a problem hiding this comment.
This PR accidentally reverts the source extraction perf optimization from PR #438 (commit 8655810). Bench impact is roughly 200x slowdown (30µs → 6ms/op).
You can re-apply just the affected portion with:
git show 8655810 -- internal/analyzer/clone_detector.go | git apply
The rest of the PR looks good, thanks!
93f40c4 to
0137a8b
Compare
|
@DaisukeYoda I forgot to ping you, but the revert has been reverted! |
|
@CatfishGG Thanks! Please resolve conflict and then I can approve this PR. |
… add ComputeDistanceAndSimilarity, fix apted.go conflict with ComputeDistanceTrees/SimilarityTrees naming
60821e8 to
3b239c3
Compare
Covers: NewTFIDFCalculator, IDF, ComputeIDF (cache hit/miss), ToWeightedVector, CosineSimilarity (empty, identical, orthogonal, partial overlap, zero norm)
…ble APTED runs - Add ComputeDistanceAndSimilarityTrees to APTEDAnalyzer for single-traversal distance+similarity on TreeNodes - Add ComputeDistanceAndSimilarity(CodeFragment) to SimilarityAnalyzer interface - Implement for all analyzers: APTED, Semantic, Structural, Syntactic, Textual - Update compareFragmentsWithClassifier and compareWithAPTED to use the combined method instead of separate ComputeDistance + ComputeSimilarity calls - Eliminates 2x APTED tree-edit-distance computation per fragment pair Fixes regression introduced in PR ludo-technologies#452 where every clone pair paid double APTED cost due to separate ComputeDistance and ComputeSimilarity calls.
3b239c3 to
4eb7344
Compare
|
Hey Yoda, the conflict is resolved. Rebased onto main and fixed the duplicate method declarations. All tests green. Also sorry for the new PR #491 I created earlier, that was a mishap on my end. My bad 😭. Ready for your review whenever you get a chance. |
|
Thanks for splitting this up! The main fix here looks good. P.S. No worries at all about PR491. I don't mind. |
4eb7344 to
b7fb58f
Compare
|
Hey @DaisukeYoda, sorry about the mess. PR A is now clean - everything TF-IDF is stripped out and lives in PR B instead: #492 Thanks for your patience through all that confusion. |
|
@CatfishGG But there's a problem: this branch is also deleting a bunch of unrelated APTED code (the large-tree optimization and its tests). It looks like your branch was started from an older version of Could you rebase your branch onto the latest |
b7fb58f to
4eb7344
Compare
This is PR A of the split proposed in the original PR #452 (comment 4487566576).
Yoda findings addressed (#3 and #4)
Finding #4: Features fast path lost
SyntacticSimilarityAnalyzer.ComputeSimilarity always re-extracted features via tree traversal, even when both fragments already had pre-computed Features populated in prepareFragments. Restored the short-circuit that uses jaccardSimilarity(f1.Features, f2.Features) directly when features are available.
Finding #3: APTED runs twice per pair
compareFragmentsWithClassifier and compareWithAPTED called ComputeDistance and ComputeSimilarity separately. Since ComputeSimilarityTrees internally calls ComputeDistanceTrees, every fragment pair paid 2x APTED cost. Added ComputeDistanceAndSimilarity to the SimilarityAnalyzer interface and updated both methods to use a single combined call.
Changes
NOT addressed (deferred to PR B)
Testing
All tests pass. go vet and gofmt clean.
cc @DaisukeYoda