fix(analyzer): restore Features fast path and add ComputeDistanceAndSimilarity (PR A, fixes #315)#493

Closed
CatfishGG wants to merge 7 commits into
ludo-technologies:mainfrom
CatfishGG:fix/clone-detector-cleanup

@CatfishGG

PR A: Non-TF-IDF changes from the combined PR #453.

Includes:

  • Features fast path in ComputeSimilarity (fixes a regression from #438, "Speed up analysis pipeline")
  • ComputeDistanceAndSimilarity to eliminate double APTED runs
  • splitLines source extraction optimization

TF-IDF implementation will come in PR B.
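
For readers unfamiliar with the term, a "Features fast path" can be pictured roughly as below. This is only a sketch under assumptions: `Fragment`, `extractFeatures`, `featuresOf`, and `extractCalls` are hypothetical names, and the whitespace split stands in for whatever extraction the analyzer really performs. The idea is that a fragment carrying precomputed features skips the expensive extraction step.

```go
package main

import "strings"

// Fragment is a hypothetical stand-in for the analyzer's code fragment type.
type Fragment struct {
	Source   string
	Features []string // precomputed features; nil when not yet extracted
}

// extractCalls counts slow-path extractions so the effect is observable.
var extractCalls int

// extractFeatures is a placeholder for the expensive extraction step.
// Here it simply splits the source on whitespace (an assumption).
func extractFeatures(src string) []string {
	extractCalls++
	return strings.Fields(src)
}

// featuresOf returns the precomputed features when present (fast path)
// and falls back to extraction otherwise (slow path).
func featuresOf(f *Fragment) []string {
	if f.Features != nil {
		return f.Features
	}
	return extractFeatures(f.Source)
}
```

With this shape, a similarity routine that calls `featuresOf` on both fragments pays the extraction cost only for fragments that arrive without features.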

minitester added 7 commits May 19, 2026 11:22
…fixes ludo-technologies#315)

- Add TFIDFCalculator type with ComputeIDF, IDF, ToWeightedVector, CosineSimilarity
- Update SimilarityAnalyzer interface: ComputeSimilarity now takes *TFIDFCalculator param
- Rename APTED tree-based methods to ComputeDistanceTrees/ComputeSimilarityTrees
- Add CodeFragment-based ComputeSimilarity/ComputeDistance to APTEDAnalyzer (interface impl)
- Add UseTFIDF config flag to CloneDetectorConfig
- SyntacticSimilarityAnalyzer uses TF-IDF weighted cosine similarity when calc provided
- Update CloneClassifier.ClassifyClone to accept *TFIDFCalculator for TF-IDF aware Type-2
- Fix all test files to use correct method signatures
- Fix LSH integration to use integer IDs (compatible with upstream lsh_index.go)

Cherry-picked from fix/tfidf-target with conflict resolution for LSH integer API

Covers: NewTFIDFCalculator, IDF, ComputeIDF (cache hit/miss), ToWeightedVector,
CosineSimilarity (empty, identical, orthogonal, partial overlap, zero norm)
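
To make the covered cases concrete, here is a minimal sketch of TF-IDF weighted cosine similarity. The type and method names follow the commit message, but the fields, signatures, the smoothed IDF formula `log((1+N)/(1+df)) + 1`, and the default weight of 1 for unseen features are all assumptions for illustration, not the code from this branch.

```go
package main

import "math"

// TFIDFCalculator holds inverse document frequencies learned from a corpus.
// Fields and signatures are assumptions made for this sketch.
type TFIDFCalculator struct {
	idf map[string]float64
}

func NewTFIDFCalculator() *TFIDFCalculator {
	return &TFIDFCalculator{idf: make(map[string]float64)}
}

// ComputeIDF learns a smoothed IDF, log((1+N)/(1+df)) + 1, for every feature.
func (c *TFIDFCalculator) ComputeIDF(docs [][]string) {
	df := make(map[string]int)
	for _, doc := range docs {
		seen := make(map[string]bool)
		for _, f := range doc {
			if !seen[f] {
				seen[f] = true
				df[f]++
			}
		}
	}
	n := float64(len(docs))
	for f, d := range df {
		c.idf[f] = math.Log((1+n)/(1+float64(d))) + 1
	}
}

// IDF returns the learned weight, or 1 for a feature never seen in the
// corpus (a neutral default chosen arbitrarily for this sketch).
func (c *TFIDFCalculator) IDF(f string) float64 {
	if w, ok := c.idf[f]; ok {
		return w
	}
	return 1
}

// ToWeightedVector maps each feature to (occurrence count) * IDF.
func (c *TFIDFCalculator) ToWeightedVector(features []string) map[string]float64 {
	vec := make(map[string]float64)
	for _, f := range features {
		vec[f] += c.IDF(f)
	}
	return vec
}

// CosineSimilarity returns dot(a, b) / (|a| * |b|), or 0 when either
// vector has zero norm (which also covers the empty-vector case).
func CosineSimilarity(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for k, v := range a {
		dot += v * b[k] // missing keys in b contribute 0
		na += v * v
	}
	for _, v := range b {
		nb += v * v
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```

Under these assumptions, identical vectors score 1 (up to floating-point rounding), vectors with no shared features score 0, and an empty or zero-norm vector scores 0 instead of dividing by zero.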
…ble APTED runs

- Add ComputeDistanceAndSimilarityTrees to APTEDAnalyzer for single-traversal
  distance+similarity on TreeNodes
- Add ComputeDistanceAndSimilarity(CodeFragment) to SimilarityAnalyzer interface
- Implement for all analyzers: APTED, Semantic, Structural, Syntactic, Textual
- Update compareFragmentsWithClassifier and compareWithAPTED to use the
  combined method instead of separate ComputeDistance + ComputeSimilarity calls
- Eliminates the duplicate APTED tree-edit-distance computation per fragment pair

Fixes regression introduced in PR ludo-technologies#452 where every clone pair paid double
APTED cost due to separate ComputeDistance and ComputeSimilarity calls.
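
The saving from a combined method can be illustrated with a small sketch. Note the substitutions: a token-level Levenshtein distance stands in for APTED's tree edit distance, and `editAnalyzer`, its `runs` counter, and the `1 - distance/maxLen` normalization are hypothetical choices for this example, not the analyzer's actual formula.

```go
package main

// editAnalyzer is a hypothetical analyzer whose distance routine is expensive.
type editAnalyzer struct {
	runs int // how many times the expensive routine has executed
}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// distance computes a token-level Levenshtein distance using two rows.
// It stands in for the far more expensive tree edit distance.
func (a *editAnalyzer) distance(x, y []string) float64 {
	a.runs++
	prev := make([]int, len(y)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(x); i++ {
		cur := make([]int, len(y)+1)
		cur[0] = i
		for j := 1; j <= len(y); j++ {
			cost := 1
			if x[i-1] == y[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return float64(prev[len(y)])
}

// ComputeDistanceAndSimilarity runs the expensive routine once and derives
// the similarity from that single result.
func (a *editAnalyzer) ComputeDistanceAndSimilarity(x, y []string) (float64, float64) {
	d := a.distance(x, y)
	maxLen := len(x)
	if len(y) > maxLen {
		maxLen = len(y)
	}
	if maxLen == 0 {
		return 0, 1 // two empty inputs are treated as identical
	}
	return d, 1 - d/float64(maxLen)
}
```

A caller that previously invoked a distance method and then a similarity method would trigger `distance` twice per pair. With the combined method the counter advances once, which is the saving the commit describes.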
Yoda review feedback: move TF-IDF implementation to a separate PR (PR B).

This commit reverts the SimilarityAnalyzer interface back to the original
2-argument signature, removes UseTFIDF config flag and tfidfCalculator
field from CloneDetector, and replaces tfidf.go with an empty stub
struct. The full TF-IDF implementation (tfidf.go logic, tfidf_test.go,
UseTFIDF flag, tfidfCalculator, *TFIDFCalculator interface param) stays
in the backup at /tmp/tfidf_backup.go for PR B.

Also removed the TF-IDF block from SyntacticSimilarityAnalyzer.ComputeSimilarity
and reverted to pure Jaccard-based similarity.
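
For reference, plain Jaccard similarity over feature sets looks roughly like this. It is a sketch only: `jaccard` is a hypothetical helper, and returning 1 for two empty sets is an assumed convention that the real analyzer may not share.

```go
package main

// jaccard returns |A ∩ B| / |A ∪ B| over the distinct features of a and b.
// Duplicate features are ignored because the inputs are treated as sets.
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, f := range a {
		setA[f] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, f := range b {
		setB[f] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1 // assumed convention: two empty sets count as identical
	}
	inter := 0
	for f := range setA {
		if _, ok := setB[f]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}
```

Unlike the TF-IDF variant, every feature counts equally here, so very common features influence the score as much as rare ones.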