Dataset similarity #122

JochenSiegWork · 2025-02-10T13:38:38Z

Add a new class that performs k-nearest neighbor searches using
Tanimoto similarity. The implementation uses a sparse dot product,
making the algorithm 2-3x faster than RDKit's BulkTanimotoSimilarity.
Add a notebook illustrating NearestNeighborsRetrieverTanimoto for
dataset similarity analysis, such as train/test set comparison.
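
For readers unfamiliar with the dot-product formulation, here is a minimal sketch of the idea, not the code from this PR. For binary fingerprints the dot product of two rows counts their shared on-bits, so Tanimoto is `inter / (|a| + |b| - inter)`. The function names `tanimoto_matrix` and `k_nearest` are hypothetical, and the sketch skips the batching and parallelism that the PR's class exposes.

```python
import numpy as np
from scipy import sparse


def tanimoto_matrix(a, b):
    """Pairwise Tanimoto similarity between the rows of two binary matrices.

    Hypothetical helper for illustration; accepts dense arrays or sparse matrices.
    """
    a = sparse.csr_matrix(a, dtype=np.float64)
    b = sparse.csr_matrix(b, dtype=np.float64)
    # Sparse dot product: entry (i, j) is the number of bits set in both rows.
    inter = (a @ b.T).toarray()
    # Number of on-bits per fingerprint.
    a_bits = np.asarray(a.sum(axis=1)).ravel()
    b_bits = np.asarray(b.sum(axis=1)).ravel()
    # Union size via inclusion-exclusion.
    union = a_bits[:, None] + b_bits[None, :] - inter
    # Two all-zero fingerprints have an empty union; report similarity 0 for them.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


def k_nearest(query, reference, k):
    """Return indices and similarities of the k most similar reference rows."""
    sim = tanimoto_matrix(query, reference)
    # Sort by descending similarity; a stable sort keeps ties in index order.
    idx = np.argsort(-sim, axis=1, kind="stable")[:, :k]
    return idx, np.take_along_axis(sim, idx, axis=1)
```

Calling `k_nearest(test_fps, train_fps, k=1)` on two fingerprint matrices would give, for each test compound, its closest training compound, which is the kind of train/test comparison the notebook is described as covering.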

Also addresses #117


c-w-feldmann

In addition, as discussed: make it an estimator.

c-w-feldmann · 2025-02-12T12:13:02Z

molpipeline/estimators/nearest_neighbor.py

+        else:
+            self.k = k
+        self.batch_size = batch_size
+        if n_jobs == -1:


Maybe use this function instead?
https://github.com/basf/MolPipeline/blob/main/molpipeline/utils/multi_proc.py#L9
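
The linked helper is not reproduced in this thread, so the following is only a stand-in sketch of the pattern being suggested: resolve `n_jobs == -1` in one shared function instead of inside the constructor. The name `resolve_n_jobs` and its validation rules are assumptions for illustration and are not the MolPipeline function.

```python
import os


def resolve_n_jobs(n_jobs: int) -> int:
    """Translate an n_jobs setting into a concrete worker count.

    Hypothetical helper: -1 means "use all available cores".
    """
    if n_jobs == -1:
        # os.cpu_count() may return None when the count cannot be determined.
        return os.cpu_count() or 1
    if n_jobs < 1:
        raise ValueError(f"n_jobs must be -1 or a positive integer, got {n_jobs}")
    return n_jobs
```

Centralizing this check keeps every estimator's interpretation of `n_jobs` consistent.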

JochenSiegWork · 2025-03-14T12:30:17Z

In addition to the dot-product Tanimoto, we could also check whether it's possible to add an implementation of iSim: https://github.com/mqcomplab/bitbirch/blob/main/bitbirch.py
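
As a rough illustration of what such an addition might look like, here is a sketch of the column-sum idea behind iSim as I understand it. It estimates a set-level Tanimoto value from per-bit counts in a single pass over the data, without enumerating pairs. The function name `isim_tanimoto` is hypothetical and the code is not taken from the linked file. The result is a ratio of summed counts, so it is not in general identical to the mean of all pairwise Tanimoto values.

```python
import numpy as np


def isim_tanimoto(fingerprints) -> float:
    """Set-level Tanimoto estimate from column sums (hypothetical sketch).

    fingerprints: 2D binary array of shape (n_molecules, n_bits).
    """
    fps = np.asarray(fingerprints, dtype=np.int64)
    n = fps.shape[0]
    if n < 2:
        raise ValueError("need at least two fingerprints")
    # Number of molecules that have each bit set.
    col = fps.sum(axis=0)
    # Pairs sharing an on-bit, summed over all bit positions.
    both_on = np.sum(col * (col - 1)) / 2
    # Pairs that disagree on a bit, summed over all bit positions.
    mismatch = np.sum(col * (n - col))
    denom = both_on + mismatch
    return float(both_on / denom) if denom > 0 else 0.0
```

For exactly two fingerprints this reduces to their ordinary Tanimoto similarity, which makes for a convenient sanity check.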

JochenSiegWork added 2 commits February 10, 2025 14:36

Merge branch 'main' of github.com:basf/MolPipeline into dataset_similarity

4832639

JochenSiegWork requested a review from c-w-feldmann February 10, 2025 13:38

JochenSiegWork self-assigned this Feb 10, 2025

JochenSiegWork added 3 commits February 10, 2025 14:40

docsig

b0df02a

finished notebook execution

21f37e6

add chembl_35_20k.smi.gz example data file

4a3700f

c-w-feldmann requested changes Feb 13, 2025


first part Christians comments

753c9ab

Merge branch 'main' into dataset_similarity

25d6e09

c-w-feldmann marked this pull request as draft April 25, 2025 14:04
