-
Notifications
You must be signed in to change notification settings - Fork 100
Speedup for molecule isomorphism checking #2114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
for more information, see https://pre-commit.ci
Codecov Report❌ Patch coverage is 🚀 New features to boost your workflow:
|
| for node in data: | ||
| h_counter = -1 | ||
| for neighbor in data.neighbors(node): | ||
| if data.nodes[neighbor]['atomic_number'] == 1: | ||
| data.nodes[neighbor]['atomic_number'] = h_counter | ||
| h_counter -= 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this depend on the order of the data.neighbors(node) iterator (in problematic ways)? I will try to think of an antagonistic test case tomorrow, if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is an absolutely filthy hack but I've been unable to think of edge cases that would trip it up. The order of the neighbors shouldn't matter since they're all chemically equivalent hydrogens, and we specify stereochemistry on the central atom using CIP rules which should be invariant to the ordering of the bonds/neighbors (as opposed to tying it to the ordering of bonds, where losing track of the Hs' identities could MAYBE scramble things in some complex way).
I'd love any other ideas you come up with, even if they're just smells that you can't solidify into test cases. This is for sure cooked if we ever get a H with two bonds or try to handle deuterium or something.
|
So far I'm seeing a net improvement in some The two environments I used are my Interchange development environment (with this branch installed) and a second with only released versions of our software2
Footnotes |
|
A quick update here - I'm offline for the rest of this week and half of the next, but my plan is to come back around the 27th and try to prove to myself that this is a safe approach and put in tests to raise the alarm if a later change violates some assumption that this needs (no divalent H, no H isotopes, no negative atomic numbers, etc). If folks want to help out by thinking of edge cases/regressions, that could help me get through the QA process faster. |
|
Some results using the faster_isomorphism branch with the highly polydisperse systems that were the use case in issue 1156 (trying to link this properly but it will only let me link the PR with that number). The faster isomorphism branch has consistently lower create_interchange() runtimes compared to using off-toolkit v.0.17.0 when number of unique polymer chain components in the topology is increased. Unclear to me what was causing that huge jump at >40 unique polymer chains, but it appears to have been fixed by the changes in faster_isomorphism. For comparison, here is where I started when this issue was opened (toolkit v.0.16.8) I can post some reproducing code for this benchmark if needed. |
|
Thanks @hannaomi - those results look great! The discontinuity is indeed confusing but in either case the scaling appears to remain linear and the performance is improved. Outside the context of this PR and your benchmarks, we have a couple of other patches in Interchange in the pipelines which might help this even further - both creation and export. Based on some benchmarking I've done, I suspect most of the runtime is Interchange and not the toolkit - but that's not worth a deep dive right now, I don't think. |
…py input, polish docstrings


Topology.identical_molecule_groupsis too slow #2035