-
Notifications
You must be signed in to change notification settings - Fork 15
Split dataframe optimization #609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split dataframe optimization #609
Conversation
There was a bit too much iteration going on in `parsers.split_dataframe_by_prefix`-- for every combination of prefixes*relations*objects, the entire input dataframe was being iterated over. This was a lot of overhead, when it only needs to be gone through once. This is basically the same approach used in sssom-java (see note at bottom), but there is some more work being done to keep logging consistent with the previous approach. In testing splitting `mondo.sssom.tsv` (a 25mb file), the time to split went from ~5 minutes to ~50 seconds. Nearly all of the time spent is on calling `parsers.from_sssom_dataframe`, which is slow but inevitable. <https://github.com/gouttegd/sssom-java/blob/main/cli/src/main/java/org/incenp/obofoundry/sssom/cli/SimpleCLI.java#L646-L670>
|
@ptgolden that's an incredible speedup! Nice work! Do you have a link to this mondo.sssom.tsv for my own testing purposes? |
src/sssom/parsers.py
Outdated
|
|
||
|
|
||
| @dataclass(frozen=True, eq=True) | ||
| class SSSOMSplitTriple: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use a named tuple here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely could do! Although that would preclude having methods on the class. (Those methods were for convenience, since they're used twice).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's true. You can still have class methods and instance methods when you subclass from typing.NamedTuple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, true, a subclass of NamedTuple could add those methods. I am happy to change that- though what is the advantage over a hashable dataclass (i.e. frozen=True) at that point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow, subclassing NamedTuple is a good deal more ergonomic than it used to be. Change incoming!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized I left a useless __post_init__ method in there also.
src/sssom/parsers.py
Outdated
| relation: str | ||
|
|
||
| def __post_init__(self): | ||
| relation_prefix, relation_id = self.relation.split(":") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the curies package for CURIE parsing! Within the MSDF, you can access the converter directly with msdf.converter and use msdf.converter.parse_curie
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do 👍
Just replied at #608 (comment) |
matentzn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much!
My python reading is a bit rusty.
- I checked the tests and they make sense
- The code is well written. I do not know whether OO (i.e.
SSSOMSplitTriple) is the way things are done still, but I dont see a problem.
@cthoyt do you have any major objections (since this is, or might be, an interim solution and it will help me shave A LOT of time of my existing pipelines).
| f"{self.json_file} has the wrong number of mappings.", | ||
| ) | ||
|
|
||
| def test_split_msdf(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the tests!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that the test for split in the CLI only checked if the command ran without error, without checking any output 😵💫
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don’t get me on a rant about the cli testing here…
Renamed the NamedTuple which dictates how mappings should be split, from SSSOMSplitTriple to SSSOMSplitGroup. This tuple consists of the combination of: 1. subject_prefix 2. object_prefix 3. relation_curie This clarifies the purpose of the class-- it may be useful to group instead by only 1 & 2 (this is what the `split` cli option in sssom-java does). Added/clarified documentation.
...just renamed some variables. would have amended my previous commit but I had already pushed up.
|
Thanks, @cthoyt. I've refactored a bit and I think it's a good deal better now. Let me know if there's anything else you can see. |
d439b60 to
cfc2ea1
Compare
cthoyt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try and get the tests to pass, then I can take another look and we can get to refactoring.
|
@cthoyt d439b60 actually introduces a change in behavior. Previously, a mapping would only be added to the split if its subject prefix, object prefix, and relation were all in the iterables passed as arguments to the function. This was the purpose of building up a dict first (rather than using defaultdict)-- to have a known set of groups which were meant to be populated. After this change, all mappings are added to the split regardless of whether that's the case. Note that this won't change behavior in the case of |
|
I realized that almost immediately and reverted it. Should be back to what you had though this can definitely use refactoring because it is hard to understand |
|
Understood. Most of the complexity derives from having to keep track of state for logging purposes. Is this logging even necessary?: This will never be triggered if someone calls |
|
I'm closing this one in favor of #611. |
There was too much iteration going on in
parsers.split_dataframe_by_prefix-- for every combination of prefixes*relations*objects, the entire input dataframe was being iterated over. This was a lot of overhead, when it only needs to be gone through once.This is basically the same approach used in sssom-java, but there is some more work being done to keep logging consistent with the previous approach. If there were no logging, the dataclass included here (which reduces repetition), would not be necessary.
In testing splitting
mondo.sssom.tsv(a 25mb file), the time to split went from ~5 minutes to ~50 seconds. Nearly all of the time spent is on callingparsers.from_sssom_dataframe, which is slow but inevitable.Fixes #607