-
Notifications
You must be signed in to change notification settings - Fork 14
Add alternate implementation for split_dataframe_by_prefix()
#611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add alternate implementation for split_dataframe_by_prefix()
#611
Conversation
In the previous implementation of the function, the entire MappingSetDataFrame was being iterated over many more times than necessary. This changes it to only go through once. All logging and output remains the same.
6f313d7 to
0e7cef7
Compare
| for mapping in msdf.df.itertuples(index=False): | ||
| group = SSSOMSplitGroup( | ||
| parse_curie(getattr(mapping, SUBJECT_ID), strict=True).prefix, | ||
| parse_curie(getattr(mapping, OBJECT_ID), strict=True).prefix, | ||
| parse_curie(getattr(mapping, PREDICATE_ID), strict=True), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's a more straighforward way for iterating over the subset of columns you want:
| for mapping in msdf.df.itertuples(index=False): | |
| group = SSSOMSplitGroup( | |
| parse_curie(getattr(mapping, SUBJECT_ID), strict=True).prefix, | |
| parse_curie(getattr(mapping, OBJECT_ID), strict=True).prefix, | |
| parse_curie(getattr(mapping, PREDICATE_ID), strict=True), | |
| ) | |
| for subject_id, predicate_id, object_id in df[[SUBJECT_ID, PREDICATE_ID, OBJECT_ID]].values: | |
| group = SSSOMSplitGroup( | |
| parse_curie(subject_id, strict=True).prefix, | |
| parse_curie(object_id, strict=True).prefix, | |
| parse_curie(predicate_id, strict=True), | |
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change. I used itertuples() on purpose because I thought it would save some memory/processing, but after profiling- it doesn't!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I remember the other reason I did it. In the next line, I'm storing the whole row (as a NamedTuple) in a list (mappings_by_group[group].append(mapping)). That list is parsed into a dataframe later:
for group in expected_split_groups:
...
mappings = mappings_by_group.get(group, None)
...
split_to_msdf[split] = from_sssom_dataframe(
pd.DataFrame(mappings), prefix_map=dict(subconverter.bimap), meta=meta
)
|
@ptgolden I just added a test case to make sure we don't break backwards compatibility with handling of CURIEs that aren't registered properly in the internal converter. We'll also want to address this |
492e763 to
6660a26
Compare
|
I just merged all of the code together so we could get a better look at the time: $ wget https://github.com/monarch-initiative/mondo/raw/refs/heads/master/src/ontology/mappings/mondo.sssom.tsv
$ mkdir results
$ time sssom split mondo.sssom.tsv -d results
________________________________________________________
Executed in 14.16 secs fish external
usr time 14.16 secs 0.19 millis 14.16 secs
sys time 3.49 secs 2.25 millis 3.48 secs
$ time sssom split mondo.sssom.tsv -d results --method patrick
________________________________________________________
Executed in 12.76 secs fish external
usr time 12.80 secs 0.20 millis 12.80 secs
sys time 3.44 secs 2.35 millis 3.43 secs
$ time sssom split mondo.sssom.tsv -d results --method disjoint-indexes
________________________________________________________
Executed in 12.74 secs fish external
usr time 12.68 secs 0.12 millis 12.68 secs
sys time 3.59 secs 1.49 millis 3.58 secs
$ time sssom split mondo.sssom.tsv -d results --method dense-indexes
________________________________________________________
Executed in 12.62 secs fish external
usr time 12.64 secs 0.19 millis 12.64 secs
sys time 3.51 secs 2.29 millis 3.50 secsIt might also be worth doing a memory profiling. If they're all comparable, then I would definitely opt for more idiomatic, reusable code rather than using a new custom implementation. @ptgolden can you run the same profiles I did above and let me know what it looks like? |
split_dataframe_by_prefix()
Replaces #609. I started from scratch with a total rewrite rather than adding to the commit history of that one. The output here is identical, but it's hopefully easier to follow.
The point is the same: when splitting a MappingSetDataFrame by
lsubject prefixes,mobject prefixes, andnrelations, iterate overmsdf.dfonce rather thanl * m * ntimes.