Add alternate implementation for `split_dataframe_by_prefix()` #611

ptgolden · 2025-09-07T04:21:26Z

Replaces #609. I started from scratch with a total rewrite rather than adding to the commit history of that one. The output here is identical, but it's hopefully easier to follow.

The point is the same: when splitting a MappingSetDataFrame by l subject prefixes, m object prefixes, and n relations, iterate over msdf.df once rather than l * m * n times.

In the previous implementation of the function, the entire MappingSetDataFrame was being iterated over many more times than necessary. This changes it to only go through once. All logging and output remains the same.

cthoyt · 2025-09-08T13:57:43Z

src/sssom/parsers.py

+    for mapping in msdf.df.itertuples(index=False):
+        group = SSSOMSplitGroup(
+            parse_curie(getattr(mapping, SUBJECT_ID), strict=True).prefix,
+            parse_curie(getattr(mapping, OBJECT_ID), strict=True).prefix,
+            parse_curie(getattr(mapping, PREDICATE_ID), strict=True),
+        )


here's a more straighforward way for iterating over the subset of columns you want:

Suggested change

for mapping in msdf.df.itertuples(index=False):

group = SSSOMSplitGroup(

parse_curie(getattr(mapping, SUBJECT_ID), strict=True).prefix,

parse_curie(getattr(mapping, OBJECT_ID), strict=True).prefix,

parse_curie(getattr(mapping, PREDICATE_ID), strict=True),

)

for subject_id, predicate_id, object_id in df[[SUBJECT_ID, PREDICATE_ID, OBJECT_ID]].values:

group = SSSOMSplitGroup(

parse_curie(subject_id, strict=True).prefix,

parse_curie(object_id, strict=True).prefix,

parse_curie(predicate_id, strict=True),

)

Will change. I used itertuples() on purpose because I thought it would save some memory/processing, but after profiling- it doesn't!

Ah, I remember the other reason I did it. In the next line, I'm storing the whole row (as a NamedTuple) in a list (mappings_by_group[group].append(mapping)). That list is parsed into a dataframe later:

for group in expected_split_groups: ... mappings = mappings_by_group.get(group, None) ... split_to_msdf[split] = from_sssom_dataframe( pd.DataFrame(mappings), prefix_map=dict(subconverter.bimap), meta=meta )

cthoyt · 2025-09-08T15:29:48Z

@ptgolden I just added a test case to make sure we don't break backwards compatibility with handling of CURIEs that aren't registered properly in the internal converter. We'll also want to address this

cthoyt · 2025-09-22T14:53:02Z

I just merged all of the code together so we could get a better look at the time:

$ wget https://github.com/monarch-initiative/mondo/raw/refs/heads/master/src/ontology/mappings/mondo.sssom.tsv
$ mkdir results

$ time sssom split mondo.sssom.tsv -d results
________________________________________________________
Executed in   14.16 secs    fish           external
   usr time   14.16 secs    0.19 millis   14.16 secs
   sys time    3.49 secs    2.25 millis    3.48 secs

$ time sssom split mondo.sssom.tsv -d results --method patrick
________________________________________________________
Executed in   12.76 secs    fish           external
   usr time   12.80 secs    0.20 millis   12.80 secs
   sys time    3.44 secs    2.35 millis    3.43 secs

$ time sssom split mondo.sssom.tsv -d results --method disjoint-indexes
________________________________________________________
Executed in   12.74 secs    fish           external
   usr time   12.68 secs    0.12 millis   12.68 secs
   sys time    3.59 secs    1.49 millis    3.58 secs

$ time sssom split mondo.sssom.tsv -d results --method dense-indexes
________________________________________________________
Executed in   12.62 secs    fish           external
   usr time   12.64 secs    0.19 millis   12.64 secs
   sys time    3.51 secs    2.29 millis    3.50 secs

It might also be worth doing a memory profiling. If they're all comparable, then I would definitely opt for more idiomatic, reusable code rather than using a new custom implementation. @ptgolden can you run the same profiles I did above and let me know what it looks like?

ptgolden mentioned this pull request Sep 7, 2025

Split dataframe optimization #609

Closed

Optimize split_dataframe_by_prefix

0e7cef7

In the previous implementation of the function, the entire MappingSetDataFrame was being iterated over many more times than necessary. This changes it to only go through once. All logging and output remains the same.

ptgolden force-pushed the split-dataframe-optimization-v2 branch from 6f313d7 to 0e7cef7 Compare September 7, 2025 04:32

cthoyt reviewed Sep 8, 2025

View reviewed changes

cthoyt mentioned this pull request Sep 8, 2025

Improve dataframe split test #614

Merged

cthoyt mentioned this pull request Sep 22, 2025

Add alternate dataframe split implementations #608

Merged

Update parsers.py

6660a26

cthoyt force-pushed the split-dataframe-optimization-v2 branch from 492e763 to 6660a26 Compare September 22, 2025 14:38

cthoyt added 3 commits September 22, 2025 16:39

Merge branch 'master' into pr/611

9954e98

Update parsers.py

77e02f8

Add FIXMEs where fail on unregistered prefixes

5d85412

cthoyt marked this pull request as draft September 29, 2025 12:46

cthoyt changed the title ~~Optimize split_dataframe_by_prefix~~ Add alternate implementation for split_dataframe_by_prefix() Sep 29, 2025

cthoyt mentioned this pull request Oct 9, 2025

Unnecessary repeated iteration in split_dataframe_by_prefix #607

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add alternate implementation for `split_dataframe_by_prefix()` #611

Add alternate implementation for `split_dataframe_by_prefix()` #611

Uh oh!

ptgolden commented Sep 7, 2025

Uh oh!

cthoyt Sep 8, 2025 •

edited

Loading

Uh oh!

ptgolden Sep 9, 2025

Uh oh!

ptgolden Sep 12, 2025

Uh oh!

cthoyt commented Sep 8, 2025

Uh oh!

cthoyt commented Sep 22, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add alternate implementation for split_dataframe_by_prefix() #611

Are you sure you want to change the base?

Add alternate implementation for split_dataframe_by_prefix() #611

Uh oh!

Conversation

ptgolden commented Sep 7, 2025

Uh oh!

cthoyt Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ptgolden Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

ptgolden Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

cthoyt commented Sep 8, 2025

Uh oh!

cthoyt commented Sep 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add alternate implementation for `split_dataframe_by_prefix()` #611

Add alternate implementation for `split_dataframe_by_prefix()` #611

cthoyt Sep 8, 2025 •

edited

Loading

cthoyt commented Sep 22, 2025 •

edited

Loading