Skip to content

Conversation

@ptgolden
Copy link

@ptgolden ptgolden commented Sep 7, 2025

Replaces #609. I started from scratch with a total rewrite rather than adding to the commit history of that one. The output here is identical, but it's hopefully easier to follow.

The point is the same: when splitting a MappingSetDataFrame by l subject prefixes, m object prefixes, and n relations, iterate over msdf.df once rather than l * m * n times.

In the previous implementation of the function, the entire
MappingSetDataFrame was being iterated over many more times than
necessary. This changes it to only go through once. All logging and
output remains the same.
@ptgolden ptgolden force-pushed the split-dataframe-optimization-v2 branch from 6f313d7 to 0e7cef7 Compare September 7, 2025 04:32
Comment on lines 1036 to 1028
for mapping in msdf.df.itertuples(index=False):
group = SSSOMSplitGroup(
parse_curie(getattr(mapping, SUBJECT_ID), strict=True).prefix,
parse_curie(getattr(mapping, OBJECT_ID), strict=True).prefix,
parse_curie(getattr(mapping, PREDICATE_ID), strict=True),
)
Copy link
Member

@cthoyt cthoyt Sep 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's a more straighforward way for iterating over the subset of columns you want:

Suggested change
for mapping in msdf.df.itertuples(index=False):
group = SSSOMSplitGroup(
parse_curie(getattr(mapping, SUBJECT_ID), strict=True).prefix,
parse_curie(getattr(mapping, OBJECT_ID), strict=True).prefix,
parse_curie(getattr(mapping, PREDICATE_ID), strict=True),
)
for subject_id, predicate_id, object_id in df[[SUBJECT_ID, PREDICATE_ID, OBJECT_ID]].values:
group = SSSOMSplitGroup(
parse_curie(subject_id, strict=True).prefix,
parse_curie(object_id, strict=True).prefix,
parse_curie(predicate_id, strict=True),
)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change. I used itertuples() on purpose because I thought it would save some memory/processing, but after profiling- it doesn't!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I remember the other reason I did it. In the next line, I'm storing the whole row (as a NamedTuple) in a list (mappings_by_group[group].append(mapping)). That list is parsed into a dataframe later:

    for group in expected_split_groups:
        ...
        mappings = mappings_by_group.get(group, None)
        ...
        split_to_msdf[split] = from_sssom_dataframe(
            pd.DataFrame(mappings), prefix_map=dict(subconverter.bimap), meta=meta
        )

@cthoyt
Copy link
Member

cthoyt commented Sep 8, 2025

@ptgolden I just added a test case to make sure we don't break backwards compatibility with handling of CURIEs that aren't registered properly in the internal converter. We'll also want to address this

@cthoyt cthoyt force-pushed the split-dataframe-optimization-v2 branch from 492e763 to 6660a26 Compare September 22, 2025 14:38
@cthoyt
Copy link
Member

cthoyt commented Sep 22, 2025

I just merged all of the code together so we could get a better look at the time:

$ wget https://github.com/monarch-initiative/mondo/raw/refs/heads/master/src/ontology/mappings/mondo.sssom.tsv
$ mkdir results

$ time sssom split mondo.sssom.tsv -d results
________________________________________________________
Executed in   14.16 secs    fish           external
   usr time   14.16 secs    0.19 millis   14.16 secs
   sys time    3.49 secs    2.25 millis    3.48 secs

$ time sssom split mondo.sssom.tsv -d results --method patrick
________________________________________________________
Executed in   12.76 secs    fish           external
   usr time   12.80 secs    0.20 millis   12.80 secs
   sys time    3.44 secs    2.35 millis    3.43 secs

$ time sssom split mondo.sssom.tsv -d results --method disjoint-indexes
________________________________________________________
Executed in   12.74 secs    fish           external
   usr time   12.68 secs    0.12 millis   12.68 secs
   sys time    3.59 secs    1.49 millis    3.58 secs

$ time sssom split mondo.sssom.tsv -d results --method dense-indexes
________________________________________________________
Executed in   12.62 secs    fish           external
   usr time   12.64 secs    0.19 millis   12.64 secs
   sys time    3.51 secs    2.29 millis    3.50 secs

It might also be worth doing a memory profiling. If they're all comparable, then I would definitely opt for more idiomatic, reusable code rather than using a new custom implementation. @ptgolden can you run the same profiles I did above and let me know what it looks like?

@cthoyt cthoyt marked this pull request as draft September 29, 2025 12:46
@cthoyt cthoyt changed the title Optimize split_dataframe_by_prefix Add alternate implementation for split_dataframe_by_prefix() Sep 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants