Split dataframe optimization #609

ptgolden · 2025-09-05T15:16:14Z

There was too much iteration going on in `parsers.split_dataframe_by_prefix`: for every combination of prefixes*relations*objects, the entire input dataframe was iterated over. That is a lot of overhead when the dataframe only needs to be traversed once.

This is basically the same approach used in sssom-java, but some extra work is done to keep logging consistent with the previous approach. Without the logging, the dataclass included here (which reduces repetition) would not be necessary.

When splitting `mondo.sssom.tsv` (a 25 MB file) in testing, the time to split went from ~5 minutes to ~50 seconds. Nearly all of that time is spent calling `parsers.from_sssom_dataframe`, which is slow but unavoidable.

Fixes #607
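To make the difference concrete, here is a minimal sketch of the two strategies. It is not the code in this PR: the rows are plain dicts standing in for dataframe rows, and `prefix_of`, `split_repeated`, and `split_single_pass` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical stand-in for mapping rows; the real function works on a DataFrame.
ROWS = [
    {"subject_id": "MONDO:1", "predicate_id": "skos:exactMatch", "object_id": "DOID:1"},
    {"subject_id": "MONDO:2", "predicate_id": "skos:exactMatch", "object_id": "DOID:2"},
    {"subject_id": "MONDO:3", "predicate_id": "skos:broadMatch", "object_id": "OMIM:3"},
]


def prefix_of(curie: str) -> str:
    """Return the text before the first colon (simplified CURIE handling)."""
    return curie.split(":", 1)[0]


def split_repeated(rows, subject_prefixes, object_prefixes, relations):
    """Scan every row once per combination: len(combinations) * len(rows) checks."""
    splits = {}
    for sp, op, rel in product(subject_prefixes, object_prefixes, relations):
        matched = [
            r
            for r in rows
            if prefix_of(r["subject_id"]) == sp
            and prefix_of(r["object_id"]) == op
            and r["predicate_id"] == rel
        ]
        if matched:
            splits[(sp, op, rel)] = matched
    return splits


def split_single_pass(rows, subject_prefixes, object_prefixes, relations):
    """Scan the rows once, bucketing each by its (subject, object, relation) key."""
    wanted = set(product(subject_prefixes, object_prefixes, relations))
    splits = defaultdict(list)
    for r in rows:
        key = (prefix_of(r["subject_id"]), prefix_of(r["object_id"]), r["predicate_id"])
        if key in wanted:  # only keep groups the caller asked for
            splits[key].append(r)
    return dict(splits)
```

Both functions return the same groups for the same arguments; the second touches each row once instead of once per combination.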

The sssom-java implementation referred to above: <https://github.com/gouttegd/sssom-java/blob/main/cli/src/main/java/org/incenp/obofoundry/sssom/cli/SimpleCLI.java#L646-L670>

cthoyt · 2025-09-05T15:23:54Z

@ptgolden that's an incredible speedup! Nice work! Do you have a link to this mondo.sssom.tsv for my own testing purposes?

cthoyt · 2025-09-05T15:24:22Z

src/sssom/parsers.py



+@dataclass(frozen=True, eq=True)
+class SSSOMSplitTriple:


why not use a named tuple here?

Definitely could do! Although that would preclude having methods on the class. (Those methods were for convenience, since they're used twice).

I don't think that's true. You can still have class methods and instance methods when you subclass from typing.NamedTuple

Yes, true, a subclass of NamedTuple could add those methods. I am happy to change that, though what is the advantage over a hashable dataclass (i.e. frozen=True) at that point?

Oh wow, subclassing NamedTuple is a good deal more ergonomic than it used to be. Change incoming!

Just realized I left a useless __post_init__ method in there also.
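For readers following this thread, a minimal sketch of the two options under discussion. The class names, fields, and the `label` method are hypothetical and are not the ones in this PR; the point is only that both forms are hashable and both can carry methods.

```python
from dataclasses import dataclass
from typing import NamedTuple


class GroupAsTuple(NamedTuple):
    """Hypothetical grouping key written as a NamedTuple subclass."""

    subject_prefix: str
    object_prefix: str
    relation: str

    def label(self) -> str:
        # Methods are allowed on class-syntax NamedTuples.
        return f"{self.subject_prefix} -[{self.relation}]-> {self.object_prefix}"


@dataclass(frozen=True, eq=True)
class GroupAsDataclass:
    """The same key written as a frozen (and therefore hashable) dataclass."""

    subject_prefix: str
    object_prefix: str
    relation: str

    def label(self) -> str:
        return f"{self.subject_prefix} -[{self.relation}]-> {self.object_prefix}"


t = GroupAsTuple("MONDO", "DOID", "skos:exactMatch")
d = GroupAsDataclass("MONDO", "DOID", "skos:exactMatch")

# Both can be used as dictionary keys.
counts = {t: 1, d: 2}

# Only the NamedTuple unpacks positionally and compares equal to a plain tuple.
subject_prefix, object_prefix, relation = t
```

One practical difference is that the NamedTuple behaves like an ordinary tuple (unpacking, indexing, equality with plain tuples), which may or may not be desirable depending on how the key is used.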

cthoyt · 2025-09-05T15:25:02Z

src/sssom/parsers.py

+    relation: str
+
+    def __post_init__(self):
+        relation_prefix, relation_id = self.relation.split(":")


use the curies package for CURIE parsing! Within the MSDF, you can access the converter directly with msdf.converter and use msdf.converter.parse_curie

Will do 👍
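As a small illustration of why the unpacking in `__post_init__` is fragile, the sketch below uses only the standard library. It does not reproduce the `curies` API suggested above (`msdf.converter.parse_curie` remains the route recommended in the review), and `parse_naive` and `parse_first_colon` are hypothetical helper names.

```python
def parse_naive(curie: str) -> tuple[str, str]:
    """Mirror the pattern in the diff: unpack the result of split(":")."""
    prefix, identifier = curie.split(":")  # ValueError unless exactly one colon
    return prefix, identifier


def parse_first_colon(curie: str) -> tuple[str, str]:
    """Split on the first colon only, so the identifier may contain colons."""
    prefix, sep, identifier = curie.partition(":")
    if not sep or not prefix:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, identifier


def fails_naively(curie: str) -> bool:
    """Report whether the naive parser raises ValueError for this input."""
    try:
        parse_naive(curie)
    except ValueError:
        return True
    return False
```

Neither helper checks the prefix against a prefix map, which is another reason to prefer a dedicated CURIE parser over string splitting.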

ptgolden · 2025-09-05T15:29:23Z

> @ptgolden that's an incredible speedup! Nice work! Do you have a link to this mondo.sssom.tsv for my own testing purposes?

Just replied at #608 (comment)

matentzn

Thank you so much!

My python reading is a bit rusty.

I checked the tests and they make sense.
The code is well written. I do not know whether OO (i.e. SSSOMSplitTriple) is the way things are done still, but I don't see a problem.

@cthoyt do you have any major objections (since this is, or might be, an interim solution and it will help me shave A LOT of time off my existing pipelines)?

matentzn · 2025-09-05T15:26:04Z

tests/test_parsers.py

            f"{self.json_file} has the wrong number of mappings.",
        )

+    def test_split_msdf(self):


Thank you for the tests!!!

I realized that the test for split in the CLI only checked if the command ran without error, without checking any output 😵‍💫

Don’t get me on a rant about the cli testing here…

cthoyt · 2025-09-05T15:36:28Z

@matentzn yes, there are major objections to the way this is written. I will gently guide @ptgolden through writing nicer Python code before making a merge.

Renamed the NamedTuple which dictates how mappings should be split, from SSSOMSplitTriple to SSSOMSplitGroup. This tuple consists of the combination of:

1. subject_prefix
2. object_prefix
3. relation_curie

This clarifies the purpose of the class: it may be useful to group instead by only 1 & 2 (this is what the `split` cli option in sssom-java does).

Added/clarified documentation.

...just renamed some variables. would have amended my previous commit but I had already pushed up.

ptgolden · 2025-09-05T22:22:26Z

Thanks, @cthoyt. I've refactored a bit and I think it's a good deal better now. Let me know if there's anything else you can see.

cthoyt

Try and get the tests to pass, then I can take another look and we can get to refactoring.

ptgolden · 2025-09-06T14:44:28Z

@cthoyt d439b60 actually introduces a change in behavior. Previously, a mapping would only be added to the split if its subject prefix, object prefix, and relation were all in the iterables passed as arguments to the function. This was the purpose of building up a dict first (rather than using defaultdict): to have a known set of groups which were meant to be populated. After this change, all mappings are added to the split regardless of whether that's the case.

Note that this won't change behavior in the case of split_dataframe (since all of the object prefixes, subject prefixes, and relations are passed to split_dataframe_by_prefix there), but it would affect behavior when called with different arguments.
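The distinction can be sketched as follows. This is an illustration with hypothetical names (`KEYS`, `group_known`, `group_all`) and pre-computed keys in place of real mappings, not the code from either commit.

```python
from collections import defaultdict
from itertools import product

# Hypothetical (subject_prefix, object_prefix, relation) keys, one per mapping.
KEYS = [
    ("MONDO", "DOID", "skos:exactMatch"),
    ("MONDO", "OMIM", "skos:exactMatch"),
    ("MONDO", "DOID", "skos:exactMatch"),
]


def group_known(keys, subject_prefixes, object_prefixes, relations):
    """Pre-build the allowed groups; mappings outside them are skipped."""
    groups = {k: [] for k in product(subject_prefixes, object_prefixes, relations)}
    for index, key in enumerate(keys):
        if key in groups:
            groups[key].append(index)
    return {k: v for k, v in groups.items() if v}  # drop empty groups


def group_all(keys):
    """With a defaultdict, every mapping lands in some group, requested or not."""
    groups = defaultdict(list)
    for index, key in enumerate(keys):
        groups[key].append(index)
    return dict(groups)
```

When the arguments cover every prefix and relation present in the data, the two functions agree; they differ only when the caller passes a narrower set, for example `object_prefixes=["DOID"]`.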

cthoyt · 2025-09-06T14:54:38Z

I realized that almost immediately and reverted it. Should be back to what you had

though this can definitely use refactoring because it is hard to understand

ptgolden · 2025-09-06T14:59:22Z

Understood. Most of the complexity derives from having to keep track of state for logging purposes. Is this logging even necessary?

        if subject_prefix not in msdf.converter.bimap:
            logging.warning(f"{split_id} - missing subject prefix - {subject_prefix}")
            continue
        if object_prefix not in msdf.converter.bimap:
            logging.warning(f"{split_id} - missing object prefix - {object_prefix}")
            continue

This will never be triggered if someone calls split_dataframe, only if they call split_dataframe_by_prefix with other arguments, in which case they probably explicitly want to skip certain prefixes.

ptgolden · 2025-09-07T04:22:44Z

I'm closing this one in favor of #611.

ptgolden added 2 commits September 4, 2025 23:27

Add a basic test for sssom.parsers.split_dataframe

c4b1757

ptgolden mentioned this pull request Sep 5, 2025

Unnecessary repeated iteration in split_dataframe_by_prefix #607

Open

Improve some names in split_dataframe_by_prefix

7233a42

cthoyt reviewed Sep 5, 2025

View reviewed changes

ptgolden mentioned this pull request Sep 5, 2025

Add alternate dataframe split implementations #608

Merged

matentzn reviewed Sep 5, 2025

View reviewed changes

ptgolden added 3 commits September 5, 2025 17:27

Use NamedTuple instead of Dataclass for constructing split triples

ef17d00

Code cleaning in split_dataframe_by_prefix

9015aa1

...just renamed some variables. would have amended my previous commit but I had already pushed up.

lint

7db6484

cthoyt mentioned this pull request Sep 6, 2025

Add test for split dataframe function #610

Merged

cthoyt added 2 commits September 6, 2025 16:34

Merge branch 'master' into pr/609

02c5ccf

Update parsers.py

cfc2ea1

cthoyt force-pushed the split-dataframe-optimization branch from d439b60 to cfc2ea1 Compare September 6, 2025 14:39

cthoyt requested changes Sep 6, 2025

View reviewed changes

ptgolden mentioned this pull request Sep 7, 2025

Add alternate implementation for split_dataframe_by_prefix() #611

Draft

ptgolden closed this Sep 7, 2025
