Skip to content

Conversation

@ptgolden
Copy link

@ptgolden ptgolden commented Sep 5, 2025

There was too much iteration going on in parsers.split_dataframe_by_prefix-- for every combination of prefixes*relations*objects, the entire input dataframe was being iterated over. This was a lot of overhead, when it only needs to be gone through once.

This is basically the same approach used in sssom-java, but there is some more work being done to keep logging consistent with the previous approach. If there were no logging, the dataclass included here (which reduces repetition), would not be necessary.

In testing splitting mondo.sssom.tsv (a 25mb file), the time to split went from ~5 minutes to ~50 seconds. Nearly all of the time spent is on calling parsers.from_sssom_dataframe, which is slow but inevitable.

Fixes #607

There was a bit too much iteration going on in
`parsers.split_dataframe_by_prefix`-- for every combination of
prefixes*relations*objects, the entire input dataframe was being
iterated over. This was a lot of overhead, when it only needs to be gone
through once.

This is basically the same approach used in sssom-java (see note at
bottom), but there is some more work being done to keep logging
consistent with the previous approach.

In testing splitting `mondo.sssom.tsv` (a 25mb file), the time to split
went from ~5 minutes to ~50 seconds. Nearly all of the time spent is on
calling `parsers.from_sssom_dataframe`, which is slow but inevitable.

<https://github.com/gouttegd/sssom-java/blob/main/cli/src/main/java/org/incenp/obofoundry/sssom/cli/SimpleCLI.java#L646-L670>
@cthoyt
Copy link
Member

cthoyt commented Sep 5, 2025

@ptgolden that's an incredible speedup! Nice work! Do you have a link to this mondo.sssom.tsv for my own testing purposes?



@dataclass(frozen=True, eq=True)
class SSSOMSplitTriple:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use a named tuple here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely could do! Although that would preclude having methods on the class. (Those methods were for convenience, since they're used twice).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's true. You can still have class methods and instance methods when you subclass from typing.NamedTuple

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, true, a subclass of NamedTuple could add those methods. I am happy to change that- though what is the advantage over a hashable dataclass (i.e. frozen=True) at that point?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, subclassing NamedTuple is a good deal more ergonomic than it used to be. Change incoming!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized I left a useless __post_init__ method in there also.

relation: str

def __post_init__(self):
relation_prefix, relation_id = self.relation.split(":")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the curies package for CURIE parsing! Within the MSDF, you can access the converter directly with msdf.converter and use msdf.converter.parse_curie

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do 👍

@ptgolden
Copy link
Author

ptgolden commented Sep 5, 2025

@ptgolden that's an incredible speedup! Nice work! Do you have a link to this mondo.sssom.tsv for my own testing purposes?

Just replied at #608 (comment)

Copy link
Collaborator

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much!

My python reading is a bit rusty.

  1. I checked the tests and they make sense
  2. The code is well written. I do not know whether OO (i.e. SSSOMSplitTriple) is the way things are done still, but I dont see a problem.

@cthoyt do you have any major objections (since this is, or might be, an interim solution and it will help me shave A LOT of time of my existing pipelines).

f"{self.json_file} has the wrong number of mappings.",
)

def test_split_msdf(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the tests!!!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that the test for split in the CLI only checked if the command ran without error, without checking any output 😵‍💫

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don’t get me on a rant about the cli testing here…

@cthoyt
Copy link
Member

cthoyt commented Sep 5, 2025

@matentzn yes there are major objections to the way this is written, I will gently guide @ptgolden though writing nicer python code before making a merge

Renamed the NamedTuple which dictates how mappings should be split, from
SSSOMSplitTriple to SSSOMSplitGroup. This tuple consists of the
combination of:

  1. subject_prefix
  2. object_prefix
  3. relation_curie

This clarifies the purpose of the class-- it may be useful to group
instead by only 1 & 2 (this is what the `split` cli option in sssom-java
does).

Added/clarified documentation.
...just renamed some variables. would have amended my previous commit
but I had already pushed up.
@ptgolden
Copy link
Author

ptgolden commented Sep 5, 2025

Thanks, @cthoyt. I've refactored a bit and I think it's a good deal better now. Let me know if there's anything else you can see.

@cthoyt cthoyt force-pushed the split-dataframe-optimization branch from d439b60 to cfc2ea1 Compare September 6, 2025 14:39
Copy link
Member

@cthoyt cthoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try and get the tests to pass, then I can take another look and we can get to refactoring.

@ptgolden
Copy link
Author

ptgolden commented Sep 6, 2025

@cthoyt d439b60 actually introduces a change in behavior. Previously, a mapping would only be added to the split if its subject prefix, object prefix, and relation were all in the iterables passed as arguments to the function. This was the purpose of building up a dict first (rather than using defaultdict)-- to have a known set of groups which were meant to be populated. After this change, all mappings are added to the split regardless of whether that's the case.

Note that this won't change behavior in the case of split_dataframe (since all of the object prefixes, subject prefixes, and relations are passed to split_dataframe_by_prefix there), but it would affect behavior when called with different arguments.

@cthoyt
Copy link
Member

cthoyt commented Sep 6, 2025

I realized that almost immediately and reverted it. Should be back to what you had

though this can definitely use refactoring because it is hard to understand

@ptgolden
Copy link
Author

ptgolden commented Sep 6, 2025

Understood. Most of the complexity derives from having to keep track of state for logging purposes. Is this logging even necessary?:

        if subject_prefix not in msdf.converter.bimap:
            logging.warning(f"{split_id} - missing subject prefix - {subject_prefix}")
            continue
        if object_prefix not in msdf.converter.bimap:
            logging.warning(f"{split_id} - missing object prefix - {object_prefix}")
            continue

This will never be triggered if someone calls split_dataframe, only if they call split_dataframe_by_prefix with other arguments, in which case they probably explicitly want to skip certain prefixes.

@ptgolden
Copy link
Author

ptgolden commented Sep 7, 2025

I'm closing this one in favor of #611.

@ptgolden ptgolden closed this Sep 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unnecessary repeated iteration in split_dataframe_by_prefix

3 participants