Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate enzymes by whether they are produced by bacteria or fungi #766

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

jacobcook1995
Copy link
Collaborator

Description

This PR splits the enzyme pools (which were previously only split by substrate) based on whether they are produced by bacteria or fungi. The reason for making this split is that fungi (all eukaryotes really) produce substantively more complex enzyme than bacteria, and we want to capture this in someway. I've done this in a fairly straightforward manner by just defining two new enzyme pools with new constants to represent that these fungal enzymes are (generally) more effective.

As always feedback on code style/readablity etc would be appreciated

Fixes #761

Type of change

  • New feature (non-breaking change which adds functionality)
  • Optimization (back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist

  • Make sure you've run the pre-commit checks: $ pre-commit run -a
  • All tests pass: $ poetry run pytest

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works
  • Relevant documentation reviewed and updated

@codecov-commenter
Copy link

codecov-commenter commented Mar 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.77%. Comparing base (52f15b8) to head (fc4c04a).
Report is 6 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #766      +/-   ##
===========================================
+ Coverage    94.72%   94.77%   +0.05%     
===========================================
  Files           75       75              
  Lines         5197     5248      +51     
===========================================
+ Hits          4923     4974      +51     
  Misses         274      274              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@dalonsoa dalonsoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced about all the code duplication this brings to the table. The same can probably be said about other parts of the code where the same operations are done on a finite, hardcoded set of options.

I am going to suggest an alternative, and we can discuss if it's unclear or feels cumbersome.

  1. Define an enumeration that contains fungi and bacteria:
from enum import Enum

class Microbium(str, Enum):  # You can get a better name for this
    FUNGI = "fungi"
    BACTERIA = "bacteria"
  1. Define all the relevant constants as dict[Microbium, float] and assign them as, eg.:
half_sat_pom_decomposition: dict[Microbium, float] = {}
half_sat_pom_decomposition[Microbium.BACTERIA]: float = 70.0
half_sat_pom_decomposition[Microbium.FUNGI]: float = 35.0
  1. For variables in the Data object, it is trickier, to be honest. One option will be to add another axes containing the Microbium data, but that's a lot of work and I'm not sure how good the support is for non-spacial axes. So I suggest to just have independent variables and construct the variable names on the fly as needed, which typically will be in loops. Eg.
for m in Microbium:
    data[f"some_variable_{m.value}"] = complex_calculation(
        ...
        data[f"other_variable_{m.value}"]
     )

As I say, this might have some rough edges and aspects we need to look in detail, but I feel it will make the code cleaner in the long run and might be applicable to other use cases.

@jacobcook1995
Copy link
Collaborator Author

jacobcook1995 commented Mar 5, 2025

@dalonsoa, I hadn't come across Enums before. I think that approach seems sensible, I'll give implementing it a go and let you know if I run into any issues implementing it.

@jacobcook1995
Copy link
Collaborator Author

jacobcook1995 commented Mar 5, 2025

I guess the issue I'm running into is how this would work with the existing constants system?

ruff doesn't like half_sat_pom_decomposition: dict[Microbium, float] = {} because {} is a mutable default value within a frozen dataclass.

Also how would non-default constant values be provided? At the moment, an alternative value for e.g. half_sat_pom_decomposition_bacteria would be provided in the set of toml configs files like so

[soil.SoilConsts]
half_sat_pom_decomposition_bacteria = 123.4

I guess the new system would work as follows

[soil.SoilConsts]
half_sat_pom_decomposition = {"bacteria": 123.4, "fungi": 56.7}

The only issue I can see there is that both values would always have to be provided when one of the defaults is being changed, which is more complex, but I suppose not the end of the world.

@davidorme
Copy link
Collaborator

3. For variables in the Data object, it is trickier, to be honest. One option will be to add another axes containing the Microbium data, but that's a lot of work and I'm not sure how good the support is for non-spacial axes. So I suggest to just have independent variables and construct the variable names on the fly as needed, which typically will be in loops. Eg.

I think there is a problem there with the variables system - those would also need to be defined dynamically?

@davidorme
Copy link
Collaborator

Isn't this basically the same issue that lead to the microbial_groups structure over in #758. We have this confusing difference between the config and constants, but it feels like this would be much easier to handle if these value are simply piggybacking that new functionality for providing separate fungal and bacterial parameterisation. That is already encapsulated in classes neatly and we don't need to use enum lookups.

Couldn't the parameters just go in here:

@dataclass(frozen=True)
class MicrobialGroupConstants:
"""Container for the set of constants associated with a microbial functional group.
This sets out the constants which must be defined for each microbial functional
group.
"""

@davidorme
Copy link
Collaborator

I guess I'd also try and reduce duplication by allowing multiple instances of classes rather than duplicating names within them. So that for example:

@dataclass
class EnzymePoolChanges:
    """Changes to the different enzyme pools due to production and denaturation."""

    net_change_pom: NDArray[np.float32]
    """Net change in the produced enzyme pool that breaks down :term:`POM`.
    
    Units of [kg C m^-3 day^-1]
    """

    net_change_maom: NDArray[np.float32]
    """Net change in the produced enzyme pool that breaks down :term:`MAOM`.
    
    Units of [kg C m^-3 day^-1]
    """

    denaturation: NDArray[np.float32]
    """Total denaturation rate for all the enzyme produced by bacteria.
    
    Units of [kg C m^-3 day^-1]
    """

But then you might have a set of EnzymePoolChanges keyed by group. You would then need separate classes or functions to do the integrative calculations across sets of classes, but then this is expandable in a much cleaner way.

@jacobcook1995
Copy link
Collaborator Author

jacobcook1995 commented Mar 5, 2025

Isn't this basically the same issue that lead to the microbial_groups structure over in #758. We have this confusing difference between the config and constants, but it feels like this would be much easier to handle if these value are simply piggybacking that new functionality for providing separate fungal and bacterial parameterisation. That is already encapsulated in classes neatly and we don't need to use enum lookups.

Couldn't the parameters just go in here:

@dataclass(frozen=True)
class MicrobialGroupConstants:
"""Container for the set of constants associated with a microbial functional group.
This sets out the constants which must be defined for each microbial functional
group.
"""

At the moment they could go in there. The issue is that down the road the microbial groups will get further differentiated, and adding an extra functional group shouldn't necessarily imply that extra enzymes need to be added (e.g. I really don't see a value in us trying to distinguish enzymes produced by saprotrophic fungi from those produced by mycorrhizal fungi).

I suppose we could make another data class for the different enzymes, and then add something to the MicrobialGroupConstants class to capture which (if any) enzymes each functional group produces. Do you reckon that would that resolve the repetitive nature of some of the code?

@davidorme
Copy link
Collaborator

I guess it seems sensible to me to have parameters for the different microbial groups come in via the same mechanism.

It does get complicated if you don't need the full "table" of parameters for every group. You'd have to make attributes that aren't required in all classes optional.

There is then a question about enforcement. With a dataclass, we could add a __post_init__ method that checks the required values are present? Or alternatively we could make a base class and then have the different groups extend each dataclass with the parameters specifically needed for that dataclass. That feels a little heavy handed.

Would you ever have a situation where a parameter must be identical between two groups? It could always be configured as identical but would we ever need to enforce that?

@davidorme
Copy link
Collaborator

This is a sketch - I kind of feel like subclasses would be more canonical though!

from dataclasses import dataclass
from typing import Optional, ClassVar


@dataclass(frozen=True)
class MFT:
    name: str
    a: float
    b: Optional[float] = None

    required: ClassVar[dict] = {
        "aaa": ["a", "b"],
        "bbb": ["a"],
    }

    def __post_init__(self):
        if self.name not in self.required:
            raise ValueError(f"Unknown name value: {self.name}")

        not_populated = [
            attr for attr in self.required[self.name] if getattr(self, attr) is None
        ]

        if not_populated:
            raise ValueError(
                f"The following attributes cannot be undefined for {self.name}: "
                f"{', '.join(not_populated)}"
            )


MFT(name="aab", a=1, b=1)  # Fails - unknown name
MFT(name="aaa", a=1, b=1)  # Good
MFT(name="aaa", a=1)  # Fails, missing b
MFT(name="bbb", a=1)  # Good

@jacobcook1995
Copy link
Collaborator Author

I guess it seems sensible to me to have parameters for the different microbial groups come in via the same mechanism.

I guess the problem is that some of the enzymes will be shared between functional groups, e.g. I want to differentiate enzymes based on substrate and whether they were produced by a prokaryote or a eukaryote. Parameterising differences between enzymes based on the specific class of fungi that produced them is to my best knowledge nearly impossible, so I don't really want to introduce it as an option

@davidorme
Copy link
Collaborator

That's what I feared. You've basically got a tree of parameter definitions, some of which are defined at the tips of groups and some of which are named at internal nodes for all children of that branch. I guess the question then is how much is it worth trying to capture that structure within the code as opposed to just defining a load of parameters and then having overlapping code.

@dalonsoa
Copy link
Collaborator

dalonsoa commented Mar 7, 2025

An interesting problem the one you have here...

I think, based on what I understand of the problem, subclassing might be the right way to go as it allows, essentially, an infinitely deep tree of parameter definitions with branching and overriding as needed, and with all the children having access to the parameters of the parents. Not sure if dataclasses would be the way to go, though, as you might need a more complex logic to handle arguments with default values.

@jacobcook1995 jacobcook1995 requested a review from dalonsoa March 10, 2025 10:53
@jacobcook1995
Copy link
Collaborator Author

jacobcook1995 commented Mar 10, 2025

@dalonsoa @davidorme, I've tried to address your comments by implementing a new EnzymeConstants data class. This stores the constants related to each enzyme. At the moment it doesn't interact with the MicrobialGroupConstants dataclass in anyway, but in future I plan to link them by each MicrobialGroupConstants instance having a list of enzymes it produces (and this logic can be validated). I think that makes sense for a future PR though.

I've also tried to refactor the two worst offenders on the "dense hardcoded logic front" which were calculate_enzyme_mediated_rates and calculate_enzyme_changes. There are other areas that can be improved but I think it makes sense to leave them for now as this is code that I'll have to refactor anyway for subsequent PRs.

Anyway let me know if you think that my approach seems sensible or not!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bacteria and fungi should produce different types of enzyme
4 participants