Separate enzymes by whether they are produced by bacteria or fungi #766

jacobcook1995 · 2025-03-05T10:03:40Z

Description

This PR splits the enzyme pools (which were previously only split by substrate) based on whether they are produced by bacteria or fungi. The reason for making this split is that fungi (all eukaryotes really) produce substantively more complex enzyme than bacteria, and we want to capture this in someway. I've done this in a fairly straightforward manner by just defining two new enzyme pools with new constants to represent that these fungal enzymes are (generally) more effective.

As always feedback on code style/readablity etc would be appreciated

Fixes #761

Type of change

New feature (non-breaking change which adds functionality)
Optimization (back-end change that speeds up the code)
Bug fix (non-breaking change which fixes an issue)

Key checklist

Make sure you've run the pre-commit checks: $ pre-commit run -a
All tests pass: $ poetry run pytest

Further checks

Code is commented, particularly in hard-to-understand areas
Tests added that prove fix is effective or that feature works
Relevant documentation reviewed and updated

…enzymes)

…AOM breakdown rates

…ferent-types-of-enzyme

codecov-commenter · 2025-03-05T10:10:02Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.77%. Comparing base (52f15b8) to head (fc4c04a).
Report is 6 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #766      +/-   ##
===========================================
+ Coverage    94.72%   94.77%   +0.05%     
===========================================
  Files           75       75              
  Lines         5197     5248      +51     
===========================================
+ Hits          4923     4974      +51     
  Misses         274      274

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

dalonsoa

I'm not convinced about all the code duplication this brings to the table. The same can probably be said about other parts of the code where the same operations are done on a finite, hardcoded set of options.

I am going to suggest an alternative, and we can discuss if it's unclear or feels cumbersome.

Define an enumeration that contains fungi and bacteria:

from enum import Enum

class Microbium(str, Enum):  # You can get a better name for this
    FUNGI = "fungi"
    BACTERIA = "bacteria"

Define all the relevant constants as dict[Microbium, float] and assign them as, eg.:

half_sat_pom_decomposition: dict[Microbium, float] = {}
half_sat_pom_decomposition[Microbium.BACTERIA]: float = 70.0
half_sat_pom_decomposition[Microbium.FUNGI]: float = 35.0

For variables in the Data object, it is trickier, to be honest. One option will be to add another axes containing the Microbium data, but that's a lot of work and I'm not sure how good the support is for non-spacial axes. So I suggest to just have independent variables and construct the variable names on the fly as needed, which typically will be in loops. Eg.

for m in Microbium:
    data[f"some_variable_{m.value}"] = complex_calculation(
        ...
        data[f"other_variable_{m.value}"]
     )

As I say, this might have some rough edges and aspects we need to look in detail, but I feel it will make the code cleaner in the long run and might be applicable to other use cases.

jacobcook1995 · 2025-03-05T13:25:10Z

@dalonsoa, I hadn't come across Enums before. I think that approach seems sensible, I'll give implementing it a go and let you know if I run into any issues implementing it.

jacobcook1995 · 2025-03-05T13:56:11Z

I guess the issue I'm running into is how this would work with the existing constants system?

ruff doesn't like half_sat_pom_decomposition: dict[Microbium, float] = {} because {} is a mutable default value within a frozen dataclass.

Also how would non-default constant values be provided? At the moment, an alternative value for e.g. half_sat_pom_decomposition_bacteria would be provided in the set of toml configs files like so

[soil.SoilConsts]
half_sat_pom_decomposition_bacteria = 123.4

I guess the new system would work as follows

[soil.SoilConsts]
half_sat_pom_decomposition = {"bacteria": 123.4, "fungi": 56.7}

The only issue I can see there is that both values would always have to be provided when one of the defaults is being changed, which is more complex, but I suppose not the end of the world.

davidorme · 2025-03-05T14:27:21Z

3. For variables in the Data object, it is trickier, to be honest. One option will be to add another axes containing the Microbium data, but that's a lot of work and I'm not sure how good the support is for non-spacial axes. So I suggest to just have independent variables and construct the variable names on the fly as needed, which typically will be in loops. Eg.

I think there is a problem there with the variables system - those would also need to be defined dynamically?

davidorme · 2025-03-05T14:34:24Z

Isn't this basically the same issue that lead to the microbial_groups structure over in #758. We have this confusing difference between the config and constants, but it feels like this would be much easier to handle if these value are simply piggybacking that new functionality for providing separate fungal and bacterial parameterisation. That is already encapsulated in classes neatly and we don't need to use enum lookups.

Couldn't the parameters just go in here:

virtual_ecosystem/virtual_ecosystem/models/soil/microbial_groups.py

Lines 11 to 17 in 303661e

    
           @dataclass(frozen=True) 
        
           class MicrobialGroupConstants: 
        
               """Container for the set of constants associated with a microbial functional group. 
        
               This sets out the constants which must be defined for each microbial functional 
        
               group. 
        
               """

davidorme · 2025-03-05T14:40:36Z

I guess I'd also try and reduce duplication by allowing multiple instances of classes rather than duplicating names within them. So that for example:

@dataclass
class EnzymePoolChanges:
    """Changes to the different enzyme pools due to production and denaturation."""

    net_change_pom: NDArray[np.float32]
    """Net change in the produced enzyme pool that breaks down :term:`POM`.
    
    Units of [kg C m^-3 day^-1]
    """

    net_change_maom: NDArray[np.float32]
    """Net change in the produced enzyme pool that breaks down :term:`MAOM`.
    
    Units of [kg C m^-3 day^-1]
    """

    denaturation: NDArray[np.float32]
    """Total denaturation rate for all the enzyme produced by bacteria.
    
    Units of [kg C m^-3 day^-1]
    """

But then you might have a set of EnzymePoolChanges keyed by group. You would then need separate classes or functions to do the integrative calculations across sets of classes, but then this is expandable in a much cleaner way.

jacobcook1995 · 2025-03-05T14:44:58Z

Isn't this basically the same issue that lead to the microbial_groups structure over in #758. We have this confusing difference between the config and constants, but it feels like this would be much easier to handle if these value are simply piggybacking that new functionality for providing separate fungal and bacterial parameterisation. That is already encapsulated in classes neatly and we don't need to use enum lookups.

Couldn't the parameters just go in here:

virtual_ecosystem/virtual_ecosystem/models/soil/microbial_groups.py

Lines 11 to 17 in 303661e

@dataclass(frozen=True)

class MicrobialGroupConstants:

"""Container for the set of constants associated with a microbial functional group.

This sets out the constants which must be defined for each microbial functional

group.

"""

At the moment they could go in there. The issue is that down the road the microbial groups will get further differentiated, and adding an extra functional group shouldn't necessarily imply that extra enzymes need to be added (e.g. I really don't see a value in us trying to distinguish enzymes produced by saprotrophic fungi from those produced by mycorrhizal fungi).

I suppose we could make another data class for the different enzymes, and then add something to the MicrobialGroupConstants class to capture which (if any) enzymes each functional group produces. Do you reckon that would that resolve the repetitive nature of some of the code?

davidorme · 2025-03-06T14:21:25Z

I guess it seems sensible to me to have parameters for the different microbial groups come in via the same mechanism.

It does get complicated if you don't need the full "table" of parameters for every group. You'd have to make attributes that aren't required in all classes optional.

There is then a question about enforcement. With a dataclass, we could add a __post_init__ method that checks the required values are present? Or alternatively we could make a base class and then have the different groups extend each dataclass with the parameters specifically needed for that dataclass. That feels a little heavy handed.

Would you ever have a situation where a parameter must be identical between two groups? It could always be configured as identical but would we ever need to enforce that?

davidorme · 2025-03-06T14:36:27Z

This is a sketch - I kind of feel like subclasses would be more canonical though!

from dataclasses import dataclass
from typing import Optional, ClassVar


@dataclass(frozen=True)
class MFT:
    name: str
    a: float
    b: Optional[float] = None

    required: ClassVar[dict] = {
        "aaa": ["a", "b"],
        "bbb": ["a"],
    }

    def __post_init__(self):
        if self.name not in self.required:
            raise ValueError(f"Unknown name value: {self.name}")

        not_populated = [
            attr for attr in self.required[self.name] if getattr(self, attr) is None
        ]

        if not_populated:
            raise ValueError(
                f"The following attributes cannot be undefined for {self.name}: "
                f"{', '.join(not_populated)}"
            )


MFT(name="aab", a=1, b=1)  # Fails - unknown name
MFT(name="aaa", a=1, b=1)  # Good
MFT(name="aaa", a=1)  # Fails, missing b
MFT(name="bbb", a=1)  # Good

jacobcook1995 · 2025-03-06T14:52:41Z

I guess it seems sensible to me to have parameters for the different microbial groups come in via the same mechanism.

I guess the problem is that some of the enzymes will be shared between functional groups, e.g. I want to differentiate enzymes based on substrate and whether they were produced by a prokaryote or a eukaryote. Parameterising differences between enzymes based on the specific class of fungi that produced them is to my best knowledge nearly impossible, so I don't really want to introduce it as an option

davidorme · 2025-03-06T14:57:30Z

That's what I feared. You've basically got a tree of parameter definitions, some of which are defined at the tips of groups and some of which are named at internal nodes for all children of that branch. I guess the question then is how much is it worth trying to capture that structure within the code as opposed to just defining a load of parameters and then having overlapping code.

dalonsoa · 2025-03-07T08:49:11Z

An interesting problem the one you have here...

I think, based on what I understand of the problem, subclassing might be the right way to go as it allows, essentially, an infinitely deep tree of parameter definitions with branching and overriding as needed, and with all the children having access to the parameters of the parents. Not sure if dataclasses would be the way to go, though, as you might need a more complex logic to handle arguments with default values.

…lasses

…coded logic

…anges function

jacobcook1995 · 2025-03-10T10:59:58Z

@dalonsoa @davidorme, I've tried to address your comments by implementing a new EnzymeConstants data class. This stores the constants related to each enzyme. At the moment it doesn't interact with the MicrobialGroupConstants dataclass in anyway, but in future I plan to link them by each MicrobialGroupConstants instance having a list of enzymes it produces (and this logic can be validated). I think that makes sense for a future PR though.

I've also tried to refactor the two worst offenders on the "dense hardcoded logic front" which were calculate_enzyme_mediated_rates and calculate_enzyme_changes. There are other areas that can be improved but I think it makes sense to leave them for now as this is code that I'll have to refactor anyway for subsequent PRs.

Anyway let me know if you think that my approach seems sensible or not!

…ferent-types-of-enzyme

jacobcook1995 added 3 commits March 3, 2025 09:20

Seperated enzyme pools by the group that produces them

b71c5f6

Added fungal production of fungal enzyme class (instead of bacterial …

04bdc41

…enzymes)

Added action of fungally produced enzymes to calculation of POM and M…

c496ffe

…AOM breakdown rates

jacobcook1995 linked an issue Mar 5, 2025 that may be closed by this pull request

Bacteria and fungi should produce different types of enzyme #761

Open

jacobcook1995 requested review from davidorme, dalonsoa, sallymatson and hrlai March 5, 2025 10:04

Merge branch 'develop' into 761-bacteria-and-fungi-should-produce-dif…

d1aeb69

…ferent-types-of-enzyme

dalonsoa requested changes Mar 5, 2025

View reviewed changes

jacobcook1995 added 3 commits March 7, 2025 11:16

Made a new dataclass to store information about the specific enzyme c…

2c71c80

…lasses

Changed calculate_enzyme_mediated_rates function to rely less on hard…

519bb4d

…coded logic

Reduced the amount of hardcoded logic used in the calculate_enzyme_ch…

0390de4

…anges function

jacobcook1995 requested a review from dalonsoa March 10, 2025 10:53

Merge branch 'develop' into 761-bacteria-and-fungi-should-produce-dif…

fc4c04a

…ferent-types-of-enzyme

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Separate enzymes by whether they are produced by bacteria or fungi #766

Separate enzymes by whether they are produced by bacteria or fungi #766

jacobcook1995 commented Mar 5, 2025

codecov-commenter commented Mar 5, 2025 •

edited

Loading

dalonsoa left a comment

jacobcook1995 commented Mar 5, 2025 •

edited

Loading

jacobcook1995 commented Mar 5, 2025 •

edited

Loading

davidorme commented Mar 5, 2025

davidorme commented Mar 5, 2025

davidorme commented Mar 5, 2025

jacobcook1995 commented Mar 5, 2025 •

edited

Loading

davidorme commented Mar 6, 2025

davidorme commented Mar 6, 2025

jacobcook1995 commented Mar 6, 2025

davidorme commented Mar 6, 2025

dalonsoa commented Mar 7, 2025

jacobcook1995 commented Mar 10, 2025 •

edited

Loading

Separate enzymes by whether they are produced by bacteria or fungi #766

Are you sure you want to change the base?

Separate enzymes by whether they are produced by bacteria or fungi #766

Conversation

jacobcook1995 commented Mar 5, 2025

Description

Type of change

Key checklist

Further checks

codecov-commenter commented Mar 5, 2025 • edited Loading

Codecov Report

dalonsoa left a comment

Choose a reason for hiding this comment

jacobcook1995 commented Mar 5, 2025 • edited Loading

jacobcook1995 commented Mar 5, 2025 • edited Loading

davidorme commented Mar 5, 2025

davidorme commented Mar 5, 2025

davidorme commented Mar 5, 2025

jacobcook1995 commented Mar 5, 2025 • edited Loading

davidorme commented Mar 6, 2025

davidorme commented Mar 6, 2025

jacobcook1995 commented Mar 6, 2025

davidorme commented Mar 6, 2025

dalonsoa commented Mar 7, 2025

jacobcook1995 commented Mar 10, 2025 • edited Loading

codecov-commenter commented Mar 5, 2025 •

edited

Loading

jacobcook1995 commented Mar 5, 2025 •

edited

Loading

jacobcook1995 commented Mar 5, 2025 •

edited

Loading

jacobcook1995 commented Mar 5, 2025 •

edited

Loading

jacobcook1995 commented Mar 10, 2025 •

edited

Loading