Conversation

@willdavidson05
Member

Description

This PR uses the same PubMed data utilized in the Software Entropy analysis, but applies it to the current Almanack metrics and checks.

What is the nature of your change?

  • Content additions or updates (adds or updates content).
  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own contributions.
  • I have commented my content, particularly in hard-to-understand areas.
  • I have made corresponding changes to related documentation (outside of book content).
  • My changes generate no new warnings.
  • New and existing tests pass locally with my changes.
  • I have added tests that prove my additions are effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@willdavidson05 willdavidson05 changed the title Enriching Almanack PubMed data with current Almanack development Enriching PubMed data with current Almanack development Sep 5, 2025
@d33bs d33bs self-requested a review September 5, 2025 21:05
Member

@d33bs d33bs left a comment


Nice job @willdavidson05! I left a few comments for your discretion, but overall I felt this was looking good.

Once you settle on the code within the data module, consider adding tests for the new functions to ensure coverage is retained. Additionally, it looked like there were a few linting checks that still needed to be addressed.

def _table_to_wide(table_rows: list[dict]) -> Dict[str, Any]:
"""
Transpose Almanack table (name->result), compute checks summary, flatten nested.
File-level entropy is completely ignored.
Member


Why do we ignore it? Consider documenting.

Member Author


I figured that file-level entropy is too granular for the scope of this analysis. It also severely increased the run time when I did a pilot run. Would you suggest adding it, or do you think it's fine to avoid it for now?

Member


Thanks for the clarification here! If it's a performance hindrance, consider documenting why we discard it. If you have an understanding of what makes it slow, consider including that as well.

# Dynamic sustainability checks: bool + positive correlation
mask = (df["result-type"] == "bool") & (df["sustainability_correlation"] == 1)
checks_total = int(mask.sum())
checks_passed = int((df.loc[mask, "result"] == True).sum()) # noqa: E712
Member


Did you find that you had to use the E712 ignore here? If possible, consider using is instead of ==.
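As a hedged illustration of the trade-off here: `is True` compares object identity, so it cannot replace an elementwise comparison on a pandas Series, but `Series.eq(True)` expresses the same check without tripping E712. A minimal sketch on an invented DataFrame (the column values are illustrative, not real Almanack data):

```python
import pandas as pd

# Toy frame mirroring the quoted snippet's columns; values are invented.
df = pd.DataFrame(
    {
        "result-type": ["bool", "bool", "int"],
        "sustainability_correlation": [1, 1, 1],
        "result": [True, False, 7],
    }
)

# Same mask as the quoted code: boolean metrics with positive correlation.
mask = (df["result-type"] == "bool") & (df["sustainability_correlation"] == 1)
checks_total = int(mask.sum())  # 2 rows match in this toy frame

# `is True` checks object identity and would not work elementwise here;
# `.eq(True)` performs the elementwise comparison without the E712 flag.
checks_passed = int(df.loc[mask, "result"].eq(True).sum())  # 1 in this frame
```

Whether `.eq(True)` reads better than the `# noqa: E712` suppression is a style call for the project.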
