Enriching PubMed data with current Almanack development #359

willdavidson05 · 2025-09-04T22:10:45Z

Description

This PR uses the same PubMed data that was used in the Software Entropy analysis, but applies the current Almanack metrics and checks to it.

What is the nature of your change?

Content additions or updates (adds or updates content)
Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own contributions.
I have commented my content, particularly in hard-to-understand areas.
I have made corresponding changes to related documentation (outside of book content).
My changes generate no new warnings.
New and existing tests pass locally with my changes.
I have added tests that prove my additions are effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

d33bs

Nice job @willdavidson05! I left a few comments for your discretion, but overall I felt this was looking good.

Once you settle on the code within the data module, consider adding tests for the new functions to ensure coverage is retained. Additionally, it looks like a few linting checks still need to be addressed.

d33bs · 2025-09-05T21:07:45Z

src/almanack/metrics/data.py

+def _table_to_wide(table_rows: list[dict]) -> Dict[str, Any]:
+    """
+    Transpose Almanack table (name->result), compute checks summary, flatten nested.
+    File-level entropy is completely ignored.


Why do we ignore it? Consider documenting.

I figured that file-level entropy is too granular for the scope of this analysis. It also severely increased the run time when I did a pilot run. Would you suggest adding it, or do you think it's fine to leave it out for now?

Thanks for the clarification here! If it's a performance hindrance, consider documenting why we discard it. If you have an understanding of what makes it slower to compute, consider including that as well.
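One way the exclusion could be documented is directly in the docstring, next to the code that skips the metric. The following is a minimal sketch, not the PR's actual implementation: the function name `table_to_wide_sketch`, the metric name in `FILE_LEVEL_ENTROPY_METRICS`, and the `metric.key` column naming are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Assumption: names of metrics that carry per-file entropy; the real names may differ.
FILE_LEVEL_ENTROPY_METRICS = {"repo-file-info-entropy"}


def table_to_wide_sketch(table_rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Transpose Almanack rows (name -> result) into one flat record.

    File-level entropy is skipped on purpose: it is too granular for a
    repository-level analysis and it severely increased run time in a
    pilot run.
    """
    wide: Dict[str, Any] = {}
    for row in table_rows:
        name = row["name"]
        if name in FILE_LEVEL_ENTROPY_METRICS:
            continue  # skipped for scope and run-time reasons (see docstring)
        result = row.get("result")
        if isinstance(result, dict):
            # Flatten one level of nesting into "metric.key" columns.
            for key, value in result.items():
                wide[f"{name}.{key}"] = value
        else:
            wide[name] = result
    return wide
```

Keeping the skipped names in a module-level constant would make the exclusion easy to find and to revisit if file-level entropy is added later.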

d33bs · 2025-09-05T21:46:32Z

src/almanack/metrics/data.py

+    # Dynamic sustainability checks: bool + positive correlation
+    mask = (df["result-type"] == "bool") & (df["sustainability_correlation"] == 1)
+    checks_total  = int(mask.sum())
+    checks_passed = int((df.loc[mask, "result"] == True).sum())  # noqa: E712


Did you find that you had to use the E712 ignore here? If possible, consider using `is` instead of `==`.
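A note on the suggestion above: `is True` cannot be applied to a whole pandas Series, so the identity test has to run per element. The sketch below shows one way to drop the `noqa` under that constraint. The sample data is made up for illustration; only the column names and the mask come from the diff.

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the diff above.
df = pd.DataFrame(
    {
        "result-type": ["bool", "bool", "int", "bool"],
        "sustainability_correlation": [1, 1, 1, -1],
        "result": [True, False, 5, True],
    }
)

# Dynamic sustainability checks: bool + positive correlation
mask = (df["result-type"] == "bool") & (df["sustainability_correlation"] == 1)
checks_total = int(mask.sum())

# Identity test per element; no `== True` comparison, so no E712 ignore is needed.
checks_passed = int(df.loc[mask, "result"].map(lambda value: value is True).sum())
```

Unlike `== True`, the identity test does not count an integer `1` as passing. If that distinction does not matter for the data, `df.loc[mask, "result"].eq(True)` is another way to avoid the lint warning.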

src/book/seed-bank/pubmed-github-repositories/almanack_checks.ipynb

willdavidson05 added 3 commits September 3, 2025 12:30

adding GITHUB_TOKEN to api data query

d5e7a3b

Processing Almanack checks on repo

4e4769b

test run with parallelization

07c2e74

willdavidson05 changed the title ~~Enriching Almanack PubMed data with current Almanack development~~ Enriching PubMed data with current Almanack development Sep 5, 2025

d33bs self-requested a review September 5, 2025 21:05

d33bs reviewed Sep 5, 2025

willdavidson05 added 3 commits September 23, 2025 20:29

inital review

366424a

removing redundant fraction metric and adding pilot run

3ce9862

adding documentation

938d9cc

willdavidson05 Sep 24, 2025

d33bs Sep 27, 2025
