-
Notifications
You must be signed in to change notification settings - Fork 4
Enriching PubMed data with current Almanack development #359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Enriching PubMed data with current Almanack development #359
Conversation
d33bs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job @willdavidson05 ! I left a few comments up for your discretion but overall felt this was looking good.
Once you settle on the code within the data module consider adding tests for the new functions to ensure coverage is retained. Additionally, it looked like there were a few linting checks that still need to be addressed.
src/almanack/metrics/data.py
Outdated
| def _table_to_wide(table_rows: list[dict]) -> Dict[str, Any]: | ||
| """ | ||
| Transpose Almanack table (name->result), compute checks summary, flatten nested. | ||
| File-level entropy is completely ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we ignore it? Consider documenting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured that file-level-entropy is too granular for the scope of this analysis. It severely increased the run time as well when I did a pilot run. Would you suggest adding it, or do you think its fine to avoid it for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification here! If it's a performance hinderance, consider documenting why we discard it. If you have an understanding of what makes it slower to implement, consider including that as well.
src/almanack/metrics/data.py
Outdated
| # Dynamic sustainability checks: bool + positive correlation | ||
| mask = (df["result-type"] == "bool") & (df["sustainability_correlation"] == 1) | ||
| checks_total = int(mask.sum()) | ||
| checks_passed = int((df.loc[mask, "result"] == True).sum()) # noqa: E712 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you find that you had to use the E712 ignore here? If possible, consider using is instead of ==.
Description
This PR is using the same PubMed Data utilized in the Software Entropy analysis, but applying it to the current Almanack metrics and checks
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that this pull request is ready for review.