
create DDD table scraped from WHO website #393


Open
wants to merge 3 commits into base: main

Conversation

@saywurdson (Collaborator) commented Jun 9, 2025

Explanation

  1. Created a table with scraped DDD data from the WHO website https://atcddd.fhi.no/atc_ddd_index/
  2. Created a staging dbt model linking the ATC codes to the ingredient-level RxCUIs
  3. Created an intermediate dbt model that contains the items from the int_rxnorm_clinical_products_to_ingredient_components model and the DDD table, keeping only the appropriate DDD based on the dose form (a rough sketch follows below)
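
A minimal sketch of how that dose-form-based selection could look. This is an assumption about the shape of the model, not its actual code: the column names dose_form, adm_route, route_code, and ingredient_component_rxcui, and the contents of the mapping CTE, are hypothetical; only the model names and the rxcui/atc_code/ddd columns come from this PR.

    -- Hypothetical sketch: join clinical products to DDD rows and keep only
    -- the DDD whose WHO administration route matches the product's dose form.
    with clinical_products as (
        select * from {{ ref('int_rxnorm_clinical_products_to_ingredient_components') }}
    ),

    ddd as (
        select * from {{ ref('stg_atcddd__ingredient_to_ddd') }}
    ),

    -- assumed mapping of WHO administration routes to RxNorm dose forms
    dose_form_to_route as (
                  select 'O' as route_code, 'Oral Tablet' as dose_form
        union all select 'O', 'Oral Capsule'
        union all select 'P', 'Injectable Solution'
        union all select 'implant', 'Drug Implant'
    )

    select
        clinical_products.*,
        ddd.atc_code,
        ddd.ddd,
        ddd.adm_route
    from clinical_products
    inner join ddd
        on clinical_products.ingredient_component_rxcui = ddd.rxcui
    inner join dose_form_to_route
        on dose_form_to_route.route_code = ddd.adm_route
        and dose_form_to_route.dose_form = clinical_products.dose_form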

@saywurdson saywurdson requested review from jrlegrand and Copilot June 9, 2025 06:08
@saywurdson saywurdson self-assigned this Jun 9, 2025
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a pipeline to scrape Defined Daily Dose (DDD) data from the WHO ATC/DDD website, load it into your data lake, and expose it via dbt models.

  • Adds a new dbt source and staging model to pull raw DDD records and link them to RxCUIs
  • Builds an intermediate dbt model mapping clinical products to the correct DDD based on dose form
  • Creates Airflow tasks and a DAG to orchestrate scraping, JSON export, Postgres load, and subsequent dbt transformations

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Summary per file:

  • dbt/sagerx/models/staging/atcddd/_atcddd__sources.yml: defines the atc_ddd source pointing to the landing table
  • dbt/sagerx/models/staging/atcddd/stg_atcddd__ingredient_to_ddd.sql: new staging model selecting DDD data and joining to RxNorm names
  • dbt/sagerx/models/staging/atcddd/_atcddd__models.yml: schema file for the staging model columns
  • dbt/sagerx/models/intermediate/atcddd/int_atcddd_clinical_products_to_ddd.sql: intermediate model joining clinical products to DDD with dose form
  • dbt/sagerx/models/intermediate/atcddd/_int_atcddd__models.yml: schema file for the intermediate model columns
  • airflow/dags/atcddd/dag_tasks.py: scraper class and Airflow tasks to fetch, store, and load DDD data
  • airflow/dags/atcddd/dag.py: DAG definition orchestrating extract → load → transform
Comments suppressed due to low confidence (5)

dbt/sagerx/models/staging/atcddd/_atcddd__models.yml:4

  • [nitpick] Add descriptions for this model and its columns to improve documentation and clarity for downstream consumers.
- name: stg_atcddd__ingredient_to_ddd

dbt/sagerx/models/intermediate/atcddd/_int_atcddd__models.yml:4

  • [nitpick] Provide column descriptions for this intermediate model to ensure its purpose and data types are well documented.
- name: int_atcddd_clinical_products_to_ddd

dbt/sagerx/models/staging/atcddd/stg_atcddd__ingredient_to_ddd.sql:1

  • Consider adding dbt tests (e.g. not_null and unique) for critical columns like rxcui, atc_code, and ddd to ensure data quality.
select
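
If generic yml tests aren't added right away, the same not-null checks could also be expressed as a dbt singular test. This is an illustrative sketch only; the file name and placement under tests/ are assumptions, and a singular test passes when the query returns zero rows.

    -- Hypothetical singular test, e.g. tests/assert_atcddd_ddd_not_null.sql:
    -- any returned row is a record missing a critical value.
    select rxcui, atc_code, ddd
    from {{ ref('stg_atcddd__ingredient_to_ddd') }}
    where rxcui is null
       or atc_code is null
       or ddd is null

The unique check suggested above is probably simpler as a generic test in the model's yml, since the uniqueness grain (rxcui alone vs. rxcui plus atc_code) depends on the staging model's output.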

airflow/dags/atcddd/dag_tasks.py:65

  • The indentation here is inconsistent with the surrounding block and may cause a Python syntax error; align it to match the elif level above.
                 content_p = content_paragraphs[0]

dbt/sagerx/models/intermediate/atcddd/int_atcddd_clinical_products_to_ddd.sql:87

  • [nitpick] Route codes are mixed-case elsewhere; consider standardizing the casing for 'implant' (e.g. 'I' or uppercase) to prevent join mismatches.
    union all select 'implant', 'Drug Implant'
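
As an alternative to standardizing the literal itself, a case-insensitive join would also prevent the mismatch. The column names below (adm_route, route_code) and the mapping CTE are assumptions for illustration, not the model's actual code.

    -- Hypothetical sketch: lower-case both sides of the route-code join so the
    -- lowercase 'implant' literal cannot silently drop rows against mixed-case codes.
    with dose_form_to_route as (
                  select 'O' as route_code, 'Oral Tablet' as dose_form
        union all select 'implant', 'Drug Implant'
    )

    select ddd.*, dose_form_to_route.dose_form
    from {{ ref('stg_atcddd__ingredient_to_ddd') }} as ddd
    inner join dose_form_to_route
        on lower(ddd.adm_route) = lower(dose_form_to_route.route_code)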

- name: clinical_product_name
- name: clinical_product_tty
- name: clinical_product_component_rxcui
- name: clinical_product_compnent_name

Copilot AI Jun 9, 2025

There is a typo in the column name; it should be clinical_product_component_name to match the actual field and avoid schema mismatches.

Suggested change
- name: clinical_product_compnent_name
- name: clinical_product_component_name

@dag(
dag_id=dag_id,
schedule_interval="0 3 15 * *", # Runs on the 15th of each month at 3 AM
start_date=pendulum.today('UTC').add(days=-1),

Copilot AI Jun 9, 2025

Use a fixed start_date rather than a dynamic pendulum.today() to ensure DAG scheduling is stable and repeatable.

Suggested change
start_date=pendulum.today('UTC').add(days=-1),
start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),

Comment on lines +268 to +269
print(f"Extraction Completed! Data saved to file: {file_path_str}")
print(f"Total records scraped: {len(results)}")

Copilot AI Jun 9, 2025

Replace print statements with logger.info to maintain consistent logging and avoid mixing stdout with log records.

Suggested change
print(f"Extraction Completed! Data saved to file: {file_path_str}")
print(f"Total records scraped: {len(results)}")
logger.info(f"Extraction Completed! Data saved to file: {file_path_str}")
logger.info(f"Total records scraped: {len(results)}")
