
create DDD table scraped from WHO website #393


Open
wants to merge 3 commits into base: main

Conversation

@saywurdson (Collaborator) commented Jun 9, 2025

Explanation

  1. Created a table with scraped DDD data from the WHO website https://atcddd.fhi.no/atc_ddd_index/
  2. Created a staging dbt model linking the ATC codes to the ingredient-level RxCUIs
  3. Created an intermediate dbt model that contains the items from the int_rxnorm_clinical_products_to_ingredient_components model and the DDD table, keeping only the appropriate DDD based on the dose form (a rough sketch follows below)
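
A minimal sketch of how that dose-form-based selection could look. This is an assumption about the shape of the model, not its actual code: the column names dose_form, adm_route, route_code, and ingredient_component_rxcui, and the contents of the mapping CTE, are hypothetical; only the model names and the rxcui/atc_code/ddd columns come from this PR.

    -- Hypothetical sketch: join clinical products to DDD rows and keep only
    -- the DDD whose WHO administration route matches the product's dose form.
    with clinical_products as (
        select * from {{ ref('int_rxnorm_clinical_products_to_ingredient_components') }}
    ),

    ddd as (
        select * from {{ ref('stg_atcddd__ingredient_to_ddd') }}
    ),

    -- assumed mapping of WHO administration routes to RxNorm dose forms
    dose_form_to_route as (
                  select 'O' as route_code, 'Oral Tablet' as dose_form
        union all select 'O', 'Oral Capsule'
        union all select 'P', 'Injectable Solution'
        union all select 'implant', 'Drug Implant'
    )

    select
        clinical_products.*,
        ddd.atc_code,
        ddd.ddd,
        ddd.adm_route
    from clinical_products
    inner join ddd
        on clinical_products.ingredient_component_rxcui = ddd.rxcui
    inner join dose_form_to_route
        on dose_form_to_route.route_code = ddd.adm_route
        and dose_form_to_route.dose_form = clinical_products.dose_form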

@saywurdson saywurdson requested review from jrlegrand and Copilot June 9, 2025 06:08
@saywurdson saywurdson self-assigned this Jun 9, 2025
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a pipeline to scrape Defined Daily Dose (DDD) data from the WHO ATC/DDD website, load it into your data lake, and expose it via dbt models.

  • Adds a new dbt source and staging model to pull raw DDD records and link them to RxCUIs
  • Builds an intermediate dbt model mapping clinical products to the correct DDD based on dose form
  • Creates Airflow tasks and a DAG to orchestrate scraping, JSON export, Postgres load, and subsequent dbt transformations

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Summary per file:

  • dbt/sagerx/models/staging/atcddd/_atcddd__sources.yml: defines the atc_ddd source pointing to the landing table
  • dbt/sagerx/models/staging/atcddd/stg_atcddd__ingredient_to_ddd.sql: new staging model selecting DDD data and joining to RxNorm names
  • dbt/sagerx/models/staging/atcddd/_atcddd__models.yml: schema file for the staging model columns
  • dbt/sagerx/models/intermediate/atcddd/int_atcddd_clinical_products_to_ddd.sql: intermediate model joining clinical products to DDD with dose form
  • dbt/sagerx/models/intermediate/atcddd/_int_atcddd__models.yml: schema file for the intermediate model columns
  • airflow/dags/atcddd/dag_tasks.py: scraper class and Airflow tasks to fetch, store, and load DDD data
  • airflow/dags/atcddd/dag.py: DAG definition orchestrating extract → load → transform
Comments suppressed due to low confidence (5)

dbt/sagerx/models/staging/atcddd/_atcddd__models.yml:4

  • [nitpick] Add descriptions for this model and its columns to improve documentation and clarity for downstream consumers.
- name: stg_atcddd__ingredient_to_ddd

dbt/sagerx/models/intermediate/atcddd/_int_atcddd__models.yml:4

  • [nitpick] Provide column descriptions for this intermediate model to ensure its purpose and data types are well documented.
- name: int_atcddd_clinical_products_to_ddd

dbt/sagerx/models/staging/atcddd/stg_atcddd__ingredient_to_ddd.sql:1

  • Consider adding dbt tests (e.g. not_null and unique) for critical columns like rxcui, atc_code, and ddd to ensure data quality.
select
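
If generic yml tests aren't added right away, the same not-null checks could also be expressed as a dbt singular test. This is an illustrative sketch only; the file name and placement under tests/ are assumptions, and a singular test passes when the query returns zero rows.

    -- Hypothetical singular test, e.g. tests/assert_atcddd_ddd_not_null.sql:
    -- any returned row is a record missing a critical value.
    select rxcui, atc_code, ddd
    from {{ ref('stg_atcddd__ingredient_to_ddd') }}
    where rxcui is null
       or atc_code is null
       or ddd is null

The unique check suggested above is probably simpler as a generic test in the model's yml, since the uniqueness grain (rxcui alone vs. rxcui plus atc_code) depends on the staging model's output.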

airflow/dags/atcddd/dag_tasks.py:65

  • The indentation here is inconsistent with the surrounding block and may cause a Python syntax error; align it to match the elif level above.
                 content_p = content_paragraphs[0]

dbt/sagerx/models/intermediate/atcddd/int_atcddd_clinical_products_to_ddd.sql:87

  • [nitpick] Route codes are mixed-case elsewhere; consider standardizing the casing for 'implant' (e.g. 'I' or uppercase) to prevent join mismatches.
    union all select 'implant', 'Drug Implant'
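
As an alternative to standardizing the literal itself, a case-insensitive join would also prevent the mismatch. The column names below (adm_route, route_code) and the mapping CTE are assumptions for illustration, not the model's actual code.

    -- Hypothetical sketch: lower-case both sides of the route-code join so the
    -- lowercase 'implant' literal cannot silently drop rows against mixed-case codes.
    with dose_form_to_route as (
                  select 'O' as route_code, 'Oral Tablet' as dose_form
        union all select 'implant', 'Drug Implant'
    )

    select ddd.*, dose_form_to_route.dose_form
    from {{ ref('stg_atcddd__ingredient_to_ddd') }} as ddd
    inner join dose_form_to_route
        on lower(ddd.adm_route) = lower(dose_form_to_route.route_code)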

- name: clinical_product_name
- name: clinical_product_tty
- name: clinical_product_component_rxcui
- name: clinical_product_compnent_name

Copilot AI Jun 9, 2025

There is a typo in the column name; it should be clinical_product_component_name to match the actual field and avoid schema mismatches.

Suggested change
- name: clinical_product_compnent_name
- name: clinical_product_component_name

@dag(
dag_id=dag_id,
schedule_interval="0 3 15 * *", # Runs on the 15th of each month at 3 AM
start_date=pendulum.today('UTC').add(days=-1),

Copilot AI Jun 9, 2025

Use a fixed start_date rather than a dynamic pendulum.today() to ensure DAG scheduling is stable and repeatable.

Suggested change
start_date=pendulum.today('UTC').add(days=-1),
start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),

Comment on lines +268 to +269
print(f"Extraction Completed! Data saved to file: {file_path_str}")
print(f"Total records scraped: {len(results)}")

Copilot AI Jun 9, 2025

Replace print statements with logger.info to maintain consistent logging and avoid mixing stdout with log records.

Suggested change
print(f"Extraction Completed! Data saved to file: {file_path_str}")
print(f"Total records scraped: {len(results)}")
logger.info(f"Extraction Completed! Data saved to file: {file_path_str}")
logger.info(f"Total records scraped: {len(results)}")
