create DDD table scraped from WHO website #393
base: main
Conversation
Pull Request Overview
This PR introduces a pipeline to scrape Defined Daily Dose (DDD) data from the WHO ATC/DDD website, load it into your data lake, and expose it via dbt models.
- Adds a new dbt source and staging model to pull raw DDD records and link them to RxCUIs
- Builds an intermediate dbt model mapping clinical products to the correct DDD based on dose form
- Creates Airflow tasks and a DAG to orchestrate scraping, JSON export, Postgres load, and subsequent dbt transformations
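For orientation, here is a minimal sketch of how an extract → load → transform pipeline like the one described above can be wired with Airflow's TaskFlow API. Everything in it (task names, file paths, the dbt selector, and the use of a BashOperator for the dbt step) is an assumption for illustration, not the actual code in this PR.

```python
import pendulum
from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator


@dag(
    dag_id="atcddd",
    schedule_interval="0 3 15 * *",  # 15th of each month at 3 AM
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
)
def atcddd():
    @task
    def extract_ddd() -> str:
        # Scrape the WHO ATC/DDD index and write the records to a JSON file,
        # returning the path for the load step. (Illustrative only.)
        return "/opt/airflow/data/atcddd/ddd.json"

    @task
    def load_ddd(file_path: str) -> None:
        # Copy the JSON export into the Postgres landing table behind the
        # atc_ddd dbt source. (Illustrative only.)
        pass

    # Rebuild the downstream dbt models once the landing table is refreshed.
    transform = BashOperator(
        task_id="transform_ddd",
        bash_command="dbt run --select stg_atcddd__ingredient_to_ddd+",
    )

    load_ddd(extract_ddd()) >> transform


atcddd()
```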
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| dbt/sagerx/models/staging/atcddd/_atcddd__sources.yml | Defines the atc_ddd source pointing to the landing table |
| dbt/sagerx/models/staging/atcddd/stg_atcddd__ingredient_to_ddd.sql | New staging model selecting DDD data and joining to RxNorm names |
| dbt/sagerx/models/staging/atcddd/_atcddd__models.yml | Schema file for the staging model columns |
| dbt/sagerx/models/intermediate/atcddd/int_atcddd_clinical_products_to_ddd.sql | Intermediate model joining clinical products to DDD with dose form |
| dbt/sagerx/models/intermediate/atcddd/_int_atcddd__models.yml | Schema file for the intermediate model columns |
| airflow/dags/atcddd/dag_tasks.py | Scraper class and Airflow tasks to fetch, store, and load DDD data |
| airflow/dags/atcddd/dag.py | DAG definition orchestrating extract → load → transform |
Comments suppressed due to low confidence (5)
dbt/sagerx/models/staging/atcddd/_atcddd__models.yml:4
- [nitpick] Add descriptions for this model and its columns to improve documentation and clarity for downstream consumers.
- name: stg_atcddd__ingredient_to_ddd
dbt/sagerx/models/intermediate/atcddd/_int_atcddd__models.yml:4
- [nitpick] Provide column descriptions for this intermediate model to ensure its purpose and data types are well documented.
- name: int_atcddd_clinical_products_to_ddd
dbt/sagerx/models/staging/atcddd/stg_atcddd__ingredient_to_ddd.sql:1
- Consider adding dbt tests (e.g. `not_null` and `unique`) for critical columns like `rxcui`, `atc_code`, and `ddd` to ensure data quality.
select
airflow/dags/atcddd/dag_tasks.py:65
- The indentation here is inconsistent with the surrounding block and may cause a Python syntax error; align it to match the `elif` level above.
content_p = content_paragraphs[0]
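For context, the alignment the comment asks for looks roughly like the sketch below; the conditions, sample values, and other branches are assumed, only the indentation pattern is the point.

```python
# Hypothetical shape of the block around dag_tasks.py:65; conditions and data are assumed.
content_paragraphs = ["1.1 g O (oral)", "3 g P (parenteral)"]

if len(content_paragraphs) == 0:
    content_p = None
elif len(content_paragraphs) == 1:
    # The assignment belongs one level deeper than the elif it is part of,
    # at the same depth as the rest of this branch's statements.
    content_p = content_paragraphs[0]
else:
    content_p = " ".join(content_paragraphs)
```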
dbt/sagerx/models/intermediate/atcddd/int_atcddd_clinical_products_to_ddd.sql:87
- [nitpick] Route codes are mixed-case elsewhere; consider standardizing the casing for `'implant'` (e.g. `'I'` or uppercase) to prevent join mismatches.
union all select 'implant', 'Drug Implant'
- name: clinical_product_name
- name: clinical_product_tty
- name: clinical_product_component_rxcui
- name: clinical_product_compnent_name
There is a typo in the column name; it should be `clinical_product_component_name` to match the actual field and avoid schema mismatches.
Suggested change:
-  - name: clinical_product_compnent_name
+  - name: clinical_product_component_name
@dag(
    dag_id=dag_id,
    schedule_interval="0 3 15 * *",  # Runs on the 15th of each month at 3 AM
    start_date=pendulum.today('UTC').add(days=-1),
Use a fixed `start_date` rather than a dynamic `pendulum.today()` to ensure DAG scheduling is stable and repeatable.
Suggested change:
-     start_date=pendulum.today('UTC').add(days=-1),
+     start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
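The reasoning behind the suggestion: `pendulum.today()` is re-evaluated every time the DAG file is parsed, so a start_date derived from it keeps drifting, while `pendulum.datetime(...)` always denotes the same instant. A minimal illustration in plain Python, outside Airflow:

```python
import pendulum

# Re-evaluated at every DAG-file parse, so the value drifts from day to day.
dynamic_start = pendulum.today("UTC").add(days=-1)

# Always the same instant, no matter when the file is parsed.
fixed_start = pendulum.datetime(2023, 1, 1, tz="UTC")

print(dynamic_start)  # e.g. yesterday at midnight UTC, changes over time
print(fixed_start)    # 2023-01-01T00:00:00+00:00, stable
```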
print(f"Extraction Completed! Data saved to file: {file_path_str}") | ||
print(f"Total records scraped: {len(results)}") |
Replace `print` statements with `logger.info` to maintain consistent logging and avoid mixing stdout with log records.
Suggested change:
- print(f"Extraction Completed! Data saved to file: {file_path_str}")
- print(f"Total records scraped: {len(results)}")
+ logger.info(f"Extraction Completed! Data saved to file: {file_path_str}")
+ logger.info(f"Total records scraped: {len(results)}")