benkulcsar/newswatch

🗞️Newswatch🗞️

Python | Code style: black | Coverage | License: MIT

newswatch.live - real-time word frequency visualisation for 🇬🇧 and 🇺🇸 news

Summary

Newswatch collects data from leading UK and US news websites' front pages and creates visualisations to track word frequency changes over time, providing insights into current news trends and topics.

Line Chart

[Figure: line chart] (raw data available in CSV)

Treemap

[Figure: treemap] (raw data available in CSV)

Historical data

Data is available from:

  • 2020-06-05 in the UK 🇬🇧 and
  • 2023-09-01 in the US 🇺🇸

Architecture

High-level architecture overview:

[Figure: architecture diagram]

Optimisation and Smoothing

To improve dashboard performance, the data tables are partitioned by date, and queries are accelerated with BigQuery BI Engine.

To reduce noise in the charts and present smoother trends, a moving average of word frequencies is calculated over the last 24 hours. This calculation runs hourly through a scheduled query.
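The smoothing itself runs as a scheduled query, which is not shown here. As an illustration only, the idea of a trailing 24-hour average can be sketched in Python. The function name `trailing_average`, the input shape (a time-sorted list of `(timestamp, frequency)` pairs for a single word) and the exact window boundaries are assumptions, not taken from the project.

```python
from collections import deque
from datetime import datetime, timedelta


def trailing_average(points, window=timedelta(hours=24)):
    """Return (timestamp, mean of values in the trailing window) per point.

    `points` is a list of (timestamp, value) pairs sorted by timestamp.
    The window for a point at time ts is the half-open interval
    (ts - window, ts], an assumption made for this sketch.
    """
    buf = deque()   # samples currently inside the window
    total = 0.0     # running sum of the values in `buf`
    out = []
    for ts, value in points:
        buf.append((ts, value))
        total += value
        # Drop samples that are at least `window` older than the current one.
        while buf[0][0] <= ts - window:
            _, old = buf.popleft()
            total -= old
        out.append((ts, total / len(buf)))
    return out


if __name__ == "__main__":
    start = datetime(2023, 9, 1)
    hourly = [(start + timedelta(hours=i), float(i)) for i in range(30)]
    for ts, avg in trailing_average(hourly)[-3:]:
        print(ts.isoformat(), avg)
```

With hourly samples, each output value averages at most 24 points, so a single spike in one hour shifts the smoothed line by only a fraction of its size.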

Runbook

Manually executing AWS Lambda functions for backfilling

This requires AWS credentials.

./src/scripts/run_lambda.sh <start-date> <end-date> <region> <function-name> <s3-bucket> <s3-prefix>

Examples:

./src/scripts/run_lambda.sh \
    "2023-06-10" \
    "2023-06-20" \
    "eu-west-1" \
    "newswatch-load-live-uk" \
    "s3-example-newswatch-live" \
    "word-frequencies"

./src/scripts/run_lambda.sh \
    "2023-09-01" \
    "2023-09-02" \
    "us-east-1" \
    "newswatch-load-live-us" \
    "s3-example-newswatch-live-us" \
    "word-frequencies"

Execute and debug stages locally

Local execution requires AWS credentials and certain environment variables.

Each variable is required for specific stages of execution as noted below:

  • TEST_S3_BUCKET_NAME: all
  • SITES_YAML_PATH: extract
  • TEST_S3_EXTRACT_KEY: extract, transform
  • TEST_S3_TRANSFORM_KEY: transform, load
  • MIN_WORD_LENGTH: load
  • MIN_FREQUENCY: load
  • EXCLUDED_WORDS_TXT_PATH: load
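Before running a stage locally, it can help to check that its variables are set. The sketch below transcribes the list above into a lookup table; the names `REQUIRED_VARS` and `missing_vars` are hypothetical helpers and are not part of the repository.

```python
import os

# Stage -> required environment variables, transcribed from the list above.
# TEST_S3_BUCKET_NAME is listed as required by all stages.
REQUIRED_VARS = {
    "extract": [
        "TEST_S3_BUCKET_NAME",
        "SITES_YAML_PATH",
        "TEST_S3_EXTRACT_KEY",
    ],
    "transform": [
        "TEST_S3_BUCKET_NAME",
        "TEST_S3_EXTRACT_KEY",
        "TEST_S3_TRANSFORM_KEY",
    ],
    "load": [
        "TEST_S3_BUCKET_NAME",
        "TEST_S3_TRANSFORM_KEY",
        "MIN_WORD_LENGTH",
        "MIN_FREQUENCY",
        "EXCLUDED_WORDS_TXT_PATH",
    ],
}


def missing_vars(stage, env=None):
    """Return the variables required by `stage` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS[stage] if not env.get(name)]


if __name__ == "__main__":
    for stage in REQUIRED_VARS:
        missing = missing_vars(stage)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{stage}: {status}")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.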

Run individual stages with the following commands:

uv run ./src/newswatch/extract.py
uv run ./src/newswatch/transform.py
uv run ./src/newswatch/load.py

Note: When run locally, load.py does not connect to BigQuery.
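The names of the load-stage variables (MIN_WORD_LENGTH, MIN_FREQUENCY, EXCLUDED_WORDS_TXT_PATH) suggest that word counts are filtered before loading. The following is a guess at what such a filter could look like, not the code in load.py; the function `filter_word_counts`, the lower-casing step and the inclusive thresholds are all assumptions.

```python
from collections import Counter


def filter_word_counts(words, min_word_length, min_frequency, excluded_words):
    """Count words and keep those passing all three filters.

    Assumed rules for this sketch: words are lower-cased before counting,
    both thresholds are inclusive, and `excluded_words` holds lower-case words.
    """
    counts = Counter(word.lower() for word in words)
    return {
        word: count
        for word, count in counts.items()
        if len(word) >= min_word_length
        and count >= min_frequency
        and word not in excluded_words
    }


if __name__ == "__main__":
    sample = ["Election", "election", "the", "the", "the", "vote"]
    print(filter_word_counts(sample, 4, 2, {"the"}))
```

In this sample, "the" is dropped by the exclusion list and "vote" by the frequency threshold, leaving only "election".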
