newswatch.live - real-time word frequency visualisation for 🇬🇧 and 🇺🇸 news
Newswatch collects data from leading UK and US news websites' front pages and creates visualisations to track word frequency changes over time, providing insights into current news trends and topics.
Data is available from:
2020-06-05 in the UK 🇬🇧 and 2023-09-01 in the US 🇺🇸
High-level architecture overview:
To enhance dashboard performance, the data tables are partitioned by date, and Google's BI Engine is used for faster query processing.
To reduce noise in the charts and present smoother trends, a moving average of word frequencies is calculated over the last 24 hours. This calculation runs hourly through a scheduled query.
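As a rough illustration of the smoothing step, the sketch below computes a trailing 24-hour mean per word in plain Python. The real calculation runs as a scheduled query; the function name, the (timestamp, word, frequency) row shape, and the sample values here are assumptions made for the example, not the project's actual schema.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def moving_average(rows, window=timedelta(hours=24)):
    """Return {(timestamp, word): mean frequency over the trailing window}.

    rows is an iterable of (timestamp, word, frequency) tuples
    (a hypothetical shape chosen for this sketch).
    """
    by_word = defaultdict(list)
    for ts, word, freq in rows:
        by_word[word].append((ts, freq))

    result = {}
    for word, series in by_word.items():
        series.sort()
        for ts, _ in series:
            # Keep observations in the half-open interval (ts - window, ts].
            in_window = [f for t, f in series if ts - window < t <= ts]
            result[(ts, word)] = sum(in_window) / len(in_window)
    return result


if __name__ == "__main__":
    base = datetime(2023, 6, 10, 0)
    sample = [
        (base, "election", 2),
        (base + timedelta(hours=1), "election", 4),
        (base + timedelta(hours=25), "election", 10),
    ]
    for key, value in sorted(moving_average(sample).items()):
        print(key, value)
```

Averaging over a trailing window means a single front page with an unusual spike moves the plotted value only slightly, which is what smooths the trend lines.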
Running run_lambda.sh requires AWS credentials. Usage:
./src/scripts/run_lambda.sh <start-date> <end-date> <region> <function-name> <s3-bucket> <s3-prefix>
Examples:
./src/scripts/run_lambda.sh \
"2023-06-10" \
"2023-06-20" \
"eu-west-1" \
"newswatch-load-live-uk" \
"s3-example-newswatch-live" \
"word-frequencies"
./src/scripts/run_lambda.sh \
"2023-09-01" \
"2023-09-02" \
"us-east-1" \
"newswatch-load-live-us" \
"s3-example-newswatch-live-us" \
"word-frequencies"
Local execution requires AWS credentials and certain environment variables.
Each variable is required for specific stages of execution as noted below:
- TEST_S3_BUCKET_NAME: all
- SITES_YAML_PATH: extract
- TEST_S3_EXTRACT_KEY: extract, transform
- TEST_S3_TRANSFORM_KEY: transform, load
- MIN_WORD_LENGTH: load
- MIN_FREQUENCY: load
- EXCLUDED_WORDS_TXT_PATH: load
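The mapping above can be checked before a stage is started. The following is a minimal sketch, assuming "all" means the three stages extract, transform, and load; the `REQUIRED_VARS` table and the `missing_vars` helper are hypothetical names for illustration and are not part of the repository.

```python
import os

# Variable name -> stages that need it, copied from the list above.
# "all" is assumed to mean extract, transform, and load.
REQUIRED_VARS = {
    "TEST_S3_BUCKET_NAME": {"extract", "transform", "load"},
    "SITES_YAML_PATH": {"extract"},
    "TEST_S3_EXTRACT_KEY": {"extract", "transform"},
    "TEST_S3_TRANSFORM_KEY": {"transform", "load"},
    "MIN_WORD_LENGTH": {"load"},
    "MIN_FREQUENCY": {"load"},
    "EXCLUDED_WORDS_TXT_PATH": {"load"},
}


def missing_vars(stage, env=None):
    """Return the sorted names of variables the stage needs but env lacks."""
    env = os.environ if env is None else env
    return sorted(
        name
        for name, stages in REQUIRED_VARS.items()
        if stage in stages and not env.get(name)
    )


if __name__ == "__main__":
    for stage in ("extract", "transform", "load"):
        missing = missing_vars(stage)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{stage}: {status}")
```

Running the check first reports every unset variable for a stage at once, rather than failing on the first missing one partway through a run.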
Each stage can be run on its own with the following commands:
uv run ./src/newswatch/extract.py
uv run ./src/newswatch/transform.py
uv run ./src/newswatch/load.py
Note: When run locally, load.py does not connect to BigQuery.
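The load-stage variables MIN_WORD_LENGTH, MIN_FREQUENCY, and EXCLUDED_WORDS_TXT_PATH suggest that word counts are filtered before loading. The sketch below shows one plausible reading of that step; it is not taken from load.py. The `filter_frequencies` function, the inclusive thresholds, and the sample data are all assumptions.

```python
def filter_frequencies(frequencies, min_word_length, min_frequency, excluded_words):
    """Keep words that are long enough, frequent enough, and not excluded.

    Thresholds are treated as inclusive here, which is an assumption.
    """
    return {
        word: count
        for word, count in frequencies.items()
        if len(word) >= min_word_length
        and count >= min_frequency
        and word not in excluded_words
    }


if __name__ == "__main__":
    counts = {"the": 40, "election": 12, "nhs": 5, "vote": 1}
    print(filter_frequencies(counts, min_word_length=3, min_frequency=2, excluded_words={"the"}))
```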


