All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Podcast notes timestamps for each title
- Author first and last names reversed
- Podcast scripts double author names
- Podcasts generated automatically daily
- Podcast model
- Manual trigger for podcast audio generation
- Persist summaries uses DAOs to store records
- Processing DAG persists arXiv research to S3 and neo4j
- neo4j seeding script refactored to DAG
- arXiv research processing Lambdas
- Use single host in dev for orchestration and neo4j
- Daily arXiv research DAG publishes to Kafka
- Orchestration host init and service deployments refactored for modularity
- Kafka single-node cluster
- UI for Apache Kafka
- Phase 3 architecture started
- Apache Airflow orchestrates data pipeline
- Pipeline can ingest multiple arXiv sets (e.g. computer science, physics)
- arXiv summaries are fetched by Airflow DAG
- fetch_daily_arxiv_summaries Lambda
- Phase 2 architecture complete
- ETL works for arXiv records using Lambdas
- DAO/models for all major ETL pipeline entities
- Migrated to neo4j for research data and data lineage
- RDS
- Working prototype from Jupyter notebook
- Integrated with RDS and S3
- Research summary themes generated by OpenAI
- research_fetch_status Lambda implemented
- Initial project setup
- Phase 1 core infrastructure