Redshift_ELT_Pipeline


Author: Hiroaki Oshima
Date: Jan 4, 2021

About This Project



I implemented this project as part of the Udacity Data Engineering Nanodegree program. It is a data pipeline / workflow management project built with the widely used open-source software Apache Airflow. Airflow sequentially executes an Extract-Load-Transform (ELT) process step by step and manages its workflow. It loads application log data and song data from AWS S3 into a data warehouse on AWS Redshift. On the Redshift cluster, Airflow transforms the raw data and stores it in more normalized tables such as songs, users, artists, and time so that the analytics team can run various queries. Finally, after the transformation finishes, Airflow runs data-quality checks on each table that was created to ensure the data quality is good.

Technologies Used: Apache Airflow, AWS Redshift, S3 & IAM

Files

DAG

  1. dags/aws_etl.py: the Airflow DAG file that defines every task and the dependencies between them (a sketch of the wiring follows below)
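
A minimal sketch of what the DAG wiring in dags/aws_etl.py might look like, assuming Airflow 1.x-style imports; the task names, schedule, and default arguments here are illustrative, not copied from the repository.

```python
# Hypothetical sketch of the DAG definition, assuming Airflow 1.x imports.
# Operator classes and task ids are illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'hiroaki',
    'start_date': datetime(2021, 1, 1),
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'depends_on_past': False,
}

dag = DAG(
    'aws_etl',
    default_args=default_args,
    description='Load and transform data in Redshift with Airflow',
    schedule_interval='@hourly',
)

start_operator = DummyOperator(task_id='Begin_execution', dag=dag)
end_operator = DummyOperator(task_id='Stop_execution', dag=dag)

# The staging, fact, dimension, and quality-check tasks are declared here,
# then chained with the >> operator, roughly like:
# start_operator >> create_tables >> [stage_events, stage_songs]
# [stage_events, stage_songs] >> load_songplays_table
# load_songplays_table >> [load_user_dim, load_song_dim, load_artist_dim, load_time_dim]
# [load_user_dim, load_song_dim, load_artist_dim, load_time_dim] >> run_quality_checks
# run_quality_checks >> end_operator
```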

Operators

  1. plugins/myoperators/stage_redshift.py: custom operator that stages the log data from S3 onto the Redshift cluster (see the sketch after this list)

  2. plugins/myoperators/load_fact: custom operator that transforms the staged log data into the fact table

  3. plugins/myoperators/load_dimensions: custom operator that transforms the staged tables into dimension tables

  4. plugins/myoperators/data_quality: custom operator that ensures the given columns of the given tables were transformed successfully
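
As a hedged illustration of the staging operator, the sketch below copies JSON files from S3 into a Redshift staging table; it assumes Airflow 1.x hooks, and the parameter names are illustrative rather than the repository's exact ones.

```python
# Hypothetical sketch of a staging operator like stage_redshift.py,
# assuming Airflow 1.x hooks; parameter names are illustrative.
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class StageToRedshiftOperator(BaseOperator):
    """Copy JSON files from S3 into a Redshift staging table."""

    copy_sql = """
        COPY {table}
        FROM '{s3_path}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        FORMAT AS JSON '{json_format}'
    """

    @apply_defaults
    def __init__(self, redshift_conn_id='', aws_credentials_id='',
                 table='', s3_path='', json_format='auto', *args, **kwargs):
        super(StageToRedshiftOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_credentials_id = aws_credentials_id
        self.table = table
        self.s3_path = s3_path
        self.json_format = json_format

    def execute(self, context):
        # Fetch AWS credentials and a Redshift connection from Airflow.
        credentials = AwsHook(self.aws_credentials_id).get_credentials()
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)

        # Clear any previous staging data, then COPY the JSON from S3.
        redshift.run('DELETE FROM {}'.format(self.table))
        redshift.run(StageToRedshiftOperator.copy_sql.format(
            table=self.table,
            s3_path=self.s3_path,
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
            json_format=self.json_format,
        ))
```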

Data Pipeline Workflow

DAG image: the rendered DAG graph is included in the repository.

Before running Airflow: I created a data warehouse cluster on AWS Redshift and assigned the host address to an Airflow variable, along with the security credentials.
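
As a hedged illustration of that setup step, the DAG can later read the stored host and credentials roughly like the snippet below; the variable and connection ids are assumptions, not the exact names used in this repository.

```python
# Hypothetical illustration of reading the Redshift host and credentials that
# were stored in Airflow beforehand; variable/connection ids are assumptions.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import Variable

redshift_host = Variable.get('redshift_host')          # cluster endpoint saved as an Airflow variable
redshift = PostgresHook(postgres_conn_id='redshift')   # connection holding user, password, db, and port
```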

Create Tables uses the Postgres operator to create tables on AWS Redshift and drop old tables if they exist. The SQL queries and detailed schemas are in plugin/helpers/sql_queries.py.

Stage_tables loads the log and song JSON data from S3 onto the Redshift cluster.

Load_fact_table transforms the staged data into the fact table songplays.
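
As a hedged illustration, this transformation can be written as an INSERT ... SELECT over the staging tables, in the style of plugin/helpers/sql_queries.py; the staging table and column names below are assumptions based on the standard Udacity dataset, not copied from the repository.

```python
# Hypothetical INSERT ... SELECT in the style of plugin/helpers/sql_queries.py.
# Staging table and column names are assumptions, not the repository's exact ones.
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT
        TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
        e.userid,
        e.level,
        s.song_id,
        s.artist_id,
        e.sessionid,
        e.location,
        e.useragent
    FROM staging_events e
    LEFT JOIN staging_songs s
        ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong'
"""
```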

Load_dim_table creates 4 dimension tables transformed from the fact table

Run_table_quality_check makes sure the tables that were just created are not broken.
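
A hedged sketch of what such a check might look like inside the data_quality operator's execute logic is shown below; the table and column names are illustrative.

```python
# Hypothetical quality check: fail the task if a table is empty or a key
# column contains NULLs. Table/column names are illustrative.
from airflow.hooks.postgres_hook import PostgresHook


def check_table(redshift_conn_id, table, column):
    redshift = PostgresHook(postgres_conn_id=redshift_conn_id)

    # The table must contain at least one row.
    records = redshift.get_records('SELECT COUNT(*) FROM {}'.format(table))
    if not records or records[0][0] < 1:
        raise ValueError('Data quality check failed: {} is empty'.format(table))

    # The given key column must not contain NULLs.
    nulls = redshift.get_records(
        'SELECT COUNT(*) FROM {} WHERE {} IS NULL'.format(table, column))
    if nulls[0][0] > 0:
        raise ValueError('Data quality check failed: {}.{} has NULLs'.format(table, column))
```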

Table Schemas

Fact Table

  1. songplays - records in event data associated with song plays i.e. records with page NextSong
    • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

  1. users - users in the app
    • user_id, first_name, last_name, gender, level
  2. songs - songs in music database
    • song_id, title, artist_id, year, duration
  3. artists - artists in music database
    • artist_id, name, location, latitude, longitude
  4. time - timestamps of records in songplays broken down into specific units
    • start_time, hour, day, week, month, year, weekday

There is one fact table, songplays, which holds the log data of users playing songs on the app. There are four dimension tables, which individually store information about users, songs, artists, and time. This normalized format allows the analytics team to run various queries; for example, they can filter by users' gender, their location, or the times when they play songs.
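
For instance, a query like the hedged sketch below could count song plays by gender and hour; the column names follow the schemas above, and the query itself is an assumption rather than one shipped with the project.

```python
# Hypothetical analytics query enabled by the star schema: plays per gender per hour.
plays_by_gender_and_hour = """
    SELECT u.gender, t.hour, COUNT(*) AS plays
    FROM songplays sp
    JOIN users u ON sp.user_id = u.user_id
    JOIN time  t ON sp.start_time = t.start_time
    GROUP BY u.gender, t.hour
    ORDER BY plays DESC
"""
```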

Written with StackEdit.
