Author: Hiroaki Oshima
Date: Jan 4, 2021
I implemented this project as part of the Udacity Data Engineering Nanodegree program. The project is a data pipeline / workflow implemented with the widely used open-source tool Apache Airflow. Airflow executes an Extract, Load, Transform (ELT) process step by step and manages its workflow. It loads application log data and song data from AWS S3 into a data warehouse on AWS Redshift. On the Redshift cluster, Airflow transforms the raw data and stores it in normalized tables such as songs, users, artists, and time, so that the analytics team can run various queries. Finally, once the transformation is complete, Airflow runs data quality checks on each created table to ensure the data is sound.
Technologies Used: Apache Airflow, AWS Redshift, S3 & IAM
- dags/aws_etl.py: the Airflow DAG file containing every task and their dependencies (see the sketch after this list)
- plugins/myoperators/stage_redshift.py: custom operator that stages the log data from S3 onto the Redshift cluster
- plugins/myoperators/load_fact.py: custom operator that transforms the staged log data into the fact table
- plugins/myoperators/load_dimensions.py: custom operator that transforms the staged tables into dimension tables
- plugins/myoperators/data_quality.py: custom operator that ensures the given columns of the given tables were transformed successfully
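A minimal sketch of how dags/aws_etl.py might wire these operators together; the task IDs, owner, schedule, and dependency layout below are illustrative assumptions, not necessarily the exact ones in this repo:

```python
# Illustrative DAG wiring; task ids, owner, and schedule are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "owner": "airflow",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,
}

dag = DAG(
    "aws_etl",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
)

start = DummyOperator(task_id="begin_execution", dag=dag)
end = DummyOperator(task_id="stop_execution", dag=dag)

# The staging, fact, dimension, and quality-check tasks sketched in the
# sections below slot in between, roughly:
#   start >> [stage_events, stage_songs]
#   [stage_events, stage_songs] >> load_songplays_fact
#   load_songplays_fact >> [load_users, load_songs, load_artists, load_time]
#   [load_users, load_songs, load_artists, load_time] >> quality_checks >> end
```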
DAG Image:
Before Running Airflow: I created a data warehouse cluster on AWS Redshift and assigned its host address to an Airflow variable, along with the security credentials.
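At runtime the custom operators can read these through Airflow's variables and hooks. A minimal sketch, assuming a variable named redshift_host and connections named redshift and aws_credentials (the actual names used in this repo may differ):

```python
from airflow.models import Variable
from airflow.hooks.postgres_hook import PostgresHook
from airflow.contrib.hooks.aws_hook import AwsHook

# Assumed names; configured via the Airflow UI (Admin > Variables / Connections).
redshift_host = Variable.get("redshift_host")           # cluster endpoint
redshift = PostgresHook(postgres_conn_id="redshift")    # db, user, password, port
aws_credentials = AwsHook(aws_conn_id="aws_credentials").get_credentials()
```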
Create_tables uses the Postgres operator to create tables on AWS Redshift, dropping the old tables if they exist. The SQL queries and detailed schemas are in plugins/helpers/sql_queries.py.
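A hedged sketch of what such a task can look like; the real DDL lives in plugins/helpers/sql_queries.py, and the users table below is abbreviated from the schema listed later in this README:

```python
from airflow.operators.postgres_operator import PostgresOperator

# Illustrative task; the actual queries come from plugins/helpers/sql_queries.py.
create_tables = PostgresOperator(
    task_id="create_tables",
    dag=dag,                      # the DAG object from the sketch above
    postgres_conn_id="redshift",  # assumed connection id
    sql="""
        DROP TABLE IF EXISTS users;
        CREATE TABLE users (
            user_id    INT PRIMARY KEY,
            first_name VARCHAR,
            last_name  VARCHAR,
            gender     VARCHAR,
            level      VARCHAR
        );
    """,
)
```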
Stage_tables loads the log and song JSON data from S3 onto the Redshift cluster.
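Under the hood, a staging operator of this kind typically issues a Redshift COPY. A minimal sketch, assuming the Udacity sample bucket path and the connection IDs from above (all illustrative, not the repo's exact logic):

```python
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook

def stage_events_to_redshift():
    """Illustrative version of what plugins/myoperators/stage_redshift.py does."""
    creds = AwsHook(aws_conn_id="aws_credentials").get_credentials()
    redshift = PostgresHook(postgres_conn_id="redshift")
    redshift.run(f"""
        COPY staging_events
        FROM 's3://udacity-dend/log_data'   -- assumed bucket path
        ACCESS_KEY_ID '{creds.access_key}'
        SECRET_ACCESS_KEY '{creds.secret_key}'
        REGION 'us-west-2'
        FORMAT AS JSON 'auto'
    """)
```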
Load_fact_table transforms the staged data into the fact table songplays.
Load_dim_table creates four dimension tables transformed from the staged data.
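Both load steps boil down to INSERT ... SELECT statements run against the staged tables. A hedged sketch using the users dimension; the staging table's column names are assumptions based on the log JSON, and the real SELECTs live in plugins/helpers/sql_queries.py:

```python
from airflow.hooks.postgres_hook import PostgresHook

# Illustrative transform; actual query text comes from plugins/helpers/sql_queries.py.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE page = 'NextSong' AND userId IS NOT NULL
"""

def load_users_dimension():
    PostgresHook(postgres_conn_id="redshift").run(user_table_insert)
```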
Run_table_quality_check makes sure the tables that were just created are not broken.
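A typical check of this kind counts rows and fails the task if a table came out empty. A minimal sketch of what plugins/myoperators/data_quality.py might do (illustrative, not the repo's exact logic):

```python
from airflow.hooks.postgres_hook import PostgresHook

def check_table_not_empty(table):
    """Fail the task if `table` has no rows (illustrative quality check)."""
    redshift = PostgresHook(postgres_conn_id="redshift")
    records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
    if not records or not records[0] or records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} returned no rows")

# Example usage over every table the pipeline creates:
for table in ("songplays", "users", "songs", "artists", "time"):
    check_table_not_empty(table)
```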
- songplays - records in event data associated with song plays, i.e. records with page NextSong
- songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users in the app
- user_id, first_name, last_name, gender, level
- songs - songs in music database
- song_id, title, artist_id, year, duration
- artists - artists in music database
- artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units
- start_time, hour, day, week, month, year, weekday
There is one fact table, songplays, which holds the log data of users playing songs in the app. There are four dimension tables, which individually store information about users, songs, artists, and time. This star schema allows the analytics team to run various queries, for example filtering song plays by user gender, location, or the time when songs were played.
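For instance, a query like the following (illustrative) answers "how do song plays break down by gender and hour of day" with two simple joins:

```python
from airflow.hooks.postgres_hook import PostgresHook

# Illustrative analytics query enabled by the star schema.
rows = PostgresHook(postgres_conn_id="redshift").get_records("""
    SELECT u.gender, t.hour, COUNT(*) AS plays
    FROM songplays sp
    JOIN users u ON u.user_id = sp.user_id
    JOIN time  t ON t.start_time = sp.start_time
    GROUP BY u.gender, t.hour
    ORDER BY plays DESC
""")
```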