This project implements an ETL (Extract, Transform, Load) pipeline that processes web browsing data. The main tasks are merging CSV files, transforming the data, and ensuring data quality before analysis.
- JSON Files: Input files are located in the `Data` folder.
- CSV Files: Output CSV files are saved in the `output_directory` folder.
- Merged CSV File: The final merged data is saved as `final_file.csv` in the `merged_output` folder.
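Based on the paths above (and the database location described below), the repository layout is roughly the following; the exact tree may differ:

```
.
├── Data/                     # input JSON files
├── output_directory/         # per-file CSV output
├── merged_output/
│   └── final_file.csv        # merged dataset
└── final_data_to_sqlite/
    └── database.db           # SQLite database
```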
- `merge_csvs.py`: Merges multiple CSV files from the `output_directory` folder into a single CSV file. The merged file is saved to `merged_output/final_file.csv` (a sketch of the merge-and-transform flow follows this list).
- `data_transformer.py`: Handles data transformation tasks, including converting long URLs to short formats and updating the `operating_sys` field.
- `load_csv_to_sqlite.py`: Loads the transformed CSV data into the SQLite database (`database.db`).
- `check_transformations.py`: Checks that the data transformations were applied correctly, including verifying the `operating_sys` field and removing duplicates from the database.
- `exploration.py`: Reads the merged CSV file and performs data exploration with Pandas, including generating statistics and visualizations of the data.
- `main.py`: The main entry point of the application; orchestrates reading, transforming, and writing data, including loading it into the SQLite database.
- `utils/helpers.py`: Contains utility functions that assist with transformation tasks in the project.
- Database File: After the CSV files are merged, the data is loaded into an SQLite database named `database.db` located in the `final_data_to_sqlite` folder. The database schema and table definitions are included in the `sqlite_shema.py` file.
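The merge and transform steps are the heart of the pipeline. Below is a minimal sketch of how they might fit together with pandas; the helpers `shorten_url` and `transform` are illustrative assumptions, not the actual functions in `data_transformer.py` or `utils/helpers.py`:

```python
import glob
import os

import pandas as pd

def merge_csvs(input_dir: str = "output_directory",
               output_path: str = "merged_output/final_file.csv") -> pd.DataFrame:
    """Concatenate every CSV in input_dir into one merged file."""
    paths = sorted(glob.glob(os.path.join(input_dir, "*.csv")))
    merged = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    merged.to_csv(output_path, index=False)
    return merged

def shorten_url(url: str) -> str:
    """Illustrative long-to-short URL conversion: keep only the domain."""
    return url.split("//")[-1].split("/")[0]

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Shorten URL columns, normalize operating_sys, and drop duplicates."""
    df["from_url"] = df["from_url"].astype(str).map(shorten_url)
    df["to_url"] = df["to_url"].astype(str).map(shorten_url)
    df["operating_sys"] = df["operating_sys"].astype(str).str.strip().str.title()
    return df.drop_duplicates()
```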
- Clone the Repository

  ```bash
  git clone <repository-url>
  cd <repository-directory>
  ```

- Install Dependencies

  Ensure you have the required Python libraries installed. You can install them using pip:

  ```bash
  pip install pandas matplotlib sqlalchemy
  ```

- Install SQLite

  Ensure SQLite is installed on your system.
To run the scheduled script, use the provided `schedule.sh` script:

```bash
./schedule.sh Data usa_gov_click_data4.json usa_gov_click_data5.json usa_gov_click_data7.json
```
To execute both scripts in sequence and measure execution time, use the provided `run_all.sh` script:

```bash
./run_all.sh
```

This script executes `merge_csvs.py` and `exploration.py`, printing the duration of each script and the total execution time.
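The contents of `run_all.sh` are not shown here, but its sequence-and-time logic amounts to something like the following Python sketch (assuming both scripts can be invoked with the `python` interpreter from the repository root):

```python
import subprocess
import time

def timed_run(script: str) -> float:
    """Run a script and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(["python", script], check=True)
    return time.perf_counter() - start

total = 0.0
for script in ("merge_csvs.py", "exploration.py"):
    duration = timed_run(script)
    total += duration
    print(f"{script} finished in {duration:.2f}s")
print(f"Total execution time: {total:.2f}s")
```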
The merged CSV file `final_file.csv` and the SQLite database contain the following columns:

- `web_browser`: Browser used
- `operating_sys`: Operating system
- `from_url`: URL from which the user arrived
- `to_url`: URL to which the user navigated
- `city`: City of the user
- `latitude`: Latitude of the user's location
- `longitude`: Longitude of the user's location
- `time_zone`: Time zone of the user
- `time_in`: Timestamp when the user arrived
- `time_out`: Timestamp when the user left
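As an illustration of how this schema reaches SQLite, here is a hedged sketch of a load step using pandas and sqlite3; the table name `clicks` is an assumption, since the real definitions live in `sqlite_shema.py`:

```python
import sqlite3

import pandas as pd

EXPECTED_COLUMNS = [
    "web_browser", "operating_sys", "from_url", "to_url", "city",
    "latitude", "longitude", "time_zone", "time_in", "time_out",
]

df = pd.read_csv("merged_output/final_file.csv")

# Fail fast if the merged file is missing any expected column.
missing = set(EXPECTED_COLUMNS) - set(df.columns)
if missing:
    raise ValueError(f"final_file.csv is missing columns: {sorted(missing)}")

# 'clicks' is an assumed table name; the project defines its own schema.
with sqlite3.connect("final_data_to_sqlite/database.db") as conn:
    df[EXPECTED_COLUMNS].to_sql("clicks", conn, if_exists="replace", index=False)
```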
Power BI is integrated to visualize the final dataset. This includes building reports and dashboards to analyze key metrics and insights.