This project implements an ETL (Extract, Transform, Load) pipeline that processes web browsing data. The main tasks are merging CSV files, transforming the data, and ensuring data quality before analysis.
- JSON Files: Input files are located in the `Data` folder.
- CSV Files: Output CSV files are saved in the `output_directory` folder.
- Merged CSV File: The final merged data is saved as `final_file.csv` in the `merged_output` folder.
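Based on the paths above (and the database location described below), the repository layout is roughly the following; the exact tree may differ:

```
.
├── Data/                     # input JSON files
├── output_directory/         # per-file CSV output
├── merged_output/
│   └── final_file.csv        # merged dataset
└── final_data_to_sqlite/
    └── database.db           # SQLite database
```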
- `merge_csvs.py`: Merges multiple CSV files from the `output_directory` folder into a single CSV file. The merged file is saved to `merged_output/final_file.csv` (a sketch of the merge-and-transform flow follows this list).
- `data_transformer.py`: Handles data transformation tasks, including converting long URLs to short formats and updating the `operating_sys` field.
- `load_csv_to_sqlite.py`: Loads the transformed CSV data into the SQLite database (`database.db`).
- `check_transformations.py`: Checks that the data transformations were applied correctly, including verifying the `operating_sys` field and removing duplicates from the database.
- `exploration.py`: Reads the merged CSV file and performs data exploration with Pandas, including generating statistics and visualizations of the data.
- `main.py`: The main entry point of the application; orchestrates reading, transforming, and writing data, including loading it into the SQLite database.
- `utils/helpers.py`: Contains utility functions that assist with transformation tasks in the project.
- Database File: After the CSV files are merged, the data is loaded into an SQLite database named `database.db` located in the `final_data_to_sqlite` folder. The database schema and table definitions are included in the `sqlite_shema.py` file.
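The merge and transform steps are the heart of the pipeline. Below is a minimal sketch of how they might fit together with pandas; the helpers `shorten_url` and `transform` are illustrative assumptions, not the actual functions in `data_transformer.py` or `utils/helpers.py`:

```python
import glob
import os

import pandas as pd

def merge_csvs(input_dir: str = "output_directory",
               output_path: str = "merged_output/final_file.csv") -> pd.DataFrame:
    """Concatenate every CSV in input_dir into one merged file."""
    paths = sorted(glob.glob(os.path.join(input_dir, "*.csv")))
    merged = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    merged.to_csv(output_path, index=False)
    return merged

def shorten_url(url: str) -> str:
    """Illustrative long-to-short URL conversion: keep only the domain."""
    return url.split("//")[-1].split("/")[0]

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Shorten URL columns, normalize operating_sys, and drop duplicates."""
    df["from_url"] = df["from_url"].astype(str).map(shorten_url)
    df["to_url"] = df["to_url"].astype(str).map(shorten_url)
    df["operating_sys"] = df["operating_sys"].astype(str).str.strip().str.title()
    return df.drop_duplicates()
```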
- Clone the Repository

  ```bash
  git clone <repository-url>
  cd <repository-directory>
  ```

- Install Dependencies

  Ensure you have the required Python libraries installed. You can install them using pip:

  ```bash
  pip install pandas matplotlib sqlalchemy
  ```

- Install SQLite

  Ensure SQLite is installed on your system.
To run the scheduled script, use the provided `schedule.sh` script:

```bash
./schedule.sh Data usa_gov_click_data4.json usa_gov_click_data5.json usa_gov_click_data7.json
```
To execute both scripts in sequence and measure execution time, use the provided `run_all.sh` script:

```bash
./run_all.sh
```

This script executes `merge_csvs.py` and `exploration.py`, printing the duration of each script and the total execution time.
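The contents of `run_all.sh` are not shown here, but its sequence-and-time logic amounts to something like the following Python sketch (assuming both scripts can be invoked with the `python` interpreter from the repository root):

```python
import subprocess
import time

def timed_run(script: str) -> float:
    """Run a script and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(["python", script], check=True)
    return time.perf_counter() - start

total = 0.0
for script in ("merge_csvs.py", "exploration.py"):
    duration = timed_run(script)
    total += duration
    print(f"{script} finished in {duration:.2f}s")
print(f"Total execution time: {total:.2f}s")
```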
The merged CSV file `final_file.csv` and the SQLite database contain the following columns:

- `web_browser`: Browser used
- `operating_sys`: Operating system
- `from_url`: URL from which the user arrived
- `to_url`: URL to which the user navigated
- `city`: City of the user
- `latitude`: Latitude of the user's location
- `longitude`: Longitude of the user's location
- `time_zone`: Time zone of the user
- `time_in`: Timestamp when the user arrived
- `time_out`: Timestamp when the user left
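As an illustration of how this schema reaches SQLite, here is a hedged sketch of a load step using pandas and sqlite3; the table name `clicks` is an assumption, since the real definitions live in `sqlite_shema.py`:

```python
import sqlite3

import pandas as pd

EXPECTED_COLUMNS = [
    "web_browser", "operating_sys", "from_url", "to_url", "city",
    "latitude", "longitude", "time_zone", "time_in", "time_out",
]

df = pd.read_csv("merged_output/final_file.csv")

# Fail fast if the merged file is missing any expected column.
missing = set(EXPECTED_COLUMNS) - set(df.columns)
if missing:
    raise ValueError(f"final_file.csv is missing columns: {sorted(missing)}")

# 'clicks' is an assumed table name; the project defines its own schema.
with sqlite3.connect("final_data_to_sqlite/database.db") as conn:
    df[EXPECTED_COLUMNS].to_sql("clicks", conn, if_exists="replace", index=False)
```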
Power BI is integrated to visualize the final dataset. This includes building reports and dashboards to analyze key metrics and insights.