GitHub Insights is a feature within GitHub Enterprise that provides organizations with detailed analytics and insights into their codebase, team productivity, and workflows. It offers a data-driven view of a project’s development health, which can help teams optimize performance, identify bottlenecks, and make informed decisions.
- Measures contributions across team members, including volume and frequency, allowing managers to understand team dynamics and work distribution.
- Docker Compose:
  - A `docker-compose.yml` file defines how each service (Airflow, dbt, etc.) interacts.
  - Allows consistent setup and easy deployment of the ETL pipeline across environments.
  - Airflow, Snowflake connectors, and other ETL dependencies can run inside Docker containers.
- Airflow:
  - Schedules and manages the entire ETL process, handling dependencies between tasks.
  - An Airflow DAG (Directed Acyclic Graph) is created to define the workflow steps.
  - The DAG defines the sequence (e.g., Extract → Transform → Load) and schedules when each task should run.
- dbt:
  - Handles the transformation of raw data into the desired structure, applying business logic.
  - Builds SQL transformations on top of raw tables, generating models and views in Snowflake.
  - Process:
    - Create a dbt project where you define models, schema tests, and transformation logic.
    - Materialize models as tables or views in Snowflake.
- Snowflake:
  - Scalable cloud data storage platform.
  - Steps:
    - Create a table and define its structure.
    - Table structure includes:
      - `SHA` (Secure Hash Algorithm): unique commit identifier.
      - `Author`: commit author.
      - `Date`: commit date.
      - `Message`: commit message; serves as documentation, explaining why changes were made and their impact.
- Extract:
  - Airflow extracts data from the CSV file and temporarily stores it.
- Load (Raw Data):
  - Load the extracted raw CSV data into a Snowflake staging table.
- Transform (dbt):
  - Transform the raw data using SQL models, applying business logic.
- Load (Final Data):
  - Load the transformed data into a final Snowflake table.
- Visualize:
  - Use tools like Tableau for visualization (the end-to-end flow is sketched as an Airflow DAG below).
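The flow above can be pictured as a single Airflow DAG. The sketch below is illustrative only: the DAG id, task names, callables, and schedule are assumptions rather than this project's actual code, and the dbt step assumes the dbt project is mounted at `/dbt` (as configured later in the Docker Compose folder mapping).

```python
# Illustrative sketch of the Extract -> Load -> Transform flow as a single DAG.
# DAG id, task ids, and callables are hypothetical, not this project's commits_load.py.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_commits_to_csv():
    """Placeholder: pull commits from GitHub and write them to a temporary CSV."""


def load_csv_to_staging():
    """Placeholder: copy the raw CSV rows into a Snowflake staging table."""


with DAG(
    dag_id="github_commits_etl",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # assumed schedule
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_commits_to_csv)
    load_raw = PythonOperator(task_id="load_raw", python_callable=load_csv_to_staging)
    # dbt builds the final models/views in Snowflake on top of the staging table;
    # assumes the dbt project is mounted at /dbt inside the container.
    transform = BashOperator(task_id="dbt_run", bash_command="cd /dbt && dbt run")

    extract >> load_raw >> transform
```

Keeping the dbt run as its own task lets Airflow retry the transformation independently of the extract and load steps.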
- The `commitsJobHandler.py` script extracts data from a GitHub repository and temporarily stores it in a CSV file.
- Created project folder: `GitHubInsights`.
- Installed Docker Desktop.
- Downloaded the `docker-compose.yaml` file:

  ```
  curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.3/docker-compose.yaml'
  ```
- Run Apache Airflow with Docker Compose:
  - Ensure Docker is installed:

    ```
    docker --version
    docker-compose --version
    ```

  - Navigate to the directory with `docker-compose.yaml`:

    ```
    cd /path/to/your/docker-compose-file
    ```

  - Start services:

    ```
    docker-compose up -d
    ```

  - Access the Airflow UI: http://localhost:8080
    - Default credentials:
      - Username: `airflow`
      - Password: `airflow`
  - Create a connector to Snowflake: in the Airflow UI, go to Admin -> Connections -> Add Connection and choose Snowflake as the Connection Type.
  - Stop services:

    ```
    docker-compose down
    ```
- Folder mapping in Docker Compose:

  ```
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - ./dbt:/dbt
  ```
- Use the Docker Desktop terminal to verify the mappings:

  ```
  cd /dbt
  ```
- Create the Snowflake connection in the Airflow UI (`Admin -> Connections`, Connection Type: Snowflake) and set the Extra field:

  ```
  {
    "account": "SPECIFY_YOUR_ACCOUNT_IN_SNOWFLAKE",
    "warehouse": "COMPUTE_WH",
    "database": "COMMITS",
    "role": "ETL_USER",
    "insecure_mode": false
  }
  ```
- Create in Snowflake:
  - User `AIRFLOW_ETL`
  - Role `ETL_USER`
  - Permissions:

    ```
    GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE ETL_USER;
    GRANT USAGE ON DATABASE COMMITS TO ROLE ETL_USER;
    GRANT USAGE ON SCHEMA COMMITS.PUBLIC TO ROLE ETL_USER;
    GRANT INSERT, SELECT ON TABLE COMMITS.PUBLIC.COMMITS TO ROLE ETL_USER;
    ```
- Create the Snowflake database (`commits`), schema (`public`), and table (`commits`):

  ```
  CREATE OR REPLACE TABLE COMMITS.PUBLIC.COMMITS (
      SHA VARCHAR(16777216),
      AUTHOR VARCHAR(16777216),
      DATE DATE,
      MESSAGE VARCHAR(16777216)
  );
  ```
- `commitsJobHandler.py`
  - Uses a wrapper around the GitHub API (`githubClient`) to load commits into a CSV file, `commits.csv` (an illustrative extraction sketch follows this list).
- `commits_load.py`
  - Defines the DAG `csv_to_snowflake`, which imports the commit data from `commits.csv` into the Snowflake database `commits` (an illustrative load sketch follows this list).
- Run the DAG `csv_to_snowflake` in the Airflow UI.
- Check the loaded data in Snowflake:

  ```
  SELECT * FROM COMMITS.PUBLIC.COMMITS;
  ```
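For reference, the project's `commitsJobHandler.py` goes through the `githubClient` wrapper; the sketch below only shows the same idea against the GitHub REST API directly with `requests`. The repository name, the token environment variable, and the output path are placeholders, not values from this project.

```python
# Illustrative extraction sketch (not the project's githubClient wrapper).
# REPO, the GITHUB_TOKEN variable, and the output path are placeholders.
import csv
import os

import requests

REPO = "your-org/your-repo"             # placeholder repository
TOKEN = os.environ.get("GITHUB_TOKEN")  # assumed to be set for private repositories
URL = f"https://api.github.com/repos/{REPO}/commits"


def dump_commits_to_csv(path: str = "commits.csv") -> None:
    headers = {"Authorization": f"Bearer {TOKEN}"} if TOKEN else {}
    response = requests.get(URL, headers=headers, params={"per_page": 100})
    response.raise_for_status()

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["SHA", "AUTHOR", "DATE", "MESSAGE"])
        for item in response.json():
            commit = item["commit"]
            writer.writerow([
                item["sha"],
                commit["author"]["name"],
                commit["author"]["date"][:10],  # keep only the date part
                commit["message"],
            ])


if __name__ == "__main__":
    dump_commits_to_csv()
```

Likewise, `commits_load.py` is not reproduced here; the fragment below only sketches how a `csv_to_snowflake` task could push the CSV rows into `COMMITS.PUBLIC.COMMITS` through the Airflow Snowflake connection. The connection id `snowflake_conn` and the CSV path are assumptions; use whatever values were configured above.

```python
# Sketch of a load task for a DAG like 'csv_to_snowflake'; the connection id and
# CSV path are assumptions, not taken from the project's commits_load.py.
import csv

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def load_csv_to_snowflake(path: str = "/opt/airflow/dags/commits.csv") -> None:
    hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")  # assumed connection id
    with open(path, newline="", encoding="utf-8") as f:
        rows = [
            (r["SHA"], r["AUTHOR"], r["DATE"], r["MESSAGE"])
            for r in csv.DictReader(f)
        ]
    # insert_rows comes from Airflow's DbApiHook and issues parameterized INSERTs.
    hook.insert_rows(
        table="COMMITS.PUBLIC.COMMITS",
        rows=rows,
        target_fields=["SHA", "AUTHOR", "DATE", "MESSAGE"],
    )
```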
---
## TO DO
### Step 3: Transform
- Use dbt to transform data and load into Snowflake.
---
### Step 4: Visualize
- Created visualizations in **Tableau**.

---
## TO DO
### Step 5: Automate
- Automate data ingestion so that only new or changed commits are uploaded (one possible approach is sketched below).
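One possible way to upload only changes is to ask the GitHub commits endpoint for commits newer than the latest `DATE` already present in Snowflake, via its `since` parameter. The helper below is a sketch under that assumption; the repository URL and the `snowflake_conn` connection id are placeholders.

```python
# Sketch of incremental extraction: fetch only commits newer than what is already
# loaded in Snowflake. Connection id, repo URL, and watermark query are assumptions.
import requests
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

URL = "https://api.github.com/repos/your-org/your-repo/commits"  # placeholder repo


def fetch_new_commits() -> list[dict]:
    hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")  # assumed connection id
    last_date = hook.get_first("SELECT MAX(DATE) FROM COMMITS.PUBLIC.COMMITS")[0]

    params = {"per_page": 100}
    if last_date is not None:
        # The GitHub commits API accepts an ISO-8601 'since' timestamp; commits from
        # that same day may come back again and should be de-duplicated on SHA.
        params["since"] = f"{last_date}T00:00:00Z"

    response = requests.get(URL, params=params)  # add an auth header for private repos
    response.raise_for_status()
    return response.json()
```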
## TO DO
- Describe approach to security
- CI/CD pipelines
- Scheduled incremental imports
- Monitoring and alerting
- Show current solution limitations using metrics
- Scalability to many repos, orgs, users
- GDPR
- Production infra, automated configuration
- Initial development and long-term support cost
- Improve dashboard, add commits, Snowflake db schema
- Resiliency, data consistency and performance
- Add analysis of commit data with the DeepSeek AI model (one possible shape is sketched below).
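For the last item, one hedged possibility is to call DeepSeek's OpenAI-compatible chat API to summarize or classify commit messages pulled from Snowflake. The sketch below assumes the `openai` Python client, a `DEEPSEEK_API_KEY` environment variable, and the `snowflake_conn` connection id used earlier; none of these come from the current project.

```python
# Sketch only: summarize recent commit messages with DeepSeek's OpenAI-compatible API.
# DEEPSEEK_API_KEY, the model name, and the Snowflake connection id are assumptions.
import os

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook
from openai import OpenAI


def summarize_recent_commits(limit: int = 50) -> str:
    hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    rows = hook.get_records(
        f"SELECT MESSAGE FROM COMMITS.PUBLIC.COMMITS ORDER BY DATE DESC LIMIT {limit}"
    )
    messages = "\n".join(r[0] for r in rows)

    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
    )
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You analyze git commit history."},
            {"role": "user", "content": f"Summarize the main themes in these commits:\n{messages}"},
        ],
    )
    return response.choices[0].message.content
```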