Project using BigQuery, SQL, Feature Engineering, and XGBoost to detect early signals of customer churn, with deployment on Google Cloud

adanSiqueira/customer-churn-prediction

🛍️ Customer Churn Prediction in eCommerce with Google Cloud

Python SQL Pandas Matplotlib Seaborn Scikit-learn

Predicting when a customer is likely to stop buying is one of the most critical insights for any subscription-based or transactional business. This project uses the public TheLook eCommerce dataset to develop a machine learning model capable of identifying churn risk, helping companies take action before losing valuable clients.


🚀 Project Overview

Goal:
Build a machine learning model to predict customer churn and uncover retention insights based on behavioral data from the TheLook eCommerce dataset.

Business Impact:
By identifying customers likely to churn, marketing teams can implement re-engagement strategies and loyalty campaigns, increasing Customer Lifetime Value (CLV) and Revenue Retention.


📦 Dataset: TheLook eCommerce (BigQuery Public Data)

The dataset consists of 7 structured tables related to users, orders, products, inventory, and transactions. Data was extracted using custom SQL queries, merged in Python (Pandas), and cleaned for modeling.

Key features include:

  • Customer demographics (age, gender, location)
  • Purchase behavior (order frequency, spend, recency)
  • Product types and categories
  • Delivery and return timestamps
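The purchase-behavior features above (order frequency, spend, recency) can be sketched as a simple aggregation over order rows. This is a minimal, dependency-free illustration, not the notebook's actual code: the project itself does this step with SQL and Pandas, and the tuple layout `(user_id, order_date, amount)`, the function name `build_customer_features`, and the `snapshot_date` parameter are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def build_customer_features(orders, snapshot_date):
    """Aggregate order rows into per-customer frequency, spend, and recency.

    `orders` is an iterable of (user_id, order_date, amount) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    totals = defaultdict(
        lambda: {"number_of_orders": 0, "total_spent": 0.0, "last_order": None}
    )
    for user_id, order_date, amount in orders:
        row = totals[user_id]
        row["number_of_orders"] += 1
        row["total_spent"] += amount
        # Track the most recent order to derive recency later.
        if row["last_order"] is None or order_date > row["last_order"]:
            row["last_order"] = order_date

    features = {}
    for user_id, row in totals.items():
        features[user_id] = {
            "number_of_orders": row["number_of_orders"],
            "total_spent": round(row["total_spent"], 2),
            # Recency: days between the last order and the snapshot date.
            "recency_days": (snapshot_date - row["last_order"]).days,
        }
    return features


orders = [
    (1, date(2023, 1, 10), 40.0),
    (1, date(2023, 3, 5), 60.0),
    (2, date(2023, 2, 1), 25.0),
]
features = build_customer_features(orders, snapshot_date=date(2023, 6, 1))
```

Computing recency against an explicit snapshot date, rather than "today", keeps the features reproducible when the notebook is rerun later.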

🔧 Tools & Techniques

| Area | Tools / Methods Used |
| --- | --- |
| Data Extraction | SQL on BigQuery |
| Data Wrangling | Pandas, NumPy |
| Churn Definition | Recency logic & Kaplan-Meier survival modeling |
| Exploratory Analysis | Seaborn, Matplotlib, descriptive statistics |
| Survival Analysis | Kaplan-Meier Estimator (lifelines) |
| Modeling (optional) | XGBoost |
| Evaluation | ROC-AUC, Confusion Matrix, Precision-Recall |
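To make the evaluation metrics listed above concrete, here is a small plain-Python sketch of a confusion matrix, precision and recall, and ROC-AUC. It is an illustration only: in practice these would normally come from Scikit-learn, and the function names, toy labels, and the 0.5 decision threshold below are assumptions rather than values taken from the project.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def precision_recall(counts):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """Probability that a random churner scores above a random non-churner.

    Ties count as half a win. Requires at least one example of each class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and predicted churn probabilities.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

counts = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(counts)
auc = roc_auc(y_true, scores)
```

ROC-AUC is threshold-independent, while the confusion matrix and precision/recall depend on the chosen cutoff, so reporting both gives a fuller picture for an imbalanced target such as churn.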

📊 Key Analyses & Findings

  • Kaplan-Meier survival curve shows that ~50% of customers never return after their first purchase.
  • Customers who make a second purchase are far more likely to remain active for extended periods (500+ days).
  • A fixed churn threshold (e.g. 90 days) may underestimate customer lifetime for loyal buyers — suggesting the need for time-aware churn models.
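The contrast between a fixed recency threshold and a survival curve can be sketched in a few lines. The project uses the Kaplan-Meier estimator from lifelines; the version below is a hand-written, dependency-free stand-in for illustration, and the toy durations, event flags, and function names are assumptions rather than project data.

```python
def label_churn(recency_days, threshold_days=90):
    """Fixed-threshold rule: churned if the last purchase is older than the threshold."""
    return 1 if recency_days > threshold_days else 0


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: time each customer was followed (e.g. days).
    observed:  True if churn was observed at that time, False if censored
               (the customer was still active when observation ended).
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Customers still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Churn events occurring exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Toy data (hypothetical): five customers, two of them censored.
durations = [30, 30, 90, 200, 400]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
labels = [label_churn(days) for days in (45, 90, 120)]
```

Unlike the fixed rule, the estimator accounts for censored customers: someone observed for 90 days without churning still counts as "at risk" up to that point instead of being forced into a churn or non-churn label.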

📁 Folder Structure

├──
│ ├── 01_data preparation.ipynb     -> Extraction from BigQuery, feature engineering, and dataset consolidation
│ ├── 02_model_development.ipynb    -> XGBoost training, parameter optimization, and validation
│ └── 03_model_interpretation.ipynb -> Interpreting the model's results, feature importance, and SHAP
├── processed_data/
│ └── clients_info.csv
├── model
│ └── churn_model.pkl
├── app.py                       -> Streamlit interface for deploying the model
├── README.md
└── requirements.txt

✅ Skills Demonstrated

  • SQL data extraction from public cloud datasets
  • End-to-end churn analysis using Python
  • Kaplan-Meier and survival modeling
  • Feature engineering for customer behavior
  • Business-driven data storytelling and interpretation

Local Deployment with Streamlit

This project includes a web application built using Streamlit, allowing you to interact with the churn prediction model directly from your browser.

▶️ How to Run Locally

  1. Install Dependencies
    Make sure you have Python installed (version 3.8 or higher). Then, install the required packages with:

    pip install -r requirements.txt
    
  2. Run the App
    In the root directory of the project, run:

    streamlit run app.py
    
  3. Access the App
    After executing the command above, Streamlit will automatically start a local server and display a URL in your terminal, such as:

    http://localhost:8501
    

App Features (Technical Overview)

  • 🧠 Model Deployment: Integrates a production-ready XGBoost classification model for churn prediction
  • 🧾 Manual Data Input: Accepts user-defined inputs including age, gender, number_of_orders, and total_spent
  • 🧮 Dynamic Feature Engineering: Automatically computes average_ticket as a derived feature (total_spent / number_of_orders)
  • 📈 Churn Inference: Outputs binary churn prediction (0 = active, 1 = churn) in real time
  • 🧠 Model Explainability: Integrates SHAP (SHapley Additive Explanations) to generate global and local interpretability visualizations
  • 📊 Visual Insights: Includes force plots and summary plots to showcase feature impact on predictions
  • 🚀 End-to-End Pipeline: Demonstrates the full ML lifecycle — from data preprocessing to model inference and explainability — in a single interactive interface
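As a rough sketch of the dynamic feature engineering step described above, the function below assembles one input row and derives `average_ticket` as `total_spent / number_of_orders`. The function name `build_model_input`, the dictionary layout, and the zero-order guard are assumptions for illustration; the actual `app.py` may encode inputs (for example `gender`) differently before passing them to the model.

```python
def build_model_input(age, gender, number_of_orders, total_spent):
    """Assemble one feature row, deriving average_ticket from the raw inputs."""
    # Guard against division by zero for customers with no orders.
    if number_of_orders <= 0:
        raise ValueError("number_of_orders must be positive to compute average_ticket")
    return {
        "age": age,
        "gender": gender,
        "number_of_orders": number_of_orders,
        "total_spent": total_spent,
        "average_ticket": total_spent / number_of_orders,
    }


row = build_model_input(age=34, gender="F", number_of_orders=4, total_spent=250.0)
```

Deriving the feature inside the app, rather than asking the user for it, keeps the inference input consistent with how the feature was computed during training.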
