Predicting when a customer is likely to stop buying is one of the most valuable capabilities for any subscription-based or transactional business. This project uses real-world eCommerce data to develop a machine learning model that identifies churn risk, helping the company take action before losing valuable clients.
Goal:
Build a machine learning model to predict customer churn and uncover retention insights based on behavioral data from the TheLook eCommerce dataset.
Business Impact:
By identifying customers likely to churn, marketing teams can implement re-engagement strategies and loyalty campaigns, increasing Customer Lifetime Value (CLV) and Revenue Retention.
The dataset consists of 7 structured tables related to users, orders, products, inventory, and transactions. Data was extracted using custom SQL queries, merged in Python (Pandas), and cleaned for modeling.
Key features include:
- Customer demographics (age, gender, location)
- Purchase behavior (order frequency, spend, recency)
- Product types and categories
- Delivery and return timestamps
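As a rough illustration of the extraction-and-merge step, the snippet below pulls two of the source tables from the public TheLook dataset on BigQuery and joins them with Pandas. The selected columns are examples only; the project's actual queries and feature set may differ.

```python
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()  # requires GCP credentials

# Pull a subset of columns from two of the seven source tables
users = client.query("""
    SELECT id AS user_id, age, gender, country
    FROM `bigquery-public-data.thelook_ecommerce.users`
""").to_dataframe()

orders = client.query("""
    SELECT order_id, user_id, status, created_at
    FROM `bigquery-public-data.thelook_ecommerce.orders`
""").to_dataframe()

# Merge into a single order-level frame for downstream feature engineering
df = orders.merge(users, on="user_id", how="left")
```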
| Area | Tools / Methods Used |
|---|---|
| Data Extraction | SQL on BigQuery |
| Data Wrangling | Pandas, NumPy |
| Churn Definition | Recency logic & Kaplan-Meier survival modeling |
| Exploratory Analysis | Seaborn, Matplotlib, descriptive statistics |
| Survival Analysis | Kaplan-Meier Estimator (lifelines) |
| Modeling (optional) | XGBoost |
| Evaluation | ROC-AUC, Confusion Matrix, Precision-Recall |
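As a sketch of the recency-based churn definition listed above, something along these lines labels a customer as churned when no purchase falls within a fixed window; the toy data, column names, and the 90-day threshold are illustrative assumptions rather than the project's exact values.

```python
import pandas as pd

# Toy order-level frame standing in for the merged BigQuery extract
orders = pd.DataFrame({
    "user_id":    [1, 1, 2, 3],
    "created_at": pd.to_datetime(["2023-01-05", "2023-06-20", "2023-02-10", "2023-07-01"]),
})

cutoff = orders["created_at"].max()  # end of the observation window

# Recency = days since each customer's most recent purchase
recency = (
    orders.groupby("user_id")["created_at"].max()
    .rename("last_purchase")
    .reset_index()
)
recency["recency_days"] = (cutoff - recency["last_purchase"]).dt.days

# Fixed-threshold definition: churned if no purchase in the last 90 days
recency["churned"] = (recency["recency_days"] > 90).astype(int)
print(recency)
```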
- Kaplan-Meier survival curve shows that ~50% of customers never return after their first purchase.
- Customers who make a second purchase are far more likely to remain active for extended periods (500+ days).
- A fixed churn threshold (e.g. 90 days) may underestimate customer lifetime for loyal buyers — suggesting the need for time-aware churn models.
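For reference, a survival fit like the one behind these findings can be produced with lifelines roughly as follows; the duration and event columns (`tenure_days`, `churned`) are assumed names, not necessarily those used in the notebooks.

```python
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Assumed per-customer frame: days of observed activity and whether churn was observed
clients = pd.read_csv("processed_data/clients_info.csv")

kmf = KaplanMeierFitter()
kmf.fit(durations=clients["tenure_days"], event_observed=clients["churned"])

ax = kmf.plot_survival_function()
ax.set(title="Customer survival after first purchase",
       xlabel="Days since first purchase",
       ylabel="Probability of remaining active")
plt.show()

# Median survival time: the point where ~50% of customers have churned
print(kmf.median_survival_time_)
```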
├──
│ ├── 01_data preparation.ipynb -> extraction from BigQuery, Feature Engineering and Dataset Consolidation
│ ├── 02_model_development.ipynb -> XGBoost training, parameter optimization and validation
│ └── 03_model_interpretation.ipynb -> interpreting model results, Feature Importance and SHAP
├── processed_data/
│ └── clients_info.csv
├── model/
│ └── churn_model.pkl
├── app.py -> Streamlit interface for model deployment
├── README.md
└── requirements.txt
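As a hedged outline of what 02_model_development.ipynb covers, the snippet below trains an XGBoost classifier on the consolidated customer table, evaluates it with the metrics listed above, and pickles it for the Streamlit app. Feature names, the target column, and hyperparameters are illustrative assumptions.

```python
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
from xgboost import XGBClassifier

clients = pd.read_csv("processed_data/clients_info.csv")

# Assumed feature and target columns
X = clients[["age", "number_of_orders", "total_spent", "average_ticket"]]
y = clients["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                      eval_metric="logloss")
model.fit(X_train, y_train)

# Evaluation: ROC-AUC, confusion matrix, precision/recall
proba = model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, proba))
print(confusion_matrix(y_test, model.predict(X_test)))
print(classification_report(y_test, model.predict(X_test)))

# Persist the fitted model for the Streamlit app
joblib.dump(model, "model/churn_model.pkl")
```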
- SQL data extraction from public cloud datasets
- End-to-end churn analysis using Python
- Kaplan-Meier and survival modeling
- Feature engineering for customer behavior
- Business-driven data storytelling and interpretation
This project includes a web application built using Streamlit, allowing you to interact with the churn prediction model directly from your browser.
- **Install Dependencies**: Make sure you have Python installed (version 3.8 or higher), then install the required packages with `pip install -r requirements.txt`.
- **Run the App**: In the root directory of the project, run `streamlit run app.py`.
- **Access the App**: After executing the command above, Streamlit will automatically start a local server and display a URL in your terminal, such as `http://localhost:8501`.
- 🧠 Model Deployment: Integrates a production-ready XGBoost classification model for churn prediction
- 🧾 Manual Data Input: Accepts user-defined inputs including `age`, `gender`, `number_of_orders`, and `total_spent`
- 🧮 Dynamic Feature Engineering: Automatically computes `average_ticket` as a derived feature (`total_spent / number_of_orders`)
- 📈 Churn Inference: Outputs binary churn prediction (`0` = active, `1` = churn) in real time
- 🧠 Model Explainability: Integrates SHAP (SHapley Additive Explanations) to generate global and local interpretability visualizations
- 📊 Visual Insights: Includes force plots and summary plots to showcase feature impact on predictions
- 🚀 End-to-End Pipeline: Demonstrates the full ML lifecycle, from data preprocessing to model inference and explainability, in a single interactive interface
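A minimal sketch of how such an app can wire manual inputs to the pickled model and a SHAP explanation is shown below. It is not the repository's exact `app.py`; the feature order and the gender encoding are assumptions.

```python
import joblib
import matplotlib.pyplot as plt
import pandas as pd
import shap
import streamlit as st

model = joblib.load("model/churn_model.pkl")

st.title("Customer Churn Prediction")

# Manual data input
age = st.number_input("Age", min_value=18, max_value=100, value=35)
gender = st.selectbox("Gender", ["F", "M"])
number_of_orders = st.number_input("Number of orders", min_value=1, value=3)
total_spent = st.number_input("Total spent", min_value=0.0, value=150.0)

# Dynamic feature engineering: derived feature described above
average_ticket = total_spent / number_of_orders

features = pd.DataFrame([{
    "age": age,
    "gender": 1 if gender == "M" else 0,  # assumed label encoding
    "number_of_orders": number_of_orders,
    "total_spent": total_spent,
    "average_ticket": average_ticket,
}])

if st.button("Predict churn"):
    # Churn inference: 0 = active, 1 = churn
    prediction = int(model.predict(features)[0])
    st.write("Prediction:", "1 = churn" if prediction == 1 else "0 = active")

    # Local SHAP explanation for this single input row
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(features)
    shap.force_plot(explainer.expected_value, shap_values[0],
                    features.iloc[0], matplotlib=True, show=False)
    st.pyplot(plt.gcf())
```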