Stack Overflow Developer Salary Prediction

A complete end-to-end machine learning project predicting developer salaries using the Stack Overflow 2023 Survey dataset. Built as part of mastering Chapter 2 concepts from "Hands-On Machine Learning" by Aurélien Géron.

🎯 Project Overview

Goal: Predict yearly developer salaries (ConvertedCompYearly) using survey responses
Dataset: Stack Overflow 2023 Survey (~48K salary records)
Final Performance: $52,569 RMSE using ensemble methods

🚀 Key Results

88% Performance Improvement: From $432K RMSE (initial disaster) to $52K RMSE (final ensemble)
Advanced Pipeline: Custom transformers, target encoding, and leak-proof preprocessing
Ensemble Victory: VotingRegressor combining the top-performing models, including Random Forest and Gradient Boosting
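The improvement figure above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names rmse and relative_improvement are illustrative and not taken from the project code, and the $432K starting value is the rounded number quoted above.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def relative_improvement(before, after):
    """Fraction by which the error dropped, relative to the starting error."""
    return (before - after) / before


# Rounded initial RMSE and the final ensemble RMSE quoted in this README.
improvement = relative_improvement(432_000, 52_569)
print(f"{improvement:.1%}")  # 87.8%, which rounds to the 88% quoted above
```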

🔧 Technical Architecture

Data Preprocessing Pipeline

Multi-label Categorical Handling: Processed semicolon-separated survey responses (languages, platforms, tools)
Strategic Bucketization: Reduced high-cardinality features using domain knowledge
Custom Transformers: Built AdvancedFeatureEngineer and OrgSizeBinner for pipeline integration
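One common way to handle semicolon-separated answers is to expand them into one indicator column per label. The sketch below uses pandas for this; the column name LanguageHaveWorkedWith and the sample values are assumptions for illustration, and the notebook may split these columns differently.

```python
import pandas as pd

# Toy stand-in for a multi-select survey column (values are illustrative).
df = pd.DataFrame({
    "LanguageHaveWorkedWith": ["Python;SQL", "JavaScript;Python", None, "SQL"],
})

# Split on ';' and expand into one 0/1 indicator column per label.
# Respondents who skipped the question end up with an all-zero row.
lang_flags = (
    df["LanguageHaveWorkedWith"]
    .fillna("")
    .str.get_dummies(sep=";")
    .add_prefix("lang_")
)

print(lang_flags.columns.tolist())
```

Indicator columns like these can then be grouped or bucketized before modeling to keep the feature count manageable.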

Feature Engineering

Experience Metrics: Consistency ratios, professional experience factors, senior role indicators
Salary-Proportional Encoding: Target encoding for tech stack features based on actual salary impact
Outlier Management: Capped salaries ($50K-$750K) and experience years (max 30) for realistic predictions
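The capping and target-encoding ideas above can be sketched as follows. This is not the project's implementation: it assumes that "capped" means clipping values rather than dropping rows, and the helper out_of_fold_target_encode is a hypothetical name. Each row is encoded with salary means computed on the other folds only, which is one way to keep a row's own salary out of its encoding.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(categories, target, n_splits=3, seed=42):
    """Encode each category with the mean target computed on the other folds."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(categories):
        fit_target = target.iloc[fit_idx]
        fold_means = fit_target.groupby(categories.iloc[fit_idx]).mean()
        # Categories unseen in the fitting folds fall back to the fold-wide mean.
        values = categories.iloc[enc_idx].map(fold_means).fillna(fit_target.mean())
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Illustrative toy data, not survey values.
salary = pd.Series([30_000, 90_000, 120_000, 2_000_000, 60_000, 150_000])
years = pd.Series([2, 5, 40, 12, 1, 8])
language = pd.Series(["Python", "Rust", "Python", "Rust", "Go", "Python"])

salary_capped = salary.clip(lower=50_000, upper=750_000)  # $50K-$750K range
years_capped = years.clip(upper=30)                       # max 30 years
language_encoded = out_of_fold_target_encode(language, salary_capped)
```

In this toy data "Go" appears only once, so its encoding comes entirely from the fallback mean of other rows and never from its own salary.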

Model Development

Baseline Models: Linear Regression and Random Forest, establishing a $53-54K RMSE baseline
Advanced Algorithms: Gradient Boosting, XGBoost, Extra Trees, Ridge, ElasticNet, SVR
Hyperparameter Optimization: GridSearchCV with 108 parameter combinations across 3-fold CV
Final Ensemble: VotingRegressor combining top 3 models for $52,569 RMSE
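The search-then-ensemble flow above might look roughly like the sketch below. It runs on synthetic data, and both parameter grids are hypothetical: full_grid only illustrates how a grid can multiply out to 108 combinations (324 fits with 3-fold CV), while small_grid keeps the example fast. The ensemble here uses two models for brevity.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

# Synthetic stand-in for the survey features and salaries.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid: 3 * 4 * 3 * 3 = 108 combinations.
full_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
n_combinations = len(ParameterGrid(full_grid))

# A much smaller grid so the sketch runs in seconds.
small_grid = {"n_estimators": [20, 40], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    small_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# Average the tuned forest with a gradient boosting model.
ensemble = VotingRegressor([
    ("rf", search.best_estimator_),
    ("gb", GradientBoostingRegressor(random_state=42)),
])
ensemble.fit(X_train, y_train)
test_rmse = float(np.sqrt(mean_squared_error(y_test, ensemble.predict(X_test))))
```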

🐛 Major Debugging Victories

The random_state=42 Saga

Problem: Inconsistent train/test splits causing performance confusion across experiments
Solution: Restarted the environment and created a single train/test split with random_state=42, reused across all experiments
Learning: Data integrity trumps model complexity - always fix random seeds first
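A quick way to see why the fixed seed matters: two calls to train_test_split with the same random_state return identical splits, so every experiment is scored on the same test rows. The arrays below are toy data for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Same seed, same rows in the test set on every call.
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

# Each split is (X_train, X_test, y_train, y_test); compare the test targets.
same_test_rows = np.array_equal(split_a[3], split_b[3])
print(same_test_rows)  # True
```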

The OrgSize Double-Encoding Bug

Problem: Organization size was pre-encoded to integers, and the custom transformer then binned the already-encoded values
Impact: Artificially inflated feature importance (27.8%) and poor model performance
Solution: Pass raw categorical data to transformers, encode only once in pipeline
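The sketch below illustrates one way this kind of double encoding can go wrong; it is not the project's OrgSizeBinner. The bin mapping and the helper bin_org_size are hypothetical. A binner written for raw survey strings no longer recognizes anything once the column has been converted to integer codes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical bin mapping keyed on raw answer strings.
ORG_SIZE_BINS = {
    "2 to 9 employees": "small",
    "100 to 499 employees": "medium",
    "10,000 or more employees": "large",
}


def bin_org_size(values):
    """Map raw OrgSize strings to coarse bins; unrecognized values become 'unknown'."""
    return [ORG_SIZE_BINS.get(value, "unknown") for value in values]


raw = pd.Series(
    ["2 to 9 employees", "100 to 499 employees", "10,000 or more employees"]
)

# The mistake: encoding before binning. Alphabetical category codes also
# scramble the natural size order.
pre_encoded = raw.astype("category").cat.codes

binned_from_raw = bin_org_size(raw)            # bins are recovered
binned_from_codes = bin_org_size(pre_encoded)  # every value is unrecognized

# Encode once, after binning, with an explicit category order.
encoder = OrdinalEncoder(categories=[["small", "medium", "large", "unknown"]])
encoded_once = encoder.fit_transform(pd.DataFrame({"OrgSize": binned_from_raw}))
```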

Pipeline Integration Issues

Problem: Feature engineering was applied outside the pipeline, so models were trained on the original dataset without the engineered features
Solution: Custom transformer inheriting from BaseEstimator and TransformerMixin
Result: All 6 engineered features properly integrated into model training
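A minimal sketch of the pattern, assuming a stateless transformer: the class ExperienceFeatures below is a hypothetical stand-in for AdvancedFeatureEngineer, and the column names and the two derived features are illustrative rather than the project's six. Because the transformer is a pipeline step, the model always receives the engineered columns.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class ExperienceFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that adds two experience-based features."""

    def __init__(self, senior_threshold=10):
        self.senior_threshold = senior_threshold

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        X = X.copy()
        # Share of coding years spent working professionally (avoid division by zero).
        X["pro_ratio"] = X["YearsCodePro"] / X["YearsCode"].clip(lower=1)
        X["is_senior"] = (X["YearsCodePro"] >= self.senior_threshold).astype(int)
        return X


# Toy data for illustration only.
X = pd.DataFrame({"YearsCode": [4, 10, 20, 15], "YearsCodePro": [2, 5, 15, 12]})
y = np.array([60_000, 90_000, 150_000, 130_000])

pipeline = Pipeline([
    ("features", ExperienceFeatures(senior_threshold=10)),
    ("model", LinearRegression()),
])
pipeline.fit(X, y)

# The model saw the 2 original columns plus the 2 engineered ones.
n_features_seen = pipeline.named_steps["model"].n_features_in_
```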

📊 Feature Importance Insights

Organization Size (27.8%): Company size strongly correlates with salary ($160K → $75K gradient)
Years Professional Experience (22.7%): Experience premium confirmed across all models
Programming Languages (17.1%): Technology stack significantly impacts compensation
Web Frameworks (6.8%): Specialization in modern frameworks adds salary premium
Developer Type (6.5%): Role seniority affects compensation structure

🛠️ Technologies Used

Core: Python, pandas, scikit-learn, NumPy
Models: Random Forest, Gradient Boosting, XGBoost, Linear Regression, Ridge, ElasticNet, SVR
Preprocessing: Custom transformers, ColumnTransformer, StandardScaler
Evaluation: GridSearchCV, cross-validation, ensemble methods

Getting Started

Clone the repository:

git clone https://github.com/hasn77/Stack-Overflow-Survey-Dataset-Model.git

Install dependencies:
```
pip install -r requirements.txt
```

🎓 Key Learnings

Pipeline Design: Proper data flow prevents encoding errors and ensures reproducibility
Feature Engineering: Domain knowledge drives effective categorical variable handling
Debugging Mindset: Systematic problem-solving more valuable than complex algorithms
Ensemble Methods: Combining complementary algorithms captures different data patterns

🔄 Next Steps

Cross-validation strategies for robust evaluation
Feature selection techniques for optimal feature subset
Advanced ensemble methods (stacking, blending)
Deep learning approaches for complex pattern recognition

"The answer to life, the universe, and everything is 42... and sometimes it's also the random_state that saves your ML project!"
