This repository contains code and analyses for building an XGBoost model that classifies individuals with diabetes based on healthcare and lifestyle survey data. The primary focus is feature selection: using a variety of methods to reduce the number of features while maximising the F1 score.
+---notebooks
| +---Feature_Importance_Cheatsheet <- Example code for computing feature importance for an XGBoost model.
| +---Feature_Importance_Workshop <- Trains an XGBoost model and performs feature selection.
| +---load_diabetes_data <- Loads the data from the UCI repository and balances the classes.
|
| README.md <- Quick start guide
The Feature_Importance_Workshop and Feature_Importance_Cheatsheet notebooks run in Google Colab.
This dataset comprises healthcare statistics and lifestyle survey information about individuals in the United States. It was collected by the Centers for Disease Control and Prevention (CDC) and is publicly available through the UCI Machine Learning Repository.
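The class balancing done in load_diabetes_data can be sketched as downsampling the majority class; the column name and tiny inline data below are illustrative assumptions, and the real notebook may balance differently.

```python
# Illustrative sketch of balancing an imbalanced binary label by
# downsampling each class to the size of the smallest class.
import pandas as pd

df = pd.DataFrame({
    "bmi": range(10),
    "Diabetes_binary": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],  # 7 vs 3: imbalanced
})

n_min = df["Diabetes_binary"].value_counts().min()
balanced = (
    df.groupby("Diabetes_binary")
      .sample(n=n_min, random_state=0)   # draw n_min rows from each class
      .reset_index(drop=True)
)
print(balanced["Diabetes_binary"].value_counts())
```

Downsampling discards majority-class rows, which is acceptable here because the CDC dataset is large; for smaller datasets, upsampling the minority class or class weights would waste less data.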
Citation: Markelle Kelly, Rachel Longjohn, and Kolby Nottingham, The UCI Machine Learning Repository.
The work in this repository has been influenced by a number of helpful articles and tutorials:
- Feature Importance and Feature Selection with XGBoost in Python
- Calculate Feature Importance with Python
- Feature Selection with Real and Categorical Data
- A Guide to 21 Feature Importance Methods and Packages in Machine Learning
- Why You Should Stop Using Recursive Feature Elimination
- Python Feature Importance Libraries
- Best Practice to Calculate and Interpret Model Feature Importance
- XGBoost Feature Importance