Predict heart disease risk using machine learning.
# Project Structure
heart-disease-prediction/
├── **data/**
│ ├── **raw/** # Raw dataset (`project 2.csv`)
│ └── **processed/** # Cleaned/preprocessed data (e.g., `cleaned_data.csv`)
├── **notebooks/** # Jupyter notebooks for analysis
│ ├── `eda.ipynb` # Exploratory Data Analysis
│ └── `model_training.ipynb` # Model experiments and evaluation
├── **src/** # Python scripts
│ ├── `data_preprocessing.py` # Data cleaning/feature engineering
│ ├── `train_model.py` # Model training and tuning
│ └── `app.py` # (Optional) CLI/Flask deployment if we have time
├── **models/** # Trained models (e.g., `random_forest.pkl`)
├── **results/** # Visualizations, metrics, and reports
│ ├── `confusion_matrix.png`
│ └── `feature_importance.png`
├── **requirements.txt** # Python dependencies
├── **README.md** # Project overview and instructions
└── **OTHERS** # MIT License (or others if we need any)
- `data/raw`: Contains the original unprocessed dataset.
- `data/processed`: Stores cleaned data after preprocessing.
- `notebooks`: For exploratory analysis and model prototyping.
- `src`: Reusable Python scripts for data cleaning, modeling, and deployment.
- `models`: Saves trained models for later use.
- `results`: Stores visualizations, performance metrics, and reports.
- Use the branches `prajith-ravisankar`, `emilio-santamaria`, and `lasombra7`.
- Merge changes into `dev` for daily collaboration once we agree on them.
- Final stable code goes into `main` (protected branch).
Goal: Initialize repository, define roles, and finalize requirements.
- Main Todo 1.1: GitHub Setup
  - Sub-todo 1.1.1: Create a GitHub repository.
  - Sub-todo 1.1.2: Create branches:
    - `main`: Protected branch for final merges.
    - `dev`: Shared development branch for daily work.
    - `prajith-ravisankar`
    - `emilio-santamaria`: teammate has to create their branch and confirm.
    - `lasombra7`: teammate has to accept the GitHub invite and create their branch to start contributing.
- Main Todo 1.2: Requirements Finalization
  - Sub-todo 1.2.1: Review the PDF requirements and start working on:
    - Data cleaning steps (missing values, outliers).
    - Models to compare (e.g., Logistic Regression, Random Forest, XGBoost).
    - Metrics (accuracy, F1-score, AUC-ROC).
  - Sub-todo 1.2.2: Confirm the communication platform (we are using Discord).
Goal: Clean and preprocess the dataset.
- Main Todo 2.1: Data Cleaning
  - Sub-todo 2.1.1: Load `project 2.csv` and inspect for:
    - Missing values (e.g., empty `Cholesterol` or `Blood Pressure` entries).
    - Duplicate rows.
    - Outliers (e.g., `Age` > 100, `Blood Pressure` > 200); how to handle them is still undecided.
    - etc.
  - Sub-todo 2.1.2: Handle missing values:
    - Use KNNImputer or IterativeImputer for advanced imputation.
  - Sub-todo 2.1.3: Encode categorical variables:
    - `Gender`: Male=0, Female=1.
- Main Todo 2.2: Feature Engineering
  - Sub-todo 2.2.1: Split data into features (`X`) and target (`y`).
  - Sub-todo 2.2.2: Scale numerical features:
    - Use StandardScaler (Z-score normalization) for algorithms like SVM or Logistic Regression.
  - Sub-todo 2.2.3: Feature selection:
    - Use SelectKBest or RFE (Recursive Feature Elimination) to reduce dimensionality.
  - Sub-todo 2.2.4: Save preprocessed data as `cleaned_data.csv`.
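
The cleaning and feature engineering todos above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (a hypothetical stand-in for `project 2.csv`), the injected gaps and duplicates are made up, `k=2` and `n_neighbors=5` are arbitrary choices, and the output is written to a temporary directory rather than `data/processed/`. Outliers are only flagged, since their handling is still undecided. In a final pipeline the scaler would normally be fit on the training split only.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for `project 2.csv`; column names follow this README.
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "Age": rng.integers(30, 80, n).astype(float),
    "Gender": rng.choice(["Male", "Female"], n),
    "Cholesterol": rng.normal(220, 40, n),
    "Blood Pressure": rng.normal(135, 15, n),
})
risk = 0.04 * (raw["Age"] - 55) + 0.01 * (raw["Cholesterol"] - 220)
raw["Heart Disease"] = (risk + rng.normal(0, 0.5, n) > 0).astype(int)
# Inject 15 missing values (rows 5 and later) and 5 duplicate rows (copies of rows 0-4).
raw.loc[rng.choice(np.arange(5, n), 15, replace=False), "Cholesterol"] = np.nan
raw = pd.concat([raw, raw.iloc[:5]], ignore_index=True)

# 2.1.1 Inspect: missing values, duplicate rows, candidate outliers (flagged only).
missing_per_column = raw.isna().sum()
n_duplicates = int(raw.duplicated().sum())
outliers = raw[(raw["Age"] > 100) | (raw["Blood Pressure"] > 200)]

# 2.1.3 Drop duplicates and encode Gender (Male=0, Female=1).
df = raw.drop_duplicates().reset_index(drop=True)
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

# 2.2.1 Split into features X and target y.
X = df.drop(columns="Heart Disease")
y = df["Heart Disease"]

# 2.1.2 Impute missing feature values with KNNImputer.
X = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(X), columns=X.columns)

# 2.2.2 Z-score scaling of the numerical features.
numeric_cols = ["Age", "Cholesterol", "Blood Pressure"]
X[numeric_cols] = StandardScaler().fit_transform(X[numeric_cols])

# 2.2.3 Keep the k best features by ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])

# 2.2.4 Save the preprocessed data (temporary directory in this sketch).
out_path = Path(tempfile.mkdtemp()) / "cleaned_data.csv"
pd.concat([X[selected], y], axis=1).to_csv(out_path, index=False)

print(missing_per_column)
print(f"duplicates: {n_duplicates}, flagged outliers: {len(outliers)}")
print("selected features:", selected)
```

`IterativeImputer` or `RFE` could be swapped in at the imputation and selection steps; note that scikit-learn requires `from sklearn.experimental import enable_iterative_imputer` before `IterativeImputer` can be imported.
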
Goal: Generate insights and visualizations.
- Main Todo 3.1: Univariate Analysis
  - Sub-todo 3.1.1: Plot distributions for:
    - `Age`, `Cholesterol`, `Blood Pressure` (histograms).
    - `Heart Disease` (pie chart for class balance).
  - Sub-todo 3.1.2: Document observations (e.g., "30% of patients have heart disease").
- Main Todo 3.2: Bivariate/Multivariate Analysis
  - Sub-todo 3.2.1: Correlation heatmap (features vs. `Heart Disease`).
  - Sub-todo 3.2.2: Boxplots for `Cholesterol` vs. `Heart Disease`.
  - Sub-todo 3.2.3: Pairplot for key features (e.g., `Age`, `Blood Pressure`).
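
One possible shape for the plots listed above, using only pandas and matplotlib. The data here is synthetic and hypothetical, and pandas' `scatter_matrix` is used as a stand-in for a pairplot; a dedicated plotting library could replace it in `eda.ipynb`.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical synthetic data; the real analysis would load the cleaned dataset.
rng = np.random.default_rng(1)
n = 300
age = rng.integers(30, 80, n).astype(float)
df = pd.DataFrame({
    "Age": age,
    "Cholesterol": rng.normal(220, 40, n) + 0.5 * (age - 55),
    "Blood Pressure": rng.normal(135, 15, n) + 0.3 * (age - 55),
})
risk = 0.04 * (df["Age"] - 55) + 0.01 * (df["Cholesterol"] - 220)
df["Heart Disease"] = (risk + rng.normal(0, 0.5, n) > 0).astype(int)

# 3.1 Univariate: histograms plus a pie chart for class balance.
numeric_cols = ["Age", "Cholesterol", "Blood Pressure"]
fig_uni, axes_uni = plt.subplots(1, 4, figsize=(16, 4))
for ax, col in zip(axes_uni[:3], numeric_cols):
    ax.hist(df[col], bins=20)
    ax.set_title(col)
counts = df["Heart Disease"].value_counts().sort_index()
axes_uni[3].pie(counts.to_numpy(), labels=["No disease", "Disease"], autopct="%1.0f%%")
axes_uni[3].set_title("Heart Disease")
share = df["Heart Disease"].mean()
print(f"{share:.0%} of the synthetic patients have heart disease")

# 3.2.1 Correlation heatmap, including the target column.
corr = df.corr()
fig_corr, ax_corr = plt.subplots(figsize=(5, 4))
image = ax_corr.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax_corr.set_xticks(range(len(corr.columns)))
ax_corr.set_xticklabels(corr.columns, rotation=45, ha="right")
ax_corr.set_yticks(range(len(corr.columns)))
ax_corr.set_yticklabels(corr.columns)
fig_corr.colorbar(image, ax=ax_corr)

# 3.2.2 Boxplot of Cholesterol grouped by Heart Disease.
fig_box, ax_box = plt.subplots()
df.boxplot(column="Cholesterol", by="Heart Disease", ax=ax_box)

# 3.2.3 Scatter matrix as a pairplot stand-in.
pair_axes = pd.plotting.scatter_matrix(df[numeric_cols], figsize=(6, 6))
```
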
Goal: Train and compare baseline models.
- [x] Main Todo 4.1: Model Training
  - Sub-todo 4.1.1: Train 3 models:
    - Logistic Regression.
    - Random Forest.
    - XGBoost.
  - Sub-todo 4.1.2: Use `train_test_split` (80-20 split).
- Main Todo 4.2: Baseline Evaluation
  - Sub-todo 4.2.1: Calculate metrics:
    - Accuracy, Precision, Recall, F1-score, AUC-ROC.
  - Sub-todo 4.2.2: Document results in a shared spreadsheet.
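
A minimal sketch of the baseline training and evaluation loop described above. It assumes synthetic data from `make_classification` in place of the real dataset, and it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so the example depends only on scikit-learn; `xgboost.XGBClassifier` could be substituted. The `evaluate` helper and the metric column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: stands in for cleaned_data.csv).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)

# 4.1.2 80-20 split, stratified so both splits keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 4.1.1 Three baseline models (gradient boosting stands in for XGBoost).
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}


def evaluate(model, X_eval, y_eval):
    """Return the 4.2.1 metrics for a fitted binary classifier."""
    y_pred = model.predict(X_eval)
    y_score = model.predict_proba(X_eval)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y_eval, y_pred),
        "precision": precision_score(y_eval, y_pred, zero_division=0),
        "recall": recall_score(y_eval, y_pred, zero_division=0),
        "f1": f1_score(y_eval, y_pred, zero_division=0),
        "auc_roc": roc_auc_score(y_eval, y_score),
    }


rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    rows[name] = evaluate(model, X_test, y_test)

# One row per model; this table could be exported for the shared spreadsheet.
results = pd.DataFrame.from_dict(rows, orient="index")
print(results.round(3))
```

Calling `results.to_csv(...)` would give a file that can be pasted into the shared spreadsheet from Sub-todo 4.2.2.
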
Goal: Improve model performance.
- Main Todo 5.1: Hyperparameter Tuning
  - Sub-todo 5.1.1: Use `GridSearchCV` or `RandomizedSearchCV` for:
    - Random Forest (tune `n_estimators`, `max_depth`).
    - XGBoost (tune `learning_rate`, `max_depth`).
  - Sub-todo 5.1.2: Re-evaluate metrics post-tuning.
- Main Todo 5.2: Feature Importance
  - Sub-todo 5.2.1: Plot feature importance for the best model.
  - Sub-todo 5.2.2: Identify top 5 risk factors (e.g., `Cholesterol`, `Age`).
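
The tuning and feature importance steps above might look something like this for the Random Forest case. Everything specific here is an assumption for illustration: the synthetic data, the placeholder feature names (`feature_0` to `feature_7`), the small parameter grid, the F1 scoring choice, and the temporary output path. An XGBoost grid over `learning_rate` and `max_depth` would follow the same pattern.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with placeholder feature names (hypothetical, not the real columns).
feature_names = [f"feature_{i}" for i in range(8)]
X_raw, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
X = pd.DataFrame(X_raw, columns=feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5.1.1 Grid search over n_estimators and max_depth with 3-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, scoring="f1", cv=3
)
search.fit(X_train, y_train)
best_model = search.best_estimator_

# 5.1.2 Re-evaluate on the held-out test split after tuning.
tuned_f1 = f1_score(y_test, best_model.predict(X_test))
print("best params:", search.best_params_, "test F1:", round(tuned_f1, 3))

# 5.2 Rank features by impurity-based importance and keep the top 5.
importances = pd.Series(
    best_model.feature_importances_, index=feature_names
).sort_values(ascending=False)
top5 = importances.head(5)

fig, ax = plt.subplots()
top5.sort_values().plot.barh(ax=ax)  # ascending order puts the largest bar on top
ax.set_xlabel("Importance")
fig.tight_layout()
out_path = Path(tempfile.mkdtemp()) / "feature_importance.png"
fig.savefig(out_path)
```
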
Goal: Prepare a simple deployment and final report.
- Main Todo 6.1: Deployment
  - Sub-todo 6.1.1: Create a `predict()` function for new data.
  - Sub-todo 6.1.2: Build a basic CLI or Flask app for predictions.
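
As a rough idea of what `predict()` could look like: the sketch below trains a toy Random Forest on synthetic data, pickles it as `random_forest.pkl` in a temporary directory, and loads it again inside `predict()`. The feature list, the function signature, the return format, and the example patient are all assumptions; the real function would load the model saved under `models/`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed feature order; the real list depends on the final preprocessing.
FEATURES = ["Age", "Gender", "Cholesterol", "Blood Pressure"]

# Train a toy model on synthetic data so the sketch is self-contained.
rng = np.random.default_rng(3)
n = 300
train = pd.DataFrame({
    "Age": rng.integers(30, 80, n).astype(float),
    "Gender": rng.integers(0, 2, n),
    "Cholesterol": rng.normal(220, 40, n),
    "Blood Pressure": rng.normal(135, 15, n),
})
risk = 0.04 * (train["Age"] - 55) + 0.01 * (train["Cholesterol"] - 220)
target = (risk + rng.normal(0, 0.5, n) > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[FEATURES], target)

# Persist the model (temporary directory here; `models/` in the project layout).
model_path = Path(tempfile.mkdtemp()) / "random_forest.pkl"
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)


def predict(patient: dict, path: Path = model_path) -> dict:
    """Load the pickled model and score one patient given as a feature dict."""
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)
    row = pd.DataFrame([patient], columns=FEATURES)  # enforce the column order
    return {
        "heart_disease": int(loaded.predict(row)[0]),
        "probability": float(loaded.predict_proba(row)[0, 1]),
    }


example = {"Age": 67, "Gender": 0, "Cholesterol": 280.0, "Blood Pressure": 150.0}
result = predict(example)
print(result)
```

A CLI or Flask route would then only need to parse its input into the same dictionary shape and call `predict()`. Pickle files should only be loaded from trusted sources.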