
Heartbeat Classification Project

Overview

This project applies machine learning and deep learning architectures to classify heartbeats from ECG (electrocardiogram) signals. It combines two datasets: the MIT-BIH Arrhythmia Dataset (5 classes) and the PTB Diagnostic ECG Database (2 classes: normal/abnormal).

Project Difficulty: 8/10

Current Status: ✅ PROJECT COMPLETED - all modeling phases finished, with results exceeding the published benchmarks.

Key Results

Based on the final project report, our models achieved the following performance:

| Dataset | Model                     | Accuracy | Precision | Recall | F1 Score |
|---------|---------------------------|----------|-----------|--------|----------|
| MIT-BIH | CNN8 (Optimized)          | 98.51%   | 90.62%    | 94.24% | 92.36%   |
| PTB     | CNN8 + Transfer Learning  | 98.42%   | 97.51%    | 98.64% | 98.05%   |

Benchmark Comparison:

  • MIT-BIH: Exceeded benchmark [2] by ~5% (benchmark: 93.40%)
  • PTB: Exceeded benchmark [2] by ~2.5% (benchmark: 95.90%)

For detailed results and methodology, see the Final Report.

Datasets

  • mitbih_train.csv (87,554 samples, 188 features including label)
  • mitbih_test.csv (21,892 samples)
  • ptbdb_normal.csv (4,046 samples)
  • ptbdb_abnormal.csv (10,506 samples)

Each row represents one heartbeat segment with 187 time points + 1 label column.
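This layout can be sketched with pandas; the synthetic array below stands in for `pd.read_csv("data/original/mitbih_train.csv", header=None)`, assuming the header-less 188-column Kaggle format described above:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one of the heartbeat CSVs: 10 beats, 188 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 188)))

X = df.iloc[:, :-1].to_numpy()  # 187 time points per beat
y = df.iloc[:, -1].to_numpy()   # the final column is the class label
print(X.shape, y.shape)         # (10, 187) (10,)
```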

Data Quality Summary

Based on comprehensive data audit analysis:

| Dataset       | Samples | Features | Missing Values | Duplicates | Memory Usage |
|---------------|---------|----------|----------------|------------|--------------|
| MIT-BIH Train | 87,554  | 188      | 0              | 0          | 131.7 MB     |
| MIT-BIH Test  | 21,892  | 188      | 0              | 0          | 32.9 MB      |
| PTB Normal    | 4,046   | 188      | 0              | 1          | 6.1 MB       |
| PTB Abnormal  | 10,506  | 188      | 0              | 6          | 15.8 MB      |

Key Findings:

  • ✅ No missing values in any dataset
  • ✅ Clean data structure with consistent 188 features (187 ECG samples + 1 label)
  • ✅ Minimal duplicates (removed during preprocessing)
  • ✅ Memory-efficient data storage
  • ⚠️ Class imbalance present in both datasets (addressed with SMOTE sampling)

Project Structure

```
heartbeat_classification/
├── data/
│   ├── original/          # Raw Kaggle data (✅ Complete)
│   │   ├── mitbih_train.csv
│   │   ├── mitbih_test.csv
│   │   ├── ptbdb_normal.csv
│   │   └── ptbdb_abnormal.csv
│   ├── processed/         # Cleaned & preprocessed data (✅ Complete)
│   │   ├── mitbih/
│   │   └── ptb/
│   └── interim/           # Feature-engineered datasets (✅ Complete)
│       ├── mitbih_train_features.csv
│       ├── mitbih_test_features.csv
│       ├── ptbdb_normal_features.csv
│       └── ptbdb_abnormal_features.csv
├── notebooks/             # Analysis notebooks (✅ Complete)
│   ├── 01_data_exploration.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_A_*.ipynb       # MIT-BIH baseline models
│   ├── 03_B_*.ipynb       # PTB baseline models
│   ├── 04_A_*.ipynb       # MIT-BIH deep learning models
│   ├── 04_B_*.ipynb       # PTB deep learning models
│   ├── 05_A_DL_SHAP.ipynb # MIT-BIH interpretability
│   ├── 05_B_DL_SHAP.ipynb # PTB interpretability
│   └── archive/           # Archived development notebooks
│       ├── christian/
│       └── julia/
├── src/
│   ├── utils/             # Utility functions (✅ Complete)
│   │   ├── preprocessing.py
│   │   ├── evaluation.py
│   │   ├── model_saver.py
│   │   ├── audit_report.py
│   │   └── dl_architectures.py
│   └── visualization/     # Visualization tools (✅ Complete)
│       ├── visualization.py
│       └── confusion_matrix.py
├── models/                # Saved trained models (✅ Complete)
│   ├── MIT_02_01_baseline_models_randomized_search_no_sampling/
│   ├── MIT_02_02_baseline_models_randomized_search_sampling/
│   ├── MIT_02_03_dl_models/
│   └── PTB_04_02_dl_models/
├── reports/
│   ├── data_audit/        # Data quality reports (✅ Complete)
│   ├── baseline_models/   # Baseline model results (✅ Complete)
│   │   ├── MIT_02_01_RANDOMIZED_SEARCH/
│   │   └── MIT_02_02_RS_SAMPLING/
│   ├── deep_learning/     # Deep learning model results (✅ Complete)
│   │   ├── cnn8_transfer/
│   │   ├── models_optimization/
│   │   └── model_comparison.csv
│   ├── interpretability/  # SHAP analysis results (✅ Complete)
│   │   ├── SHAP_MIT/
│   │   └── SHAP_PTB/
│   └── renderings/        # Project reports (✅ Complete)
│       ├── 01_Rendering 1.pdf
│       ├── 02_Rendering2-Report.pdf
│       └── 03_Final Report.pdf
├── docs/                  # Project documentation (✅ Complete)
│   ├── knowledge/
│   └── ProjectRequirements/
├── tests/                 # Test suite
├── requirements.txt       # Dependencies (✅ Complete)
├── requirements-lock.txt  # Locked dependency versions (✅ Complete)
├── pyproject.toml         # Project configuration (✅ Complete)
├── README.md              # This file (✅ Complete)
└── CONTRIBUTING.md        # Contribution guidelines (✅ Complete)
```

Notebook Organization

The project notebooks follow a systematic numbering scheme:

  • 01-02: Data exploration and preprocessing
  • 03_A: MIT-BIH baseline models (RandomizedSearch, GridSearch, evaluation)
  • 03_B: PTB baseline models (LazyClassifier, GridSearch, evaluation)
  • 04_A: MIT-BIH deep learning models (CNN, DNN, LSTM, optimization)
  • 04_B: PTB deep learning models (Transfer learning)
  • 05_A/B: Model interpretability (SHAP analysis)

For detailed notebook documentation, see notebooks/README.md.

Archived Notebooks: Development notebooks from earlier iterations are preserved in notebooks/archive/ for reference.

Features & Capabilities

✅ Implemented Features

1. Data Quality Analysis

  • Comprehensive data audit system (src/utils/audit_report.py)
  • Automated generation of data quality reports for all datasets
  • Statistical analysis of missing values, duplicates, and data types
  • Memory usage and performance metrics
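The checks above boil down to a few pandas calls; this is a minimal sketch of that kind of audit (the real API of `src/utils/audit_report.py` is not shown here, and the `audit` function name is illustrative):

```python
import numpy as np
import pandas as pd

def audit(df: pd.DataFrame) -> dict:
    """Summarise sample count, missing values, duplicate rows, and memory."""
    return {
        "samples": len(df),
        "features": df.shape[1],
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
        "memory_mb": round(df.memory_usage(deep=True).sum() / 1e6, 1),
    }

# 100 identical rows -> 99 are flagged as duplicates, none missing.
df = pd.DataFrame(np.zeros((100, 188)))
report = audit(df)
print(report)
```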

2. Data Preprocessing & Feature Engineering

  • Complete preprocessing pipeline (src/utils/preprocessing.py)
  • Feature engineering with statistical and frequency domain features
  • Class imbalance handling with SMOTE (selected as optimal method)
  • Data validation and quality checks
  • Duplicate removal for PTB dataset
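The project uses the standard SMOTE implementation (from imbalanced-learn); purely to illustrate its core idea, this NumPy-only sketch synthesises minority samples by interpolating between a sample and one of its k nearest minority-class neighbours:

```python
import numpy as np

def smote_like(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic minority samples by linear interpolation
    between each chosen sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class; ignore self-distance.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    idx = rng.integers(0, len(X_min), size=n_new)          # base samples
    nbr = neighbours[idx, rng.integers(0, k, size=n_new)]  # random neighbour
    gap = rng.random((n_new, 1))                           # interpolation factor
    return X_min[idx] + gap * (X_min[nbr] - X_min[idx])

X_min = np.random.default_rng(1).random((20, 187))  # minority-class beats
X_new = smote_like(X_min, n_new=30)
print(X_new.shape)  # (30, 187)
```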

3. Baseline Model Development

  • Comprehensive model comparison framework
  • Multiple algorithms: XGBoost, Random Forest, SVM, Logistic Regression, KNN, Decision Tree, LDA, ANN
  • RandomizedSearch and GridSearch hyperparameter optimization
  • Performance metrics: Accuracy, Precision, Recall, F1-score
  • Model persistence and evaluation utilities
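The search setup can be sketched as follows, using Random Forest (one of the listed algorithms) on synthetic stand-in data; the actual notebooks run this over the engineered ECG features with larger parameter grids:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic multi-class stand-in for the engineered feature matrix.
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    n_iter=4, cv=3, scoring="f1_macro", random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```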

4. Deep Learning Models

  • CNN architectures (inspired by Kachuee et al. 2018)
  • DNN and LSTM models
  • Transfer learning from MIT-BIH to PTB dataset
  • Model optimization with dropout and batch normalization
  • Training on Google Colab with GPU acceleration
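The CNNs treat each beat as a 1-D signal; this NumPy fragment illustrates only the core operation of one such layer (a valid-mode 1-D convolution followed by ReLU), not the full residual architecture of Kachuee et al. 2018:

```python
import numpy as np

def conv1d_relu(signal: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 1-D cross-correlation followed by ReLU."""
    # np.convolve flips its second argument, so reversing the kernel
    # turns convolution into the cross-correlation CNNs actually compute.
    out = np.convolve(signal, kernel[::-1], mode="valid")
    return np.maximum(out, 0.0)

beat = np.sin(np.linspace(0, 6 * np.pi, 187))  # synthetic 187-point beat
edge_kernel = np.array([1.0, 0.0, -1.0])       # simple derivative-like filter
feature_map = conv1d_relu(beat, edge_kernel)
print(feature_map.shape)  # (185,)
```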

5. Model Interpretability

  • SHAP (SHapley Additive exPlanations) analysis
  • Feature importance visualization
  • Decision pattern analysis for both datasets
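SHAP itself requires the `shap` package; as a lighter-weight stand-in for the same question ("which inputs drive the prediction?"), this sketch uses scikit-learn's permutation importance on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy problem with 3 informative features out of 10.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in score.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranked = result.importances_mean.argsort()[::-1]
print("most important features:", ranked[:3])
```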

6. Visualization Tools

  • Advanced ECG plotting utilities (src/visualization/visualization.py)
  • Confusion matrix visualization (src/visualization/confusion_matrix.py)
  • Support for single and multiple heartbeat visualization
  • Peak detection and signal analysis capabilities
  • Customizable plotting parameters and export options
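Peak detection of the kind these utilities provide can be sketched directly with SciPy (the signal below is a synthetic spike train, not real ECG):

```python
import numpy as np
from scipy.signal import find_peaks

# 4 s of synthetic signal at 250 Hz with one sharp spike per second.
t = np.linspace(0, 4, 4 * 250)
ecg_like = np.sin(2 * np.pi * 1.0 * t) ** 21  # odd power keeps sharp positive peaks

# Height filters out troughs; distance enforces a refractory gap (in samples).
peaks, _ = find_peaks(ecg_like, height=0.5, distance=100)
print(len(peaks), "peaks found")  # 4 peaks, one per second
```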

Setup Instructions

1. Clone the Repository

```shell
git clone <repository-url>
cd heartbeat_classification
```

2. Create Virtual Environment

```shell
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install Dependencies

```shell
pip install -r requirements.txt
```

4. Download Data

Place the original data files in data/original/ directory:

  • mitbih_train.csv
  • mitbih_test.csv
  • ptbdb_normal.csv
  • ptbdb_abnormal.csv

Note: The datasets are available from Kaggle.

Usage Examples

📊 Data Analysis

Generate comprehensive data audit reports:

```shell
# Run data audit analysis
python -c "from src.utils.audit_report import generate_data_audit_report; generate_data_audit_report()"
```

📓 Jupyter Notebooks

Explore the data and models interactively:

```shell
# Launch Jupyter notebook
jupyter notebook notebooks/01_data_exploration.ipynb
jupyter notebook notebooks/02_preprocessing.ipynb

# Baseline models
jupyter notebook notebooks/03_A_02_01_baseline_models_randomized_search.ipynb
jupyter notebook notebooks/03_B_02_baseline_models_lazy_classifier.ipynb

# Deep learning models
jupyter notebook notebooks/04_A_02_CNN_models_smote.ipynb
jupyter notebook notebooks/04_B_01_CNN_Transfer.ipynb

# Interpretability
jupyter notebook notebooks/05_A_DL_SHAP.ipynb
jupyter notebook notebooks/05_B_DL_SHAP.ipynb
```

🔧 Development

For development and testing:

```shell
# Run data audit
python src/utils/audit_report.py

# Test visualization utilities
python -c "from src.visualization import plot_heartbeat; import numpy as np; plot_heartbeat(np.random.randn(187))"

# Note: evaluation.py is a module, not a script. Use it in notebooks or import:
# from src.utils import evaluate_model
```

Project Timeline

Step 1: Data Mining & Visualization ✅ COMPLETED

  • Create project structure
  • Set up requirements.txt
  • Write README
  • Load and explore datasets
  • Generate data audit reports with comprehensive analysis
  • Document data quality issues and class imbalance
  • Create visualization utilities for ECG plotting

Step 2: Pre-Processing & Feature Engineering ✅ COMPLETED

  • Handle class imbalance (SMOTE selected as optimal method)
  • Normalize/standardize signals
  • Split train/validation/test sets properly
  • Feature extraction: statistical and frequency domain features
  • Complete preprocessing pipeline implementation
  • Duplicate removal for PTB dataset
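The splitting step above hinges on stratification so class proportions survive into the validation set; a sketch with synthetic imbalanced labels (sizes and proportions below are stand-ins, not the real datasets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 187))
# Heavily imbalanced 5-class labels, loosely mimicking MIT-BIH.
y = rng.choice([0, 1, 2, 3, 4], size=1000, p=[0.8, 0.05, 0.05, 0.05, 0.05])

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(np.bincount(y_tr) / len(y_tr))  # ~ same class proportions as y
```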

Step 3: Baseline Modeling ✅ COMPLETED

  • Baseline models: Multiple algorithms tested
  • RandomizedSearch for initial model comparison
  • GridSearch for hyperparameter optimization
  • Comprehensive model comparison with SMOTE sampling
  • Performance evaluation and results documentation
  • Model persistence and evaluation utilities

Step 4: Advanced Optimization ✅ COMPLETED

  • GridSearchCV for best models
  • Extreme values analysis (RR-Distance analysis)
  • Advanced hyperparameter tuning
  • Model selection and final evaluation

Step 5: Deep Learning Implementation ✅ COMPLETED

  • Deep Learning models: CNN, DNN, LSTM architectures
  • Transfer learning from MIT-BIH to PTB dataset
  • Model optimization with dropout and batch normalization
  • Model interpretability: SHAP values analysis
  • Advanced neural network architectures

Step 6: Final Report & Documentation ✅ COMPLETED

  • Compile all reports
  • Clean, document, and organize code
  • Final report generation
  • Results documentation and comparison with benchmark

Key Considerations

Metrics: Focus on F1-score, precision, and recall (due to class imbalance); also track accuracy and the confusion matrix.
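With imbalanced classes, macro averaging is what keeps precision/recall/F1 sensitive to the rare classes; a toy illustration with made-up predictions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy 3-class labels: class 0 is the majority, as in the ECG data.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

print("accuracy :", accuracy_score(y_true, y_pred))                      # 0.75
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))       # 0.75
print("f1 macro :", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))
```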

Challenges Addressed:

  • ✅ Severe class imbalance in both datasets (solved with SMOTE)
  • ✅ High dimensionality (187 features) - handled by deep learning architectures
  • ✅ Two separate datasets with different objectives (5-class vs 2-class)
  • ✅ Time series nature - handled with appropriate architectures
  • ✅ Model interpretability - addressed with SHAP analysis

Success Criteria:

  • ✅ Beat benchmark performance (exceeded by 2.5-5%)
  • ✅ Robust model with good generalization
  • ✅ Clear, professional reports with business insights
  • ✅ Model interpretability for clinical validation

Results Summary

πŸ† Final Model Performance

MIT-BIH Arrhythmia Classification (5 classes):

  • Best Model: CNN8 (Optimized with dropout and batch normalization)
  • Accuracy: 98.51%
  • Precision: 90.62%
  • Recall: 94.24%
  • F1 Score: 92.36%

PTB Myocardial Infarction Detection (2 classes):

  • Best Model: CNN8 with Transfer Learning (last residual block unfrozen)
  • Accuracy: 98.42%
  • Precision: 97.51%
  • Recall: 98.64%
  • F1 Score: 98.05%

📊 Key Achievements

  1. Exceeded Benchmark Performance

    • MIT-BIH: +5% improvement over benchmark [2]
    • PTB: +2.5% improvement over benchmark [2]
  2. Robust Model Development

    • Comprehensive baseline model comparison
    • Advanced deep learning architectures
    • Transfer learning implementation
  3. Model Interpretability

    • SHAP analysis for both datasets
    • Feature importance visualization
    • Clinically relevant pattern identification
  4. Complete Documentation

    • Comprehensive data audit
    • Detailed model evaluation reports
    • Final project report with methodology and results

Bibliography

Key research articles referenced in this project:

  1. Kachuee, M., Fazeli, S., & Sarrafzadeh, M. (2018). ECG Heartbeat Classification: A Deep Transferable Representation. CoRR. doi: 10.1109

  2. Murat, F., Yildirim, O., Talo, M., Baloglu, U. B., Demir, Y., & Acharya, U. R. (2020). Application of deep learning techniques for heartbeats detection using ECG signals-analysis and review. Computers in Biology and Medicine. doi:10.1016/j.compbiomed.2020.103726

  3. Ansari, Y., Mourad, O., Qaraqe, K., & Serpedin, E. (2023). Deep learning for ECG arrhythmia detection and classification: an overview of progress for period 2017–2023. Frontiers in Physiology. doi: 10.3389/fphys.2023.1246746

For complete bibliography, see the Final Report.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

See CONTRIBUTING.md for detailed guidelines.

License

This project is part of the DataScientest Data Scientist training program.

Contact

For questions about this project, please refer to the DataScientest training materials and instructor support.


Project Team: Christian Meister, Julia Schmidt, Tzu-Jung Huang
Completion Date: November 2025
