Emotion Detection MLOps Project

Introduction

This project is a complete end-to-end MLOps pipeline for an Emotion Detection system built using FastAPI, Scikit-learn, and NLTK. It leverages modern DevOps and MLOps practices to ensure reproducibility, scalability, and automation. The application detects emotions in text and provides an API endpoint for integration.

The system is designed with CI/CD, model registry, automated training, experiment tracking, and deployment pipelines, making it production-ready.

📌 Sample Image

Here's the FastAPI interface of the Emotion Detection system:

[Screenshot: FastAPI interface]


⚡ Tech Stack

  • Machine Learning: Scikit-learn, NLTK
  • API Framework: FastAPI (Cookiecutter template)
  • Experiment Tracking & Model Registry: MLflow with DagsHub integration
  • Data Versioning: DVC (backed by AWS S3)
  • CI/CD: GitHub Actions
  • Deployment: Docker + AWS EC2
  • Container Registry: DockerHub
  • Secrets Management: GitHub Secrets

🛠️ Key Features

  • Data Versioning with DVC → Stores datasets and models in AWS S3 for reproducibility.
  • Experiment Tracking → All experiments tracked with MLflow + DagsHub UI.
  • Model Registry → Automatic promotion of the best model to production.
  • Hyperparameter Tuning → Automated tuning logged in MLflow.
  • Dockerized Application → Ensures consistent deployment across environments.
  • CI/CD Pipeline → GitHub Actions pipeline automates testing, training, building, and deployment.
  • Cloud Deployment → Hosted on AWS EC2, pulling the latest Docker images from DockerHub.

🔄 MLOps Workflow

  1. Data & Model Management

    • Datasets stored and versioned with DVC in AWS S3.
    • Preprocessing and feature engineering pipelines tracked.
  2. Experimentation

    • MLflow logs training runs (accuracy, precision, recall, etc.).
    • Hyperparameter tuning experiments stored in MLflow.
  3. Model Registry

    • Best-performing models are automatically promoted using a promotion script (see the sketch after this list).
    • Registry maintained in MLflow (via DagsHub).
  4. CI/CD Pipeline (GitHub Actions)

    • Runs unit tests for ML models and FastAPI endpoints.
    • Executes DVC repro to rebuild pipelines if data/code changes.
    • Builds Docker image and pushes to DockerHub.
    • Deploys containerized app to AWS EC2.
  5. Deployment

    • Application runs on FastAPI with Uvicorn.
    • Served via Docker container on AWS EC2 (Ubuntu).
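
As a rough illustration of the promotion step in item 3, the snippet below moves the best registered version to the Production stage via the MLflow client. This is a sketch only: the model name "emotion_model" and the metric key "accuracy" are assumptions, not the project's actual identifiers.

```python
# Hypothetical promotion sketch: pick the best registered version by a logged
# metric and move it to the "Production" stage. Model/metric names are assumed.
from mlflow.tracking import MlflowClient

client = MlflowClient()  # uses the configured MLflow tracking/registry URI (e.g. DagsHub)

versions = client.search_model_versions("name='emotion_model'")
best = max(versions, key=lambda v: client.get_run(v.run_id).data.metrics.get("accuracy", 0.0))

client.transition_model_version_stage(
    name="emotion_model",
    version=best.version,
    stage="Production",
    archive_existing_versions=True,  # demote whatever was previously in Production
)
print(f"Promoted version {best.version} to Production")
```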

📂 Project Highlights

Data Version Control (DVC)

  • Stores large files (datasets, models) in AWS S3.
  • Ensures reproducibility across environments.
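
As a sketch of what this enables, a DVC-tracked file can be read straight from the S3-backed remote through DVC's Python API. The dataset path below is an assumed example, not necessarily the repository's actual layout.

```python
# Minimal sketch, assuming a tracked file at data/raw/emotions.csv; any git
# commit or tag passed as `rev` pins the exact data version stored in S3.
import dvc.api

text = dvc.api.read(
    "data/raw/emotions.csv",
    repo="https://github.com/AbhaySingh71/MLops-emotion-detection",
    rev="main",
)
print(text[:200])
```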

MLflow + DagsHub

  • Tracks experiments, metrics, and artifacts.
  • Centralized model registry for production models.
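
A minimal logging sketch, assuming the usual DagsHub MLflow URI pattern, credentials supplied via environment variables, and an illustrative experiment name and toy model:

```python
# Sketch only: point MLflow at the DagsHub-hosted tracking server and log a run.
# The URI pattern and experiment name are assumptions; credentials are expected
# in MLFLOW_TRACKING_USERNAME / MLFLOW_TRACKING_PASSWORD (e.g. the DAGSHUB_PAT).
import mlflow
import mlflow.sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

mlflow.set_tracking_uri("https://dagshub.com/AbhaySingh71/MLops-emotion-detection.mlflow")
mlflow.set_experiment("emotion-detection")

texts = ["I am very happy today!", "This makes me so angry", "Best day ever", "I hate waiting"]
labels = ["joy", "anger", "joy", "anger"]  # toy data for illustration

with mlflow.start_run():
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    mlflow.log_param("vectorizer", "tfidf")
    mlflow.log_metric("train_accuracy", model.score(texts, labels))
    mlflow.sklearn.log_model(model, "model")
```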

Hyperparameter Tuning

  • Automated tuning with MLflow logging.
  • Compares experiments for performance improvements.
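
One way this pattern can look (a sketch, not the project's actual tuning script) is GridSearchCV wrapped in MLflow autologging, which records a child run per parameter candidate so experiments can be compared side by side:

```python
# Illustrative sketch: autolog a small grid search; each candidate becomes a
# child run in MLflow. The grid and toy data are assumptions for illustration.
import mlflow
import mlflow.sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = ["I am very happy today!", "Best day ever", "This makes me so angry", "I hate waiting"]
labels = ["joy", "joy", "anger", "anger"]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}  # assumed grid, not the project's actual search space

mlflow.sklearn.autolog(max_tuning_runs=3)
with mlflow.start_run(run_name="tfidf-logreg-grid"):
    search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
    search.fit(texts, labels)
    print("best params:", search.best_params_)
```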

CI/CD Pipeline

  • Tests → Build → Push → Deploy fully automated with GitHub Actions.
  • Uses appleboy/ssh-action for secure AWS deployment.

Docker & AWS

  • Dockerized FastAPI app.
  • Auto-pulled from DockerHub into AWS EC2 instance.
  • EC2 port 80 is mapped to the container's FastAPI port 8000.

🚀 Deployment Flow

  1. Push code → GitHub Actions triggers.
  2. Run tests + DVC pipeline.
  3. Log experiments to MLflow.
  4. Build and push Docker image → DockerHub.
  5. SSH into EC2 → Pull image & restart container.
  6. Application live at http://<EC2-IP>.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

🧪 Testing

🔹 1. Data Validation Tests

  • Input Format Checks – Verify that the raw text data is non-empty, correctly encoded (UTF-8), and free of invalid symbols.
  • Class Distribution Tests – Ensure that training/validation splits have balanced representation of emotion classes (e.g., joy, anger, sadness, fear, surprise, neutral).
  • Text Cleaning Functions – Unit tests for stopword removal, lemmatization, and tokenization functions using NLTK.
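
For example, a unit test for a stopword-removal helper could look like the sketch below; `remove_stopwords` is a hypothetical stand-in for the project's real cleaning function.

```python
# Hypothetical pytest sketch for an NLTK-based stopword-removal helper.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # make sure the corpus is available in CI


def remove_stopwords(text: str) -> str:
    """Stand-in for the project's actual cleaning utility."""
    stop = set(stopwords.words("english"))
    return " ".join(w for w in text.split() if w.lower() not in stop)


def test_remove_stopwords_drops_common_words():
    cleaned = remove_stopwords("I am very happy today")
    assert "am" not in cleaned.split()
    assert "happy" in cleaned.split()
```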

🔹 2. Model Testing

  • Prediction Shape & Type – Check that the model outputs valid emotion labels with probability scores.
  • Confidence Thresholding – Ensure probabilities sum to 1 (softmax check) and exceed a minimum confidence level.
  • Overfitting Check – Validate that the gap between training and validation accuracy stays within tolerance.
  • Hyperparameter Sensitivity Tests – Re-run the model with different parameter sets (tracked in MLflow) to verify stability of results.
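
A hedged sketch of the first two checks against a scikit-learn pipeline (the label set and toy data are illustrative, not the project's real fixtures):

```python
# Illustrative check: predictions are known emotion labels and predict_proba
# rows sum to 1 within floating-point tolerance.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

EMOTIONS = {"joy", "anger", "sadness", "fear", "surprise", "neutral"}

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(["so happy", "great news", "this is awful", "I am furious"],
          ["joy", "joy", "anger", "anger"])  # toy data for illustration

samples = ["I am very happy today!", "This is terrible"]
preds = model.predict(samples)
probs = model.predict_proba(samples)

assert all(p in EMOTIONS for p in preds)
assert np.allclose(probs.sum(axis=1), 1.0)
# A minimum-confidence threshold could be asserted similarly, e.g. on probs.max(axis=1).
```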

🔹 3. API Testing (FastAPI)

Unit API Tests

  • Verify /predict returns 200 OK.
  • Validate correct schema in response:
{
  "text": "I am very happy today!",
  "prediction": "joy",
  "confidence": 0.94
}
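
A sketch of such a test with FastAPI's TestClient; the `app.main` import path and the `/predict` response keys are assumptions about the project layout.

```python
# Hypothetical test sketch: exercise /predict in-process with FastAPI's TestClient.
from fastapi.testclient import TestClient

from app.main import app  # adjust to the module that actually defines `app`

client = TestClient(app)


def test_predict_returns_valid_schema():
    resp = client.post("/predict", json={"text": "I am very happy today!"})
    assert resp.status_code == 200
    body = resp.json()
    assert isinstance(body["prediction"], str)
    assert 0.0 <= body["confidence"] <= 1.0
```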


Run tests locally:

```bash
pytest tests/
```

📌 DAG Workflow

Here's the DAG representation of the pipeline:

[DAG workflow diagram]

📦 Docker

Build and run locally:

docker build -t emotion_detection:latest .
docker run -p 8000:8000 emotion_detection:latest

📜 API Usage

The FastAPI app will be available at:

http://<EC2-IP>:80/docs

Example request:

{
  "text": "I am very happy today!"
}

Response:

{
  "emotion": "joy"
}
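
For example, the endpoint can be called from Python; the `/predict` path is an assumption, so adjust it to the routes shown at /docs.

```python
# Illustrative client call; replace <EC2-IP> with the instance's public IP.
import requests

resp = requests.post(
    "http://<EC2-IP>/predict",
    json={"text": "I am very happy today!"},
    timeout=10,
)
print(resp.json())
```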

🔑 Secrets (GitHub)

  • DAGSHUB_PAT → Access token for MLflow + DagsHub
  • DOCKER_HUB_USERNAME → DockerHub username
  • DOCKER_HUB_ACCESS_TOKEN → DockerHub token
  • EC2_HOST → AWS EC2 public IP/domain
  • EC2_USER → EC2 username (e.g., ubuntu)
  • EC2_SSH_KEY → SSH private key

🎯 Future Improvements

  • Add monitoring with Prometheus + Grafana
  • Add canary deployments for safer model rollouts
  • Automate data drift detection
  • Integrate with Kubernetes for scaling

🀝 Contributing

Feel free to fork the repo, raise issues, and submit PRs.


📝 License

This project is licensed under the MIT License.


📧 Contact

Author: Abhay Singh
GitHub: AbhaySingh71
Email: [email protected]
