A production-grade data and machine learning pipeline that collects, processes, and generates predictions from economic news data published on the MQL5 website.
The architecture integrates Python, PySpark, Airflow, Docker, Kubernetes, GCP, FastAPI, MLflow, MySQL, and JavaScript into a scalable and modular system for automated data workflows.
⚠️ Note:
Certain components of this pipeline are intentionally omitted from this repository for privacy and environment-specific reasons.
The provided modules represent the core production logic and structure.
This diagram outlines the end-to-end flow — from data ingestion to transformation, modeling, tracking, and real-time prediction delivery.
| Layer | Technology | Purpose |
|---|---|---|
| Data Ingestion | Python | Scrape MQL5 economic news data |
| Schema Handling | Python (pandas) | Repair and normalize broken schemas |
| Processing Engine | PySpark | Transform and structure data at scale |
| Database | MySQL | Store cleaned and transformed data |
| Orchestration | Apache Airflow | Automate and schedule pipeline tasks |
| API Layer | FastAPI | Serve models and expose inference endpoints |
| Experiment Tracking | MLflow | Track, compare, and register models |
| Visualization | JavaScript | Render the real-time prediction dashboard |
| Deployment | Docker & Kubernetes (GCP) | Deploy scalable, containerized services in production |
- Scrapes the MQL5 website for economic event and news data.
- Uses Python scripts to extract, format, and store raw data as CSV files.
- Establishes a consistent and traceable data input pipeline.
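A minimal sketch of this ingestion step is shown below. The URL, CSS selector, and field names are illustrative assumptions, not the repository's actual values:

```python
import csv

import requests
from bs4 import BeautifulSoup

CALENDAR_URL = "https://www.mql5.com/en/economic-calendar"  # assumed entry point

def scrape_events(url: str = CALENDAR_URL) -> list[dict]:
    """Fetch the calendar page and extract one record per event row."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for row in soup.select(".ec-table__item"):  # placeholder CSS selector
        events.append({
            "event": row.get_text(strip=True),
            # the real scraper also captures timestamp, currency, actual/forecast values
        })
    return events

def save_raw(events: list[dict], path: str = "raw_events.csv") -> None:
    """Persist raw scraped records as CSV for the downstream stages."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["event"])
        writer.writeheader()
        writer.writerows(events)
```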
- Runs schema validation and correction using a dedicated Python script.
- Fixes missing or misaligned columns, enforces consistent data types, and standardizes field naming.
- Ensures clean and structured data for distributed processing.
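A pandas-based repair pass might look like the following sketch; the expected column list and coercion rules are assumptions standing in for the repository's dedicated script:

```python
import pandas as pd

# Assumed canonical schema; the real script defines the actual columns.
EXPECTED_COLUMNS = ["event_time", "currency", "event", "actual", "forecast"]

def repair_schema(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Standardize field naming: lowercase, snake_case.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Add any missing columns so every file conforms to one schema.
    for col in EXPECTED_COLUMNS:
        if col not in df.columns:
            df[col] = pd.NA
    # Enforce consistent data types, coercing malformed values to NA.
    df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
    df["actual"] = pd.to_numeric(df["actual"], errors="coerce")
    df["forecast"] = pd.to_numeric(df["forecast"], errors="coerce")
    return df[EXPECTED_COLUMNS]
```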
- Utilizes PySpark for distributed data transformation and normalization.
- Processes large datasets efficiently, preparing them for storage and downstream tasks.
- Outputs structured, uniform datasets for transformation and analysis.
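The PySpark stage could be sketched as below; paths, columns, and the derived field are placeholders built on the schema assumed in the previous sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mql5-news-transform").getOrCreate()

# Read the schema-repaired CSVs produced upstream (path is illustrative).
df = spark.read.csv("repaired_events/*.csv", header=True, inferSchema=True)

clean = (
    df.dropDuplicates(["event_time", "currency", "event"])
      .withColumn("currency", F.upper(F.trim(F.col("currency"))))
      .withColumn("surprise", F.col("actual") - F.col("forecast"))  # example derived field
      .filter(F.col("event_time").isNotNull())
)

# Write a structured, uniform dataset for storage and downstream tasks.
clean.write.mode("overwrite").parquet("structured_events/")
```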
- Conducts final data cleanup and feature selection within MySQL.
- Removes redundant fields, applies filters, and stores the refined dataset.
- Produces a high-quality feature set ready for machine learning.
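As a sketch, this cleanup can be driven from Python against MySQL; the connection string, table names, and filter condition are assumptions:

```python
from sqlalchemy import create_engine, text

# Placeholder credentials and database name.
engine = create_engine("mysql+pymysql://user:password@localhost:3306/mql5_news")

with engine.begin() as conn:
    # Keep only the fields needed for modeling and drop incomplete rows.
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS features AS
        SELECT event_time, currency, event, actual, forecast,
               actual - forecast AS surprise
        FROM structured_events
        WHERE actual IS NOT NULL AND forecast IS NOT NULL
    """))
```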
Represents the training, validation, and testing stages of the ML lifecycle.
- Training: Trains models using the prepared MySQL dataset.
- Validation: Assesses model accuracy using performance metrics (MSE, R², etc.).
- Testing: Evaluates the model on unseen data to confirm reliability.
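A compact sketch of these three stages, using a placeholder model, feature set, and target (the production choices are repository-specific):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load the prepared feature table from MySQL (connection string is a placeholder).
df = pd.read_sql("SELECT * FROM features",
                 "mysql+pymysql://user:password@localhost:3306/mql5_news")

X = df[["forecast"]]  # placeholder feature set
y = df["actual"]      # placeholder target

# Hold out unseen data for the testing stage.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Validation/testing metrics reported by the pipeline.
preds = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, preds))
print("R² :", r2_score(y_test, preds))
```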
A FastAPI service exposes REST endpoints to trigger training, validation, and prediction.
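For illustration, a minimal version of that service might look like this; the endpoint paths, request schema, and stand-in prediction logic are assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="MQL5 News Model API")

class PredictRequest(BaseModel):
    forecast: float  # placeholder input feature

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    # In production, the registered MLflow model would be loaded and applied here.
    prediction = req.forecast  # stand-in for model.predict(...)
    return {"prediction": prediction}

@app.post("/train")
def train() -> dict:
    # Would launch the training job described above, e.g. as a background task.
    return {"status": "training started"}
```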
- Uses MLflow to record metrics, parameters, and artifacts for every experiment.
- Manages all model versions through the MLflow Model Registry.
- Enables experiment reproducibility and controlled production rollout.
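A sketch of the tracking pattern; the experiment name, parameters, and registry name are placeholders, and model registration requires a registry-enabled tracking server:

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor

# Toy model so the snippet is self-contained; the real pipeline passes in
# the estimator fitted during training.
model = RandomForestRegressor(n_estimators=200).fit([[0.0], [1.0]], [0.0, 1.0])

mlflow.set_experiment("mql5-news")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("mse", 0.042)  # metric from the validation stage
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="mql5-news-regressor",  # Model Registry entry
    )
```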
- A JavaScript dashboard visualizes live predictions and key metrics in real time.
- Communicates with the FastAPI endpoints to stream results and monitor performance.
- Provides actionable insights for economic data and event analysis.
- Each stage of the pipeline is Dockerized for environment consistency.
- Deployed on Kubernetes (GCP) for scaling, load balancing, and reliability.
- Airflow orchestrates retraining, monitoring, and periodic updates.
- Designed for modular scaling — each component operates independently in production.
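A skeletal DAG illustrating how Airflow ties the stages together; the task callables, IDs, and daily schedule are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Stubs standing in for the pipeline's real entry points.
def scrape(): ...
def repair_schema(): ...
def spark_transform(): ...
def retrain_model(): ...

with DAG(
    dag_id="mql5_news_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # assumed cadence
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="scrape", python_callable=scrape)
    t2 = PythonOperator(task_id="repair_schema", python_callable=repair_schema)
    t3 = PythonOperator(task_id="spark_transform", python_callable=spark_transform)
    t4 = PythonOperator(task_id="retrain_model", python_callable=retrain_model)
    t1 >> t2 >> t3 >> t4
```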
The MQL5 Economic News Data Pipeline 2025 delivers a scalable, modular, and automated production pipeline for financial and economic data.
It unifies the full ML lifecycle — ingestion, schema repair, distributed processing, model training, versioning, and deployment — in a robust, cloud-native environment.
This repository serves as a reference architecture and implementation baseline for enterprise-grade ML systems focused on automation, reproducibility, and performance.
A full playlist walkthrough explaining this pipeline — including architecture, components, and workflow execution — will be uploaded to Big Data Brain (@bdb5905) on YouTube.
Subscribe to the channel to get notified when it goes live and for more content on Big Data, Machine Learning Pipelines, and Production Systems.
