📈 Big Data Tick Data ML Trading Pipeline — Jan 2026

1️⃣ System Overview

This project implements a full-stack quantitative ML trading architecture built around:

High-frequency tick ingestion
Distributed data processing
Lakehouse storage (Iceberg + S3 + Postgres catalog)
Model lifecycle management
Backtesting & walk-forward validation
MT5 deployment for live/demo execution
Experiment tracking

Each layer is a separate module, so individual components can be scaled or replaced without changing the rest of the pipeline.

2️⃣ Data Source Layer

2.1 Dukascopy Website

Source: Historical tick data
Format: Raw CSV
Granularity: Sub-second

Responsibilities:

Download immutable historical tick data
Maintain original dataset integrity
Provide reproducible base data
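
One way to back the integrity goal is to record a checksum for every downloaded file and re-check it before any reprocessing. The following is a minimal sketch, not the project's actual tooling; the function names (`sha256_of`, `build_manifest`, `changed_files`) and the assumption that raw files end in `.csv` under a single directory are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir: Path) -> dict[str, str]:
    """Map each CSV under raw_dir (as a relative POSIX path) to its digest."""
    return {
        p.relative_to(raw_dir).as_posix(): sha256_of(p)
        for p in sorted(raw_dir.rglob("*.csv"))
    }


def changed_files(raw_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are missing or whose digest no longer matches."""
    changed = []
    for rel, expected in manifest.items():
        path = raw_dir / rel
        if not path.is_file() or sha256_of(path) != expected:
            changed.append(rel)
    return changed
```

Storing the manifest next to the raw files lets a later run confirm that the base data is byte-for-byte what was originally downloaded.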

3️⃣ Data Replication Layer

3.1 Local → S3 Transfer (NiFi)

Large finalized datasets are transferred:

From local file system
To S3 object storage
Using Apache NiFi

NiFi handles:

Transfer orchestration
Backpressure control
Flow monitoring
Reliability

This layer gives the lakehouse a reliable, monitored ingestion path.

4️⃣ Raw CSV Storage

Raw tick data is stored unmodified in CSV format.

Purpose:

Immutable recovery layer
Enables reprocessing
Protects against transformation errors
Maintains auditability

No transformations occur at this stage.

5️⃣ Data Processing Layer

5.1 Python Conversion Stage

Ingest raw CSV
Convert CSV to Parquet
Save as partitioned datasets

Benefits of partitioning:

Faster reads
Efficient Spark processing
Scalable storage
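
The partitioning scheme itself is not specified here. As a hedged illustration, the sketch below derives a Hive-style partition directory from each tick, assuming partitions by symbol and UTC calendar day and assuming ticks arrive as `(symbol, epoch_ms, bid, ask)` tuples. Both the partition keys and the tuple layout are assumptions, and the actual Parquet write is left to whichever library the conversion stage uses.

```python
from datetime import datetime, timezone


def partition_path(symbol: str, epoch_ms: int) -> str:
    """Build a Hive-style partition directory from a symbol and a UTC epoch-ms timestamp."""
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return f"symbol={symbol}/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}"


def group_by_partition(ticks):
    """Group (symbol, epoch_ms, bid, ask) tuples by their partition directory."""
    groups = {}
    for tick in ticks:
        symbol, epoch_ms = tick[0], tick[1]
        groups.setdefault(partition_path(symbol, epoch_ms), []).append(tick)
    return groups
```

With a layout like this, a reader that filters on symbol and date only has to open the matching directories, which is where the faster reads come from.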

5.2 Spark Processing

Apache Spark performs:

Large-scale cleaning
Sorting
Missing value handling
Timestamp normalization
Feature preparation

Output: ML-ready datasets
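
To make the cleaning steps concrete, here is a small sketch of the same logic in plain Python rather than Spark, purely for illustration. The row layout (`ts`, `bid`, `ask`), the choice of forward-fill for missing quotes, the removal of exact duplicates, and the rule that naive timestamps are already UTC are all assumptions, not details taken from the project.

```python
from datetime import datetime, timezone


def normalize_timestamp(raw: str) -> datetime:
    """Parse an ISO-8601 timestamp and convert it to UTC; naive values are assumed to be UTC."""
    ts = datetime.fromisoformat(raw)
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def clean_ticks(rows):
    """Normalize timestamps, sort, forward-fill missing quotes, and drop exact duplicates.

    rows: iterable of dicts with keys "ts" (str), "bid" and "ask" (float or None).
    """
    parsed = [
        {"ts": normalize_timestamp(r["ts"]), "bid": r["bid"], "ask": r["ask"]}
        for r in rows
    ]
    parsed.sort(key=lambda r: r["ts"])

    cleaned, seen = [], set()
    last_bid = last_ask = None
    for r in parsed:
        bid = r["bid"] if r["bid"] is not None else last_bid
        ask = r["ask"] if r["ask"] is not None else last_ask
        if bid is None or ask is None:
            continue  # no earlier quote to forward-fill from
        key = (r["ts"], bid, ask)
        if key in seen:
            continue  # exact duplicate tick
        seen.add(key)
        cleaned.append({"ts": r["ts"], "bid": bid, "ask": ask})
        last_bid, last_ask = bid, ask
    return cleaned
```

In Spark the same steps would be expressed as DataFrame operations over partitions; the ordering (normalize, sort, then fill) is what matters, since a forward-fill over unsorted ticks would copy quotes from the wrong moment.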

6️⃣ Data Version Control

Final datasets are versioned using:

DVC (Data Version Control)

Benefits:

Reproducibility
Dataset lineage
Controlled experimentation
Rollback capability

7️⃣ Data Lakehouse Architecture

The lakehouse is the central storage and analytics layer.

7.1 Storage Components

S3 → Stores Parquet data files
Postgres → Stores Iceberg table metadata
Apache Iceberg → Table format layer

Iceberg provides:

ACID guarantees
Schema evolution
Snapshot isolation
Time travel

7.2 Lakehouse Tables

Structured tables are created on top of S3 data using Iceberg.

These tables represent:

Gold datasets (ML-ready)

7.3 Query Engine

Trino is used as the query engine.

Capabilities:

SQL analytics on Iceberg tables
Distributed query execution
Research queries for feature engineering
Dataset validation before ML usage
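
A validation query of the kind listed above might look like the sketch below, which builds a Trino SQL statement summarizing row count, time range, and null quotes. The table name `iceberg.gold.ticks` and the columns `ts`, `bid`, and `ask` are hypothetical; the optional `FOR VERSION AS OF` clause is Trino's syntax for reading a specific Iceberg snapshot.

```python
def validation_query(table: str, snapshot_id=None) -> str:
    """Build a SQL statement that summarizes row count, time range, and null quotes.

    If snapshot_id is given, the query reads that Iceberg snapshot
    through Trino's FOR VERSION AS OF clause.
    """
    # Accept only dotted identifiers so the name can be interpolated safely.
    if not all(part.isidentifier() for part in table.split(".")):
        raise ValueError(f"unexpected table name: {table!r}")
    source = table
    if snapshot_id is not None:
        source += f" FOR VERSION AS OF {int(snapshot_id)}"
    return (
        "SELECT count(*) AS row_count, "
        "min(ts) AS first_tick, "
        "max(ts) AS last_tick, "
        "count_if(bid IS NULL OR ask IS NULL) AS null_quotes "
        f"FROM {source}"
    )


# Hypothetical table name, shown only to illustrate the call.
print(validation_query("iceberg.gold.ticks"))
```

Pinning a snapshot ID in the query makes a validation result reproducible even after the table receives new data.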

8️⃣ Machine Learning Pipeline (FastAPI UI)

The ML system follows a strict lifecycle.

8.1 Model Core

Anomaly / pattern detection models
Consumes gold datasets
Produces trading signals
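
The specific anomaly model is not described here. As a stand-in, the sketch below uses a rolling z-score, one of the simplest anomaly detectors, to show how a stream of prices can be turned into discrete signals. The window size, threshold, the mean-reversion mapping (high outlier to SELL, low outlier to BUY), and the function name are all illustrative assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def zscore_signals(mid_prices, window=5, threshold=2.0):
    """Yield 'BUY', 'SELL', or 'HOLD' for each price based on a rolling z-score.

    A price far above the trailing mean is treated as a sell signal and a
    price far below it as a buy signal (a mean-reversion assumption).
    """
    history = deque(maxlen=window)
    for price in mid_prices:
        signal = "HOLD"
        if len(history) == window:
            mu, sigma = mean(history), pstdev(history)
            if sigma > 0:
                z = (price - mu) / sigma
                if z > threshold:
                    signal = "SELL"
                elif z < -threshold:
                    signal = "BUY"
        # The current price joins the window only after it has been scored.
        history.append(price)
        yield signal
```

Scoring each price against the window before appending it keeps the detector free of lookahead: the statistics never include the value being judged.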

8.2 Training Phase

Learns model parameters
Uses historical tick datasets
Logs metrics

8.3 Validation Phase

Evaluates on held-out data not used for training
Detects overfitting
Tunes hyperparameters

8.4 Testing Phase

Final performance validation
Simulates real-world behavior
Ensures production readiness

Only fully validated models proceed to deployment.
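
For tick data, the three phases are usually fed by a chronological split, so that validation and test data always come after the training data in time. The sketch below assumes the rows are already sorted by timestamp; the 70/15/15 proportions and the function name are illustrative assumptions, not the project's settings.

```python
def chronological_split(rows, train_frac=0.7, val_frac=0.15):
    """Split time-ordered rows into train, validation, and test sets without shuffling."""
    if not 0 < train_frac < 1 or not 0 < val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    n = len(rows)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    # Slices keep the original order: train < validation < test in time.
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]
```

A random shuffle would leak future ticks into training, which tends to inflate validation scores on time series.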

8.5 Live Endpoint API

Exposes trained model via FastAPI
Accepts live tick data
Returns structured trade signals
Used by MT5 and trading systems
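
The request and response shapes are not documented here, so the following is only a sketch of what the endpoint's contract could look like, written with standard-library dataclasses to stay dependency-free. The field names, the `handle_tick` function, and the `model` callable returning `(action, confidence)` are all hypothetical; in the real service this logic would sit behind a FastAPI route, which is omitted.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Tick:
    symbol: str
    ts_ms: int      # UTC epoch milliseconds
    bid: float
    ask: float


@dataclass(frozen=True)
class TradeSignal:
    symbol: str
    ts_ms: int
    action: str     # "BUY", "SELL", or "HOLD"
    confidence: float


def handle_tick(tick: Tick, model) -> dict:
    """Validate a tick, score it with the model, and return a JSON-ready signal."""
    if tick.bid <= 0 or tick.ask <= 0 or tick.ask < tick.bid:
        raise ValueError("invalid quote")
    action, confidence = model(tick)
    if action not in {"BUY", "SELL", "HOLD"}:
        raise ValueError(f"unknown action: {action!r}")
    return asdict(TradeSignal(tick.symbol, tick.ts_ms, action, float(confidence)))
```

Rejecting malformed quotes at the boundary keeps bad ticks from reaching the model, and a fixed set of actions gives the MT5 side a small, predictable vocabulary to act on.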

9️⃣ Symbol Backtest & Trading System (Optional)

This layer enables strategy evaluation before live deployment.

9.1 MT5 Historical Backtest

Uses MT5 historical API
Tests model-generated signals

9.2 Walk Forward Validation

Rolling window evaluation
Mimics production retraining cycles
Reduces overfitting to a single market regime
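
Rolling-window evaluation can be sketched as a generator of index ranges: train on one window, test on the window immediately after it, then slide both forward. The function name and parameters below are illustrative, and the window sizes are whatever the retraining cycle calls for.

```python
def walk_forward_windows(n_samples, train_size, test_size, step=None):
    """Yield (train_range, test_range) index ranges for rolling walk-forward evaluation.

    Each test window starts right after its training window, and the
    pair slides forward by `step` samples (default: test_size).
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("window sizes must be positive")
    step = test_size if step is None else step
    if step <= 0:
        raise ValueError("step must be positive")
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# Example: 10 samples, train on 4, test on the next 2, slide by 2.
for train, test in walk_forward_windows(10, 4, 2):
    print(list(train), list(test))
```

Because every test window lies strictly after its training window, each fold imitates a model that was retrained and then traded on data it had not seen.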

9.3 Deployment with Model

After passing walk-forward validation:

Model integrated into trading execution layer

🔟 Deployment Layer

10.1 MT5 Deployment

Models deployed to MetaTrader 5
Signals streamed from Live Endpoint API

10.2 Live / Demo Execution

Automated trade execution
Demo or live mode
Execution logic separated from model logic
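
The separation between execution and model logic can be illustrated with an executor that only ever sees a symbol and an action, never the model. This is a sketch under assumptions: the `Order` fields, the `Executor` class, the default volume, and the injected `send` callable (standing in for a broker adapter) are all hypothetical, and no real MT5 call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    symbol: str
    side: str       # "BUY" or "SELL"
    volume: float
    mode: str       # "demo" or "live"


class Executor:
    """Turns signals into orders; knows nothing about how signals are produced."""

    def __init__(self, mode: str = "demo", volume: float = 0.01, send=None):
        if mode not in {"demo", "live"}:
            raise ValueError("mode must be 'demo' or 'live'")
        self.mode = mode
        self.volume = volume
        self.sent = []
        # `send` is an injected callable; by default orders are only recorded.
        self._send = send or self.sent.append

    def on_signal(self, symbol: str, action: str):
        """Create and dispatch an order for BUY/SELL; ignore anything else."""
        if action not in {"BUY", "SELL"}:
            return None
        order = Order(symbol, action, self.volume, self.mode)
        self._send(order)
        return order
```

Swapping the `send` callable is then the only change needed to move between a recording stub, a demo account, and a live account, while the model code stays untouched.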

1️⃣1️⃣ Model & Experiment Tracking

MLflow

Tracks:

Model versions
Hyperparameters
Training metrics
Artifacts

Provides:

Full audit trail
Experiment comparison
Reproducibility

1️⃣2️⃣ Design Principles

Reproducibility: Raw data preserved
Scalability: Distributed processing with Spark
Separation of Concerns: Data, ML, and execution isolated
Observability: Monitoring at all stages
Flexibility: Components can be upgraded independently

1️⃣3️⃣ Future Enhancements

Planned improvements:

Alternative data sources
Advanced feature engineering
Improved model architectures
Risk management layers
Enhanced monitoring dashboards

This architecture represents a structured, evolving ML trading system built for continuous research and production deployment.

YouTube Channel: https://youtu.be/igRmHdnu1u0


