This project implements a full-stack quantitative ML trading architecture built around:
- High-frequency tick ingestion
- Distributed data processing
- Lakehouse storage (Iceberg + S3 + Postgres catalog)
- Model lifecycle management
- Backtesting & walk-forward validation
- MT5 deployment for live/demo execution
- Experiment tracking
The system is modular, scalable, and production-oriented.
Source: Historical tick data
Format: Raw CSV
Granularity: Sub-second
Responsibilities:
- Download immutable historical tick data
- Maintain original dataset integrity
- Provide reproducible base data
Large finalized datasets are transferred:
- From local file system
- To S3 object storage
- Using Apache NiFi
NiFi handles:
- Transfer orchestration
- Backpressure control
- Flow monitoring
- Reliability
This layer ensures stable movement into the lakehouse.
Raw tick data is stored unmodified in CSV format.
Purpose:
- Immutable recovery layer
- Enables reprocessing
- Protects against transformation errors
- Maintains auditability
No transformations occur at this stage.
The next processing stage:
- Ingests raw CSV
- Converts CSV to Parquet
- Saves the output as partitioned datasets
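A minimal sketch of the partitioning step, assuming each tick carries an epoch-millisecond timestamp and a symbol column (column names are illustrative). A real pipeline would write each bucket as a Parquet file with Spark or pyarrow rather than hold it in memory:

```python
import csv
import io
from datetime import datetime, timezone

def partition_path(ts_ms: int, symbol: str) -> str:
    """Derive a Hive-style partition path (symbol/date/hour) for one tick."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return f"symbol={symbol}/date={dt:%Y-%m-%d}/hour={dt:%H}"

def bucket_ticks(raw_csv: str) -> dict:
    """Group raw CSV rows by target partition; the real pipeline would then
    write each bucket as a Parquet file under that path."""
    buckets = {}
    for row in csv.DictReader(io.StringIO(raw_csv)):
        key = partition_path(int(row["ts_ms"]), row["symbol"])
        buckets.setdefault(key, []).append(row)
    return buckets

raw = (
    "ts_ms,symbol,bid,ask\n"
    "1700000000000,EURUSD,1.0668,1.0669\n"
    "1700003600000,EURUSD,1.0671,1.0672\n"
)
print(sorted(bucket_ticks(raw)))
```

Partitioning by symbol/date/hour is what makes later Spark reads prune whole directories instead of scanning every file.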
Benefits of partitioning:
- Faster reads
- Efficient Spark processing
- Scalable storage
Apache Spark performs:
- Large-scale cleaning
- Sorting
- Missing value handling
- Timestamp normalization
- Feature preparation
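The cleaning steps above can be illustrated on a single node; in production they run as distributed Spark DataFrame operations, and the column names here are assumptions:

```python
from datetime import datetime, timezone

def clean_ticks(ticks: list) -> list:
    """Single-node version of the cleaning logic Spark applies at scale."""
    # Sort by timestamp and drop exact duplicate rows.
    seen = set()
    out = []
    for t in sorted(ticks, key=lambda t: t["ts_ms"]):
        key = (t["ts_ms"], t.get("bid"), t.get("ask"))
        if key in seen:
            continue
        seen.add(key)
        out.append(dict(t))
    # Forward-fill missing bid/ask values from the previous tick.
    last = {}
    for t in out:
        for col in ("bid", "ask"):
            if t.get(col) is None:
                t[col] = last.get(col)
            else:
                last[col] = t[col]
    # Normalize epoch-millisecond timestamps to UTC ISO-8601 strings.
    for t in out:
        t["ts"] = datetime.fromtimestamp(
            t.pop("ts_ms") / 1000, tz=timezone.utc
        ).isoformat()
    return out

ticks = [
    {"ts_ms": 1700000001000, "bid": None, "ask": 1.0670},
    {"ts_ms": 1700000000000, "bid": 1.0668, "ask": 1.0669},
    {"ts_ms": 1700000000000, "bid": 1.0668, "ask": 1.0669},  # duplicate row
]
print(clean_ticks(ticks))
```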
Output: ML-ready datasets
Final datasets are versioned using DVC (Data Version Control).
Benefits:
- Reproducibility
- Dataset lineage
- Controlled experimentation
- Rollback capability
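A typical DVC workflow for these benefits might look like the following; the dataset path and commit reference are illustrative:

```shell
# Track a finalized dataset with DVC (path is illustrative).
dvc add data/gold/ticks.parquet
# DVC writes a small .dvc pointer file; commit it to git for lineage.
git add data/gold/ticks.parquet.dvc .gitignore
git commit -m "Version gold tick dataset"
# Push the actual data to the configured remote (e.g. S3).
dvc push
# Rollback: check out an older pointer file, then restore the matching data.
git checkout <old-commit> -- data/gold/ticks.parquet.dvc
dvc checkout data/gold/ticks.parquet.dvc
```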
The lakehouse is the central storage and analytics layer.
- S3 → Stores Parquet data files
- Postgres → Stores Iceberg table metadata
- Apache Iceberg → Table format layer
Iceberg provides:
- ACID guarantees
- Schema evolution
- Snapshot isolation
- Time travel
Structured tables are created on top of S3 data using Iceberg.
These tables represent the gold (ML-ready) datasets.
Trino is used as the query engine.
Capabilities:
- SQL analytics on Iceberg tables
- Distributed query execution
- Research queries for feature engineering
- Dataset validation before ML usage
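For example, Iceberg's time travel is exposed through Trino's `FOR TIMESTAMP AS OF` clause. The catalog, schema, table, and column names below are hypothetical:

```python
# Build a Trino time-travel query against an Iceberg table snapshot.
# Table and column names are placeholders, not from the document.
def snapshot_query(table: str, as_of_utc: str) -> str:
    return (
        f"SELECT symbol, ts, bid, ask FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{as_of_utc} UTC'"
    )

sql = snapshot_query("iceberg.gold.ticks", "2024-01-01 00:00:00")
print(sql)
```

Querying a fixed snapshot this way is one way to validate exactly the dataset version a model was trained on.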
The ML system follows a strict lifecycle.
Modeling:
- Anomaly / pattern detection models
- Consumes gold datasets
- Produces trading signals
Training:
- Learns model parameters
- Uses historical tick datasets
- Logs metrics
Validation:
- Evaluates on unseen data
- Detects overfitting
- Tunes hyperparameters
Testing:
- Final performance validation
- Simulates real-world behavior
- Ensures production readiness
Only fully validated models proceed to deployment.
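The overfitting check in the validation stage can be illustrated with a simple train/validation gap rule; the 0.05 threshold is an assumption, not from the document:

```python
def overfit_gap(train_score: float, val_score: float, tol: float = 0.05) -> bool:
    """Flag a model whose training score exceeds its validation score by more
    than `tol` -- a simple proxy for an overfitting check.
    The default threshold is illustrative only."""
    return (train_score - val_score) > tol

print(overfit_gap(0.91, 0.62))  # large train/validation gap: likely overfit
print(overfit_gap(0.74, 0.72))  # small gap: acceptable
```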
Live endpoint API:
- Exposes the trained model via FastAPI
- Accepts live tick data
- Returns structured trade signals
- Used by MT5 and trading systems
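The endpoint's tick-in, signal-out contract can be sketched as a pure function; in the real system it would back a FastAPI route. The request shape, the `fair_value` field, and the threshold rule standing in for the trained model are all assumptions:

```python
# Hypothetical request/response shapes for the live endpoint; in the real
# system this function would back a FastAPI route (e.g. POST /signal).
def predict_signal(tick: dict) -> dict:
    """Turn one live tick into a structured trade signal.
    The mid-price threshold rule is a stand-in for the trained model."""
    mid = (tick["bid"] + tick["ask"]) / 2
    action = "buy" if mid < tick.get("fair_value", mid) else "hold"
    return {"symbol": tick["symbol"], "action": action, "mid": round(mid, 5)}

print(predict_signal(
    {"symbol": "EURUSD", "bid": 1.0668, "ask": 1.0670, "fair_value": 1.0700}
))
```

Keeping the model behind a structured request/response boundary is what lets MT5 and other consumers stay decoupled from model internals.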
This layer enables strategy evaluation before live deployment.
Backtesting:
- Uses the MT5 historical API
- Tests model-generated signals
Walk-forward validation:
- Rolling-window evaluation
- Mimics production retraining cycles
- Reduces regime overfitting
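Rolling-window evaluation can be sketched as an index-window generator; the window sizes below are illustrative:

```python
def walk_forward_windows(n: int, train: int, test: int, step: int):
    """Yield (train_range, test_range) index pairs for walk-forward
    validation: fit on one window, evaluate on the next, then roll forward."""
    start = 0
    while start + train + test <= n:
        yield (
            range(start, start + train),
            range(start + train, start + train + test),
        )
        start += step

for tr, te in walk_forward_windows(n=10, train=4, test=2, step=2):
    print(list(tr), list(te))
```

Because each test window always lies after its training window, the evaluation never leaks future data, which is what makes it a fair proxy for production retraining cycles.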
After passing walk-forward validation:
- Model integrated into trading execution layer
- Models deployed to MetaTrader 5
- Signals streamed from Live Endpoint API
- Automated trade execution
- Demo or live mode
- Execution logic separated from model logic
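The separation of execution logic from model logic can be sketched as a thin mapping from API signals to order requests. The dict shape mirrors the MetaTrader5 Python package's `order_send()` request format, but the constants here are string placeholders; real code would use values such as `mt5.TRADE_ACTION_DEAL` and `mt5.ORDER_TYPE_BUY` from the `MetaTrader5` module:

```python
# Map a signal from the live endpoint API to an MT5-style order request.
# Constant values are placeholders, not real MetaTrader5 constants.
def signal_to_order(signal: dict, volume: float):
    if signal["action"] not in ("buy", "sell"):
        return None  # "hold" and unknown actions place no order
    return {
        "action": "TRADE_ACTION_DEAL",  # placeholder for mt5.TRADE_ACTION_DEAL
        "symbol": signal["symbol"],
        "volume": volume,
        "type": "ORDER_TYPE_BUY" if signal["action"] == "buy"
                else "ORDER_TYPE_SELL",
        "comment": "model signal",
    }

print(signal_to_order({"symbol": "EURUSD", "action": "buy"}, volume=0.10))
```

Keeping this mapping as the only bridge between the two layers means the model can be retrained or swapped without touching execution code.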
Experiment tracking captures:
- Model versions
- Hyperparameters
- Training metrics
- Artifacts
Provides:
- Full audit trail
- Experiment comparison
- Reproducibility
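The document does not name its tracking tool; assuming MLflow, a logged run might look like this (run name, parameter, and metric names are all illustrative):

```python
# Illustrative experiment-tracking sketch assuming MLflow.
params = {"model": "autoencoder", "lr": 1e-3, "window": 512}  # hyperparameters
metrics = {"val_auc": 0.87, "sharpe": 1.4}                    # training metrics

try:
    import mlflow
    with mlflow.start_run(run_name="tick-anomaly-v1"):
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        # mlflow.log_artifact("model.pkl") would attach model artifacts
except ImportError:
    # Tracking is optional in this sketch; just show what would be logged.
    print("mlflow not installed; would log:", params, metrics)
```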
Design principles:
- Reproducibility: Raw data preserved
- Scalability: Distributed processing with Spark
- Separation of Concerns: Data, ML, and execution isolated
- Observability: Monitoring at all stages
- Flexibility: Components can be upgraded independently
Planned improvements:
- Alternative data sources
- Advanced feature engineering
- Improved model architectures
- Risk management layers
- Enhanced monitoring dashboards
This architecture represents a structured, evolving ML trading system built for continuous research and production deployment.
YouTube: https://youtu.be/igRmHdnu1u0
