Can classical ML predict which portfolios will beat the monthly median, using only Fama-French five-factor returns? That's the question. This repo has the data pipeline, four trained classifiers, and an IEEE conference paper with the results.
Best model: SVM (RBF) — 74.5% accuracy, 0.823 ROC-AUC on held-out test data.
Each month, 25 size/value-sorted portfolios either beat or miss the cross-sectional median return. We frame that as a binary classification problem. Features are the five FF5 factor realizations for that month plus the portfolio's size and value quintile.
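That labeling step can be sketched in a few lines of pandas. Column names here (`date`, `portfolio`, `ret`, `beat_median`) are illustrative; the actual names are set in notebook 1:

```python
# Sketch of the binary label: for each month, a portfolio gets label 1
# if its return beats that month's cross-sectional median across the
# 25 portfolios, else 0.
import pandas as pd

def add_beat_median_label(long_df: pd.DataFrame) -> pd.DataFrame:
    """long_df: one row per (date, portfolio) with a 'ret' column."""
    # Per-month median, broadcast back to every row of that month
    median = long_df.groupby("date")["ret"].transform("median")
    out = long_df.copy()
    out["beat_median"] = (out["ret"] > median).astype(int)
    return out

# Toy example: two months, three portfolios each
toy = pd.DataFrame({
    "date": ["2020-01"] * 3 + ["2020-02"] * 3,
    "portfolio": ["SMALL_LoBM", "ME3_BM3", "BIG_HiBM"] * 2,
    "ret": [1.0, 2.0, 3.0, -1.0, 0.5, -2.0],
})
labeled = add_beat_median_label(toy)
```

With 25 portfolios per month the label is close to balanced by construction, since roughly half beat the median each month.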
Download both files manually from the Kenneth R. French Data Library:
| File | What to download | Save as |
|---|---|---|
| FF5 factors | Fama/French 5 Factors (2x3) [Monthly] | data/raw/ff5-factors-monthly.csv |
| 25 portfolios | 25 Portfolios Formed on Size and Book-to-Market (5x5) [Monthly] | data/raw/portfolio-25-size-value.csv |
`data/raw/` is gitignored; the CSVs are free to download, so they don't need to live in the repo.
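One caveat when loading them: CSVs from the French library ship with a text preamble and trailing annual tables, so a plain `pd.read_csv` usually chokes. A hedged loader sketch — the `skiprows` count is an assumption, so open your downloaded file and count the preamble lines yourself:

```python
# Hedged sketch, not the notebook's exact code: read a French-library
# CSV and keep only the monthly block.
import pandas as pd

def load_french_monthly(path_or_buf, skiprows=3):
    df = pd.read_csv(path_or_buf, skiprows=skiprows, index_col=0)
    # Keep only rows whose index is a YYYYMM month stamp; this drops
    # the annual block and the copyright footer.
    monthly = df[df.index.astype(str).str.strip().str.fullmatch(r"\d{6}")]
    # Remaining cells may still be strings; coerce to numeric
    return monthly.apply(pd.to_numeric, errors="coerce")
```

Note that French-library returns are quoted in percent (e.g. `5.07` means 5.07%); whether to divide by 100 depends on how notebook 1 builds the features.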
```
ff5-portfolio-classification/
├── data/
│   ├── raw/                         # gitignored; download manually
│   └── processed/
│       └── ml-dataset.csv           # generated by notebook 1
├── notebooks/
│   ├── 1-data-preprocessing.ipynb
│   ├── 2-exploratory-analysis.ipynb
│   ├── 3-model-training.ipynb
│   └── 4-results-evaluation.ipynb
├── results/
│   ├── model-comparison.csv
│   ├── roc-curves.png
│   ├── confusion-matrices.png
│   ├── feature-importance.png
│   ├── model-comparison-bar.png
│   └── precision-recall-curves.png
├── paper/
│   ├── paper.tex
│   ├── IEEEtrans.cls
│   └── references.bib
├── requirements.txt
└── README.md
```
```
pip install -r requirements.txt
```

Run the notebooks in order:

1. `1-data-preprocessing.ipynb` — merges factor and portfolio data, creates the binary label, saves `data/processed/ml-dataset.csv`
2. `2-exploratory-analysis.ipynb` — distributions, correlations, win-rate heatmaps
3. `3-model-training.ipynb` — trains LR, SVM, RF, and XGBoost with TimeSeriesSplit CV; saves `results/model-comparison.csv` and `results/model-artifacts.pkl`
4. `4-results-evaluation.ipynb` — generates all figures (ROC curves, confusion matrices, feature importance, etc.)
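The core of the notebook-3 training loop can be sketched as follows. This is an illustrative reconstruction, not the notebook's exact code: `TimeSeriesSplit` keeps each validation fold strictly later than its training months, which avoids look-ahead leakage in monthly data.

```python
# Sketch of TimeSeriesSplit CV for one model (the SVM from the results
# table); hyperparameters are scikit-learn defaults, not the tuned ones.
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def time_series_cv_auc(X, y, n_splits=5):
    # Scaling matters for the RBF kernel; putting the scaler inside the
    # pipeline refits it per fold, so validation months never leak into
    # the scaler's statistics.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    cv = TimeSeriesSplit(n_splits=n_splits)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
```

The same pipeline-plus-`TimeSeriesSplit` pattern applies to the other three classifiers, with only the final estimator swapped.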
To launch:

```
jupyter notebook
```

| Model | CV ROC-AUC | Test Accuracy | Test ROC-AUC |
|---|---|---|---|
| Logistic Regression | 0.533 | 49.3% | 0.471 |
| SVM (RBF) | 0.807 | 74.5% | 0.823 |
| Random Forest | 0.787 | 72.6% | 0.809 |
| Gradient Boosting (XGBoost) | 0.809 | 73.4% | 0.820 |
Train: 1963-07-31 to 2013-08-31. Test: 2013-08-31 to 2026-02-28.
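That split amounts to a single chronological holdout: fit on every month up to the cutoff, score on the strictly later months, and never shuffle. A sketch under assumed variable names (the notebooks may organize this differently):

```python
# Illustrative chronological train/test evaluation for the RBF SVM;
# `month_index` is any sortable per-row date key, `cutoff` the split point.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def chronological_eval(X, y, month_index, cutoff):
    train = month_index <= cutoff          # e.g. months up to 2013-08
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(X[train], y[train])
    preds = model.predict(X[~train])
    scores = model.decision_function(X[~train])  # for ROC-AUC
    return (accuracy_score(y[~train], preds),
            roc_auc_score(y[~train], scores))
```

A shuffled split would let the model train on months that come after its test months, which inflates scores on financial data; the chronological cut is what makes the test numbers in the table meaningful.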
The paper is in `paper/paper.tex` (IEEE conference format). To compile:

```
cd paper
pdflatex paper.tex
bibtex paper
pdflatex paper.tex
pdflatex paper.tex
```