# FF5 Portfolio Classification

Can classical ML predict which portfolios will beat the monthly median, using only the Fama-French five-factor returns and each portfolio's sort characteristics? This repo answers that question with a data pipeline, four trained classifiers, and an IEEE-format conference paper reporting the results.

**Best model:** SVM (RBF) — 74.5% accuracy, 0.823 ROC-AUC on held-out test data.

## What's the task?

Each month, 25 size/value-sorted portfolios either beat or miss the cross-sectional median return. We frame that as a binary classification problem. Features are the five FF5 factor realizations for that month plus the portfolio's size and value quintile.
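The labeling step can be sketched in a few lines of pandas. This is a minimal illustration of the rule described above, not the repo's exact code; the column names (`date`, `portfolio`, `ret`) are illustrative:

```python
import pandas as pd

# Toy long-format panel: one row per (month, portfolio) with that month's return.
df = pd.DataFrame({
    "date": ["2020-01"] * 4 + ["2020-02"] * 4,
    "portfolio": ["SMALL-LoBM", "SMALL-HiBM", "BIG-LoBM", "BIG-HiBM"] * 2,
    "ret": [1.2, 0.4, -0.3, 2.1, 0.0, -1.0, 0.5, 0.7],
})

# Binary label: 1 if the portfolio beats that month's cross-sectional median.
median = df.groupby("date")["ret"].transform("median")
df["beats_median"] = (df["ret"] > median).astype(int)
```

By construction roughly half the portfolios get label 1 each month, so the classes are balanced and 50% accuracy is the natural baseline.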

## Data

Download both files manually from the Kenneth R. French Data Library:

| File | What to download | Save as |
| --- | --- | --- |
| FF5 factors | Fama/French 5 Factors (2x3) [Monthly] | `data/raw/ff5-factors-monthly.csv` |
| 25 portfolios | 25 Portfolios Formed on Size and Book-to-Market (5x5) [Monthly] | `data/raw/portfolio-25-size-value.csv` |

`data/raw/` is gitignored — the CSVs are free to download and don't need to be in the repo.
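Note that the French-library CSVs are not clean tables: they open with a free-text preamble and often append annual summary tables after the monthly block. A defensive loader might look like the sketch below. The helper name and the parsing heuristic are mine, not the repo's — verify the layout against the files you actually download:

```python
import io
import re
import pandas as pd

def load_french_csv(text):
    """Parse the monthly block of a Kenneth French library CSV from raw text.

    Locate the first monthly row (a 6-digit YYYYMM date), keep the contiguous
    run of such rows, and take column names from the line just above (whose
    first cell is blank in these files).
    """
    lines = text.splitlines()
    is_data = [bool(re.match(r"\s*\d{6}\s*,", ln)) for ln in lines]
    start = is_data.index(True)
    end = start
    while end < len(lines) and is_data[end]:
        end += 1
    header = "Date" + lines[start - 1]
    return pd.read_csv(io.StringIO("\n".join([header] + lines[start:end])))

# Tiny stand-in for a real download: preamble + two monthly rows + footer.
sample = (
    "This file was created using the Fama/French factor data.\n"
    "\n"
    ",Mkt-RF,SMB,HML,RMW,CMA,RF\n"
    "196307,-0.39,-0.41,-0.97,0.68,-1.18,0.27\n"
    "196308,5.07,-0.80,1.80,0.36,-0.35,0.25\n"
    "\n"
    "Annual Factors: January-December\n"
)
factors = load_french_csv(sample)
```

Scanning for the first date-like row avoids hard-coding `skiprows`, which differs between the factor and portfolio files and can shift when the library is updated.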

## Repo layout

```
ff5-portfolio-classification/
├── data/
│   ├── raw/                          # gitignored; download manually
│   └── processed/
│       └── ml-dataset.csv            # generated by notebook 1
├── notebooks/
│   ├── 1-data-preprocessing.ipynb
│   ├── 2-exploratory-analysis.ipynb
│   ├── 3-model-training.ipynb
│   └── 4-results-evaluation.ipynb
├── results/
│   ├── model-comparison.csv
│   ├── roc-curves.png
│   ├── confusion-matrices.png
│   ├── feature-importance.png
│   ├── model-comparison-bar.png
│   └── precision-recall-curves.png
├── paper/
│   ├── paper.tex
│   ├── IEEEtrans.cls
│   └── references.bib
├── requirements.txt
└── README.md
```

## Setup

```bash
pip install -r requirements.txt
```

## Running

Start Jupyter, then run the notebooks in order:

```bash
jupyter notebook
```

1. `1-data-preprocessing.ipynb` — merges factor and portfolio data, creates the binary label, saves `data/processed/ml-dataset.csv`
2. `2-exploratory-analysis.ipynb` — distributions, correlations, win-rate heatmaps
3. `3-model-training.ipynb` — trains LR, SVM, RF, and XGBoost with TimeSeriesSplit CV; saves `results/model-comparison.csv` and `results/model-artifacts.pkl`
4. `4-results-evaluation.ipynb` — generates all figures (ROC curves, confusion matrices, feature importance, etc.)
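The TimeSeriesSplit cross-validation used in the training notebook can be sketched as follows. The model and hyperparameters here are illustrative placeholders on synthetic data, not the repo's tuned configuration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 7 features (5 factors + size/value quintiles), binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Expanding-window CV: every fold trains on the past and validates on the
# future, so there is no look-ahead leakage. The RBF kernel is scale-sensitive,
# hence the StandardScaler in the pipeline.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
```

Using an ordinary shuffled KFold here would leak future factor realizations into training folds and inflate the CV scores; TimeSeriesSplit is the standard guard against that for monthly panels.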

## Results

| Model | CV ROC-AUC | Test Accuracy | Test ROC-AUC |
| --- | --- | --- | --- |
| Logistic Regression | 0.533 | 49.3% | 0.471 |
| SVM (RBF) | 0.807 | 74.5% | 0.823 |
| Random Forest | 0.787 | 72.6% | 0.809 |
| Gradient Boosting (XGBoost) | 0.809 | 73.4% | 0.820 |

Train: 1963-07-31 to 2013-08-31. Test: 2013-08-31 to 2026-02-28.
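A chronological split like the one above is just a date mask; a minimal sketch with toy data (the choice of `<=` for train and `>` for test at the 2013-08-31 boundary is my assumption about how the overlap is resolved):

```python
import pandas as pd

# Toy monthly observations around the repo's split point.
dates = pd.to_datetime(["2013-06-30", "2013-07-31", "2013-08-31",
                        "2013-09-30", "2013-10-31"])
df = pd.DataFrame({"date": dates, "ret": [0.1, -0.2, 0.3, 0.0, 0.4]})

# Chronological split: no shuffling, no overlap between train and test.
split = pd.Timestamp("2013-08-31")
train = df[df["date"] <= split]
test = df[df["date"] > split]
```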

## Paper

The paper is in `paper/paper.tex` (IEEE conference format). To compile:

```bash
cd paper
pdflatex paper.tex
bibtex paper
pdflatex paper.tex
pdflatex paper.tex
```
