A minimal recommendation system framework using Metaflow for workflow orchestration.
This repository demonstrates how to build a production-ready recommendation system using Metaflow to solve the cross-selling problem.
Data: Uses Amazon product dataset (May 1996 – July 2014) with product metadata and user reviews. The data includes product relationships like "also_bought", "also_viewed", and "bought_together" to understand customer purchasing patterns.
Approach: Transforms user-item interactions into a graph-based representation where products are nodes and relationships (co-purchases, similar items) are edges. Generates random walks through the product graph to create training sequences for embedding models.
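The graph construction and random-walk step can be sketched as follows. This is a minimal illustration, not the repository's implementation: the record layout (an `asin` id plus a `related` dictionary), the undirected unweighted edges, and the helper names `build_graph` and `random_walks` are all assumptions made for the example.

```python
import random
from collections import defaultdict

RELATIONS = ("also_bought", "also_viewed", "bought_together")

def build_graph(products):
    """Build an undirected adjacency map from product relationship lists."""
    graph = defaultdict(set)
    for product in products:
        source = product["asin"]  # assumed id field
        for relation in RELATIONS:
            for target in product.get("related", {}).get(relation, []):
                graph[source].add(target)
                graph[target].add(source)
    return graph

def random_walks(graph, walks_per_node=2, walk_length=5, seed=0):
    """Generate fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = sorted(graph[walk[-1]])
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy metadata records standing in for the real dataset (hypothetical values).
products = [
    {"asin": "A", "related": {"also_bought": ["B", "C"]}},
    {"asin": "B", "related": {"also_viewed": ["C"]}},
    {"asin": "D", "related": {"bought_together": ["A"]}},
]
graph = build_graph(products)
walks = random_walks(graph)
```

Each walk is a sequence of product ids, which lets an embedding model treat it like a sentence of words.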
Models: Supports Word2Vec for product embeddings using skip-gram architecture, and Matrix Factorization for collaborative filtering, both with optional bias terms.
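To illustrate how skip-gram training data can be derived from the walk sequences, the sketch below pairs each product in a walk with its neighbors inside a context window. The function name `skipgram_pairs` and the `window` parameter are hypothetical; the repository may build its pairs differently (for example with negative sampling or subsampling).

```python
def skipgram_pairs(walks, window=2):
    """Return (center, context) pairs for every position in every walk."""
    pairs = []
    for walk in walks:
        for i, center in enumerate(walk):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a product is not its own context
                    pairs.append((center, walk[j]))
    return pairs

# One toy walk with a window of 1: each product is paired with its direct neighbors.
pairs = skipgram_pairs([["A", "B", "C", "D"]], window=1)
```

A skip-gram model is then trained to predict the context product from the center product, so products that co-occur in walks end up with similar embeddings.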
recsys-metaflow/
├── flows/ # Metaflow pipeline definitions
│ ├── data_flow.py # Data preparation pipeline
│ ├── model_flow.py # Model training pipeline
│ └── recommendation_deploy_flow.py # Recommendation generation pipeline
├── models/ # Model implementations
│ ├── word2vec.py # Word2Vec model
│ ├── matrix_factorization.py # Matrix factorization model
│ ├── training.py # Training utilities
│ └── datasets.py # Dataset classes
├── notebooks/ # Jupyter notebooks for exploration
└── DATA_README.md # Data documentation
The flows declare their dependencies with the @conda_base decorator, so Metaflow handles dependencies automatically when you run them.
To run, develop, and debug the flows from your workstation,
create a local environment with Metaflow and any dependencies needed by workflow steps that run locally.
First, install mamba or use your preferred alternative conda version. Then run:
mamba env create -f environment-ob.yaml
mamba activate recsys-metaflow-ob
or
mamba env create -f environment-oss.yaml
mamba activate recsys-metaflow-oss
Processes Amazon dataset into graph relationships and training sequences:
python flows/data_flow.py --environment=fast-bakery run --category Electronics --sample_size 1000
Note: fast-bakery is an Outerbounds-only offering. Open-source users can use pypi, conda, or build Docker images and assign tasks to them through compute decorators like @kubernetes or @slurm.
Parameters:
- category: Amazon category (Electronics, Books, All_Beauty)
- sample_size: Number of records for testing (-1 for all data)
- test_size: Train/validation split ratio (default: 0.33)
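As a rough illustration of how the sample_size and test_size parameters could interact, the sketch below truncates the input to a sample and then splits it into training and validation sets. The helper `sample_and_split`, the choice to take the first records as the sample, and the shuffling seed are assumptions for the example, not the flow's actual logic.

```python
import random

def sample_and_split(records, sample_size=-1, test_size=0.33, seed=42):
    """Optionally truncate the records, then split into (train, validation)."""
    rng = random.Random(seed)
    records = list(records)
    if sample_size != -1:  # -1 means "use all data"
        records = records[:sample_size]
    rng.shuffle(records)
    n_val = int(round(len(records) * test_size))
    return records[n_val:], records[:n_val]

# 100 sampled records with the default ratio give a 67/33 split.
train, val = sample_and_split(range(1000), sample_size=100)
```

Using a small sample_size keeps test runs fast, while -1 processes the full category.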
Trains recommendation models on processed data (triggered by DataFlow completion):
python flows/model_flow.py --environment=fast-bakery --trigger DataFlow/1401 run --model_type word2vec --embedding_dim 128 --epochs 5
Parameters:
- --trigger DataFlow/[run_id]: Run ID from a completed DataFlow run (e.g., DataFlow/1401)
  - This is useful for debugging. When deploying to a production orchestrator, Metaflow passes this information along automatically.
- model_type: Model to train (word2vec, mf, or mf_bias)
- embedding_dim: Embedding dimension (default: 128)
- batch_size: Training batch size (default: 128)
- epochs: Number of training epochs (default: 5)
- learning_rate: Learning rate (default: 0.01)
Generates recommendations using trained models (triggered by ModelFlow completion):
python flows/recommendation_deploy_flow.py --environment=fast-bakery --trigger ModelFlow/1402 run --top_k 10
Parameters:
- --trigger ModelFlow/[run_id]: Run ID from completed ModelFlow (e.g., ModelFlow/1402)
- top_k: Number of recommendations to generate (default: 10)
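One common way to turn product embeddings into top_k recommendations is nearest-neighbor search by cosine similarity. The sketch below assumes that approach; the source does not state which similarity measure the flow uses, and the names `cosine` and `recommend` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(embeddings, product_id, top_k=10):
    """Return the top_k products most similar to product_id."""
    query = embeddings[product_id]
    scores = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != product_id  # never recommend the query product itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:top_k]]

# Toy 2-dimensional embeddings (made-up values for illustration only).
embeddings = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [-1.0, 0.0],
}
top = recommend(embeddings, "A", top_k=2)
```

For a large catalog, an approximate nearest-neighbor index would usually replace this exhaustive scan.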
The implementation includes:
- Word2Vec: Skip-gram architecture for product embeddings
- Matrix Factorization: Classic collaborative filtering approach
- Matrix Factorization with Bias: Enhanced MF with bias terms
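To make the bias-term variant concrete, here is a small pure-Python sketch of matrix factorization with user and item biases, trained by stochastic gradient descent. It assumes explicit (user, item, rating) triples and the textbook prediction `mu + b_u + b_i + p_u · q_i`; the repository's models, data layout, and hyperparameters may differ, and `train_mf_bias` is a hypothetical name.

```python
import random

def train_mf_bias(ratings, n_users, n_items, dim=4, epochs=200,
                  lr=0.05, reg=0.01, seed=0):
    """Fit biased matrix factorization with SGD and return a predict function."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]
    bu = [0.0] * n_users  # user biases
    bi = [0.0] * n_items  # item biases
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))

    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for k in range(dim):
                pu, qi = P[u][k], Q[i][k]
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return predict

# Toy ratings as (user, item, rating) triples (made-up values).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
predict = train_mf_bias(ratings, n_users=3, n_items=3)
mse = sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)
```

Dropping the `bu` and `bi` terms from `predict` and the update loop gives the plain matrix factorization variant.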
This example was authored by Servando Torres at ControlThrive in collaboration with Outerbounds.
If you need assistance building systems like this, don't hesitate to reach out to [email protected].

