Metaflow RecSys

A minimal recommendation system framework using Metaflow for workflow orchestration.

Overview

This repository demonstrates how to build a production-ready recommendation system using Metaflow to solve the cross-selling problem.

Data: Uses the Amazon product dataset (May 1996 – July 2014), which contains product metadata and user reviews. The data includes product relationships such as "also_bought", "also_viewed", and "bought_together", which capture customer purchasing patterns.

Approach: Transforms user-item interactions into a graph-based representation where products are nodes and relationships (co-purchases, similar items) are edges. Generates random walks through the product graph to create training sequences for embedding models.
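The random-walk step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function names (build_graph, random_walks), the parameters (walks_per_node, walk_length), and the toy edge list are all assumptions made for this example.

```python
import random

def build_graph(edges):
    """Build an undirected adjacency list from (product_a, product_b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    # Sort neighbors so walks are reproducible for a given seed.
    return {node: sorted(neighbors) for node, neighbors in graph.items()}

def random_walks(graph, walks_per_node=2, walk_length=5, seed=0):
    """Generate uniform random walks starting from every node (hypothetical helper)."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy co-purchase edges (made-up product IDs).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
graph = build_graph(edges)
walks = random_walks(graph)
```

Each walk is a sequence of product IDs that can be treated like a sentence when training an embedding model.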

Models: Supports Word2Vec (skip-gram architecture) for product embeddings and Matrix Factorization for collaborative filtering, the latter with optional bias terms.
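To make the skip-gram idea concrete, the sketch below turns one product sequence into (center, context) training pairs. It is an illustration only: skipgram_pairs and its window parameter are hypothetical names, and the repository's own dataset classes may build pairs differently.

```python
def skipgram_pairs(sequence, window=2):
    """Return (center, context) pairs for every item within `window` positions."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an item is never its own context
                pairs.append((center, sequence[j]))
    return pairs

# A toy walk over made-up product IDs.
walk = ["A", "B", "C", "D"]
pairs = skipgram_pairs(walk, window=1)
```

A skip-gram model is then trained to predict the context product from the center product, so products that co-occur in walks end up with similar embeddings.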

Structure

recsys-metaflow/
├── flows/                    # Metaflow pipeline definitions
│   ├── data_flow.py          # Data preparation pipeline
│   ├── model_flow.py         # Model training pipeline 
│   └── recommendation_deploy_flow.py # Recommendation generation pipeline
├── models/                   # Model implementations
│   ├── word2vec.py           # Word2Vec model
│   ├── matrix_factorization.py # Matrix factorization model
│   ├── training.py           # Training utilities
│   └── datasets.py           # Dataset classes
├── notebooks/                # Jupyter notebooks for exploration
└── DATA_README.md            # Data documentation

Getting Started

Running the Flows

Workstation setup

The flows use Metaflow's @conda_base decorator, so Metaflow handles step dependencies automatically when a flow runs. You still need a workstation environment in which to run, develop, and debug the flows: create a local environment with Metaflow and any dependencies required by workflow steps that run locally.

First, install mamba, or use your preferred conda-compatible alternative. Then run:

mamba env create -f environment-ob.yml
mamba activate recsys-metaflow-ob

or

mamba env create -f environment-oss.yml
mamba activate recsys-metaflow-oss

Data Flow

Processes the Amazon dataset into graph relationships and training sequences:

python flows/data_flow.py --environment=fast-bakery run --category Electronics --sample_size 1000

Note: fast-bakery is an Outerbounds-only offering. Open-source users can use pypi or conda, or build Docker images and assign tasks to them through compute decorators such as @kubernetes or @slurm.

Parameters:

category: Amazon category (Electronics, Books, All_Beauty)
sample_size: Number of records for testing (-1 for all data)
test_size: Train/validation split ratio (default: 0.33)
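As a rough illustration of how sample_size and test_size interact, the sketch below subsamples a list of records and splits it into training and validation sets. How data_flow.py actually samples and splits is not shown here, so the helper name sample_and_split, the shuffling, and the rounding rule are assumptions.

```python
import random

def sample_and_split(records, sample_size=-1, test_size=0.33, seed=0):
    """Optionally truncate to `sample_size` records, then split train/validation.

    sample_size=-1 keeps all records, mirroring the parameter description above.
    """
    rng = random.Random(seed)
    records = list(records)
    if sample_size != -1:
        records = records[:sample_size]
    rng.shuffle(records)
    n_val = int(round(len(records) * test_size))
    # First n_val shuffled records go to validation, the rest to training.
    return records[n_val:], records[:n_val]

train, val = sample_and_split(range(100), sample_size=-1, test_size=0.33)
small_train, small_val = sample_and_split(range(100), sample_size=10)
```

With the default test_size of 0.33, roughly one third of the sampled records are held out for validation.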

Model Flow

Trains recommendation models on processed data (triggered by DataFlow completion):

python flows/model_flow.py --environment=fast-bakery --trigger DataFlow/1401 run --model_type word2vec --embedding_dim 128 --epochs 5

Parameters:

--trigger DataFlow/[run_id]: Run ID from completed DataFlow (e.g., DataFlow/1401)
- This is useful for debugging. When the flows are deployed to a production orchestrator, Metaflow passes this information along automatically.
model_type: Model to train (word2vec, mf, or mf_bias)
embedding_dim: Embedding dimension (default: 128)
batch_size: Training batch size (default: 128)
epochs: Number of training epochs (default: 5)
learning_rate: Learning rate (default: 0.01)

Recommendation Flow

Generates recommendations using trained models (triggered by ModelFlow completion):

python flows/recommendation_deploy_flow.py --environment=fast-bakery --trigger ModelFlow/1402 run --top_k 10

Parameters:

--trigger ModelFlow/[run_id]: Run ID from completed ModelFlow (e.g., ModelFlow/1402)
top_k: Number of recommendations to generate (default: 10)
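The top_k parameter can be understood through a nearest-neighbor lookup over product embeddings, sketched below. This is an assumption about the general technique, not the code in recommendation_deploy_flow.py: the helper names, the use of cosine similarity, and the toy embeddings are all illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_similar(embeddings, query_id, top_k=10):
    """Return the IDs of the top_k products most similar to query_id."""
    query = embeddings[query_id]
    scored = [
        (cosine(query, vec), pid)
        for pid, vec in embeddings.items()
        if pid != query_id  # never recommend the query product itself
    ]
    # Highest similarity first; ties broken by product ID for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [pid for _, pid in scored[:top_k]]

# Made-up 2-dimensional embeddings for illustration.
embeddings = {
    "camera": [1.0, 0.0],
    "lens": [0.9, 0.1],
    "tripod": [0.5, 0.5],
    "novel": [0.0, 1.0],
}
recs = top_k_similar(embeddings, "camera", top_k=2)
```

For large catalogs, a brute-force scan like this is usually replaced by a vectorized or approximate nearest-neighbor search.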

Models

The implementation includes:

Word2Vec: Skip-gram architecture for product embeddings
Matrix Factorization: Classic collaborative filtering approach
Matrix Factorization with Bias: Enhanced MF with bias terms
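The difference between the plain and biased matrix factorization variants comes down to the scoring function. The sketch below shows one common formulation, a dot product of two embedding vectors plus optional bias terms. The function name and argument names are hypothetical, and models/matrix_factorization.py may define its biases differently.

```python
def predict(user_vec, item_vec, user_bias=0.0, item_bias=0.0, global_bias=0.0):
    """Score = dot(user_vec, item_vec) + optional bias terms.

    With all biases left at 0.0 this reduces to plain matrix factorization.
    """
    dot = sum(u * v for u, v in zip(user_vec, item_vec))
    return dot + user_bias + item_bias + global_bias

# Toy 2-dimensional embeddings (made-up values).
plain = predict([0.5, 1.0], [2.0, 1.0])
biased = predict([0.5, 1.0], [2.0, 1.0], user_bias=0.25, item_bias=-0.5, global_bias=3.0)
```

The bias terms let the model absorb effects such as a product being popular overall, so the embeddings can focus on the interaction between the two sides.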

This example was authored by Servando Torres at ControlThrive in collaboration with Outerbounds.

If you need assistance building systems like this, don't hesitate to reach out to [email protected].
