outerbounds/recsys-metaflow

Metaflow RecSys

Business Problem

A minimal recommendation system framework using Metaflow for workflow orchestration.

Overview

Metaflow System Architecture

This repository demonstrates how to build a production-ready recommendation system using Metaflow to solve the cross-selling problem.

Data: Uses Amazon product dataset (May 1996 – July 2014) with product metadata and user reviews. The data includes product relationships like "also_bought", "also_viewed", and "bought_together" to understand customer purchasing patterns.

Approach: Transforms user-item interactions into a graph-based representation where products are nodes and relationships (co-purchases, similar items) are edges. Generates random walks through the product graph to create training sequences for embedding models.
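The graph-and-walk idea can be sketched in a few lines of plain Python. This is a minimal illustration and not the code in `flows/data_flow.py`: the product IDs, the `build_graph` and `random_walks` helpers, and the uniform neighbour sampling are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_graph(relations):
    """Build an undirected adjacency list from (product, related_product) pairs."""
    graph = defaultdict(set)
    for a, b in relations:
        if a == b:
            continue  # ignore self-loops
        graph[a].add(b)
        graph[b].add(a)
    # Sort neighbours so walks are reproducible for a given seed.
    return {node: sorted(neigh) for node, neigh in graph.items()}


def random_walks(graph, walks_per_node=2, walk_length=5, seed=0):
    """Generate uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = graph[walk[-1]]
                if not neighbours:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Hypothetical product IDs and "also_bought"-style pairs, for illustration only.
relations = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
graph = build_graph(relations)
walks = random_walks(graph)
```

Each walk is a sequence of product IDs that an embedding model can treat the way a language model treats a sentence.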

Models: Supports Word2Vec (skip-gram architecture) for product embeddings and Matrix Factorization for collaborative filtering, the latter with an optional bias-term variant.
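To make the skip-gram idea concrete, the sketch below turns one sequence of product IDs into (center, context) training pairs. It is an illustration under assumptions, not the code in `models/word2vec.py`: the `skipgram_pairs` helper, the window size, and the example sequence are hypothetical.

```python
def skipgram_pairs(walk, window=2):
    """Return (center, context) pairs for every item within `window` positions."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# Hypothetical sequence of product IDs.
walk = ["A", "B", "C", "D"]
pairs = skipgram_pairs(walk, window=1)
```

A skip-gram model is then trained to predict the context product from the center product, so products that appear near each other in sequences end up with similar embeddings.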

Structure

recsys-metaflow/
├── flows/                    # Metaflow pipeline definitions
│   ├── data_flow.py          # Data preparation pipeline
│   ├── model_flow.py         # Model training pipeline 
│   └── recommendation_deploy_flow.py # Recommendation generation pipeline
├── models/                   # Model implementations
│   ├── word2vec.py           # Word2Vec model
│   ├── matrix_factorization.py # Matrix factorization model
│   ├── training.py           # Training utilities
│   └── datasets.py          # Dataset classes
├── notebooks/                # Jupyter notebooks for exploration
└── DATA_README.md           # Data documentation

Getting Started

Running the Flows

Workstation setup

The flows use Metaflow's @conda_base decorator, so Metaflow handles each flow's dependencies automatically at run time. To run, develop, and debug the flows from your workstation, create a local environment that contains Metaflow and any dependencies needed by workflow steps that run locally.

First, install mamba, or use your preferred conda-compatible alternative. Then run:

mamba env create -f environment-ob.yaml
mamba activate recsys-metaflow-ob

or

mamba env create -f environment-oss.yaml
mamba activate recsys-metaflow-oss

Data Flow

Processes Amazon dataset into graph relationships and training sequences:

python flows/data_flow.py --environment=fast-bakery run --category Electronics --sample_size 1000

Note: fast-bakery is an Outerbounds-only offering. Open-source users can use the pypi or conda environments, or build Docker images and assign tasks to them through compute decorators such as @kubernetes or @slurm.

Parameters:

  • category: Amazon category (Electronics, Books, All_Beauty)
  • sample_size: Number of records for testing (-1 for all data)
  • test_size: Train/validation split ratio (default: 0.33)
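As a rough sketch of how `sample_size` and `test_size` might interact, the helper below truncates the records and then splits them. This is not the logic in `flows/data_flow.py`: the `sample_and_split` function is hypothetical, and it assumes that `test_size` is the fraction held out for validation and that sampling keeps the first N records.

```python
import random


def sample_and_split(records, sample_size=-1, test_size=0.33, seed=0):
    """Optionally truncate the records, then split into (train, validation)."""
    rng = random.Random(seed)
    records = list(records)
    if sample_size != -1:
        records = records[:sample_size]  # assumption: keep the first N records
    rng.shuffle(records)
    n_val = int(round(len(records) * test_size))
    return records[n_val:], records[:n_val]


# Stand-in integers in place of real review records.
train, val = sample_and_split(range(1000), sample_size=100, test_size=0.33)
```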

Model Flow

Trains recommendation models on processed data (triggered by DataFlow completion):

python flows/model_flow.py --environment=fast-bakery --trigger DataFlow/1401 run --model_type word2vec --embedding_dim 128 --epochs 5

Parameters:

  • --trigger DataFlow/[run_id]: Run ID from completed DataFlow (e.g., DataFlow/1401)
    • This is useful for debugging. When the flows are deployed to a production orchestrator, Metaflow passes this information along automatically.
  • model_type: Model to train (word2vec, mf, or mf_bias)
  • embedding_dim: Embedding dimension (default: 128)
  • batch_size: Training batch size (default: 128)
  • epochs: Number of training epochs (default: 5)
  • learning_rate: Learning rate (default: 0.01)

Recommendation Flow

Generates recommendations using trained models (triggered by ModelFlow completion):

python flows/recommendation_deploy_flow.py --environment=fast-bakery --trigger ModelFlow/1402 run --top_k 10

Parameters:

  • --trigger ModelFlow/[run_id]: Run ID from completed ModelFlow (e.g., ModelFlow/1402)
  • top_k: Number of recommendations to generate (default: 10)
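One common way to produce top-k recommendations from trained embeddings is a nearest-neighbour lookup. The sketch below assumes cosine similarity over a small in-memory dictionary. The `recommend` helper and the toy vectors are hypothetical, and the ranking used in `flows/recommendation_deploy_flow.py` may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recommend(query_id, embeddings, top_k=10):
    """Return the top_k product IDs most similar to query_id."""
    query = embeddings[query_id]
    scored = [
        (cosine(query, vec), pid)
        for pid, vec in embeddings.items()
        if pid != query_id  # never recommend the query product itself
    ]
    # Highest similarity first; ties broken by product ID for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [pid for _, pid in scored[:top_k]]


# Hypothetical 2-dimensional embeddings, for illustration only.
embeddings = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [-1.0, 0.0],
}
top = recommend("A", embeddings, top_k=2)
```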

Models

The implementation includes:

  1. Word2Vec: Skip-gram architecture for product embeddings
  2. Matrix Factorization: Classic collaborative filtering approach
  3. Matrix Factorization with Bias: Enhanced MF with bias terms
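For readers new to the bias variant, the sketch below shows the usual formulation: a prediction is the global mean plus a user bias, an item bias, and the dot product of the user and item factors, fitted with SGD. It is written in pure Python for illustration and is not the implementation in `models/matrix_factorization.py`. The `train_mf_bias` helper, the hyperparameters, and the toy ratings are assumptions.

```python
import random


def train_mf_bias(ratings, n_users, n_items, dim=4, epochs=200, lr=0.05, reg=0.01, seed=0):
    """Fit biased matrix factorization with SGD; returns (predict, mse_history)."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_items)]
    bu = [0.0] * n_users  # user biases
    bi = [0.0] * n_items  # item biases
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))

    def mse():
        return sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)

    history = [mse()]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for k in range(dim):
                pu, qi = P[u][k], Q[i][k]
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
        history.append(mse())
    return predict, history


# Hypothetical (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
predict, history = train_mf_bias(ratings, n_users=3, n_items=3)
```

Dropping `bu`, `bi`, and `mu` from `predict` gives the plain matrix factorization variant.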

This example was authored by Servando Torres at ControlThrive in collaboration with Outerbounds.

If you need assistance building systems like this, don't hesitate to reach out to [email protected].
