Commit eda70fd

jendefig and skshetry authored

jupyter to DVC blog (#5477)

* jupyter to DVC blog
* edits
* adding correct image paths
* add images
* fix: pin node version for Heroku build
* missing accent
* updating version

---------

Co-authored-by: Saugat Pachhai (सौगात) <[email protected]>

1 parent 18e8f87 commit eda70fd

File tree: 3 files changed, +350 −4 lines changed
Lines changed: 346 additions & 0 deletions
@@ -0,0 +1,346 @@
---
title:
  'Transforming a Jupyter Notebook into a Reproducible Pipeline for Experiments
  with DVC'
date: 2025-10-31
description: >
  Rob de Wit shares his Pokémon Generator project to demonstrate how you can
  move from a Jupyter Notebook prototype to a production-ready pipeline with
  DVC.
descriptionLong: >
  This blog post is an adaptation of Rob de Wit’s presentation on the subject
  using his Pokémon Generator project at PyData USA 2023. You can find [the
  video here](https://www.youtube.com/watch?v=sDhpIZQXe-w) and [the repo of the
  project here](https://github.com/RCdeWit/sd-pokemon-generator).

picture: 2025-10-31/jupyter-to-dvc-cover.png
pictureComment:
  Learn how to transform your Jupyter Notebook prototype into a
  production-ready DVC pipeline.
authors:
  - rob_dewit
tags:
  - Open Source
  - DVC
  - Tutorial
  - Guide
  - Jupyter Notebook
  - Pokémon
  - LoRA
---

When we experiment with machine learning models, it’s easy to get lost in the
cycle of trying new parameters, swapping datasets, or adjusting architectures.
That’s how progress is made, but without structure, reproducibility, and
tracking, you risk losing valuable results or being unable to explain why a
model worked (or failed).

In this post, I use a Pokémon generator I created with
[LoRA](https://huggingface.co/docs/diffusers/training/lora) (Low-Rank Adaptation
of Large Language Models) to demonstrate how I approach turning one-off
prototypes into structured, reproducible pipelines using versioned data,
parameters, and experiments.

![Pokémon image generated with LoRA](../uploads/images/2025-10-31/pokemon-generator-output.png '=600')

## Why Reproducibility Matters

Reproducibility is the backbone of science, and machine learning is no
different. A reproducible experiment means:

- The same combination of **data, code, and parameters** produces the same
  result.
- You can trace back decisions: **which dataset, which hyperparameters, which
  preprocessing steps**.
- **Collaboration** can be achieved. Your colleagues (and your future self!) can
  understand and build upon your work.

When I worked on earlier projects, we often had to reconstruct models after the
fact, trying to remember what went into training them. That experience convinced
me that reproducibility should be built into every pipeline from day one. That’s
why I treat every experiment as a **deterministic combination of code + data +
parameters**, and I build pipelines that make this explicit.

## Moving from Jupyter Notebook to a Reproducible Pipeline

Jupyter notebooks are a great tool for prototyping data science projects, but
they are not the best choice when you need to reproduce your results. Part of
their greatness is the ability to easily change cells and re-run sections,
visualize data in-line with code, and share analysis with narrative text. But
on the flip side, these benefits can lead to a breakdown in your ability to
accurately reproduce results. Notebooks are also challenging to test and scale,
and their dependencies are hard to manage. So how can we set up our pipeline
for reproducible success?

![Git plus DVC](../uploads/images/2025-10-31/git-plus-dvc.png '=600')

## Enter DVC

If you’ve ever tried to manage large files with Git, you’ve probably realized
that Git by itself is not sufficient. DVC operates like Git, but for large
data, models, and your machine learning experimentation process, versioning
everything along with the code. Let’s see how this works under the hood.

## How DVC Tracks Your Data

![DVC dataset tracking](../uploads/images/2025-10-31/dataset-images.png '=600')

In your Git repository, you have a main branch with commits and a branch with a
dataset, in this case a dataset of Pokémon images.

![DVC metadata](../uploads/images/2025-10-31/dataset-metadata.png '=600')

As image data are large, we do not want to keep them in Git, so DVC replaces
them with a metadata file. The metadata for the dataset contains the hash, the
size, the number of files, and other details.

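For illustration, here is roughly what such a metadata file looks like; the
hash, size, and file count below are placeholders rather than the project’s
real values:

```yaml
# pokemon.dvc (illustrative values)
outs:
- md5: 1ec64a68f448517d70625e2a6ed40bb4.dir # hash of the whole directory
  size: 365509886 # total size in bytes
  nfiles: 709 # number of files in the directory
  hash: md5
  path: pokemon
```
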
![DVC dataset hash](../uploads/images/2025-10-31/dataset-hash.png '=600')

The hash in Git points to the `.dvc/cache`, which is where the physical images
are actually stored on your file system.

![DVC new dataset hash](../uploads/images/2025-10-31/new-dataset-hash.png '=600')

If you create another commit with a different dataset (noted by a different font
on the left in the Git repo), a new hash will point to the new dataset in the
`.dvc/cache`. In this case, one image was removed and one was added, with two
staying the same.

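Getting a dataset under DVC control in the first place is a single command,
followed by committing the lightweight metadata to Git. A minimal sketch:

```bash
# Hash the images, store them in .dvc/cache, and write pokemon.dvc
$ dvc add data/external/pokemon

# Commit only the small metadata file (and the updated .gitignore) to Git
$ git add data/external/pokemon.dvc data/external/.gitignore
$ git commit -m "Track Pokémon images with DVC"
```
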
## How a DVC Pipeline Works

Below you will find the sections of the Jupyter Notebook on the left. Each of
these stages produces outputs, as seen on the right.

![Machine learning stages and outputs](../uploads/images/2025-10-31/pipeline-outputs.png '=600')

Once these outputs are specified, they can be used as dependencies for
downstream stages.

![Pipeline dependencies](../uploads/images/2025-10-31/pipeline-dependencies.png '=600')

Since the `train_lora` stage depends on the processed images, DVC ensures the
stage only runs once there are new images in the processed directory.
Conversely, if we make a change in the `train_lora` stage, DVC skips the
unchanged upstream stages, saving you development time, as the sketch below
shows.

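For example, if only the `train_lora` stage definition changed, a rerun might
look roughly like this (hypothetical, abbreviated output; the exact messages
depend on your DVC version):

```bash
$ dvc repro
Stage 'set_up_diffusers' didn't change, skipping
Stage 'scrape_pokemon_images' didn't change, skipping
Stage 'download_pokemon_stats' didn't change, skipping
Stage 'resize_pokemon_images' didn't change, skipping
Running stage 'train_lora':
> accelerate launch --mps "diffusers/examples/dreambooth/train_dreambooth_lora.py" ...
```
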
## How Experiment Tracking Works with Git and DVC

In addition to your data and pipeline, DVC can version your experiments along
with Git. We use DVC for the larger files and Git for the smaller ones.

![Experiment tracking with Git and DVC](../uploads/images/2025-10-31/track-experiments-dvc-git.png '=600')

Together, all of these pieces represent an experiment and can be recorded as a
Git commit with a hash. This way, the experiment and all its modifications can
be reproduced using `git checkout` and `dvc checkout` with that hash (see the
experiment hash noted at the bottom).

![Each experiment can receive a hash](../uploads/images/2025-10-31/experiment-commit-hash.png '=600')

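Restoring an experiment then comes down to two commands; the commit hash below
is hypothetical:

```bash
# Restore the code, parameters, and DVC metadata recorded in that commit
$ git checkout 7f3a2c9

# Fetch the matching data and model files from the DVC cache
$ dvc checkout
```
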
## Converting from a Jupyter Notebook to a DVC Project

We will set up our Pokémon generator as a DVC project.

### Building the Pipeline

Here’s the approach I’ve taken to bring structure into experimentation:

1. **Start with a Base Model.** Use an off-the-shelf model as your foundation.
   Fine-tune it, adapt it, and make it your own, but always know what version
   you started from.
2. **Track Everything.** Every dataset, parameter, and code change should be
   versioned. We can use DVC for this. Think of it like Git for your machine
   learning workflow: commits that point not just to code, but to data and
   model states.
3. **Modularize the Workflow.** Break experiments into stages: data prep,
   training, evaluation, etc. That way, you can rerun only what changes instead
   of starting from scratch every time.
4. **Run Reproducible Experiments.** Each experiment should be captured so you
   can roll back, compare results, and build confidence in the best-performing
   model.
5. **Move Toward Production.** Once an experiment proves itself, package it
   into a pipeline that can run with a single command. That pipeline is what
   bridges the gap between “something interesting in a notebook” and “a
   reliable system in production.”

### Step 1: Define the Pipeline

Start by breaking your workflow into stages. For example, in this project the
`dvc.yaml` looks like this:

```yaml
stages:
  set_up_diffusers:
    cmd: |
      git clone --depth 1 --branch v0.14.0 https://github.com/huggingface/diffusers.git diffusers
      pip3.10 install -r "diffusers/examples/dreambooth/requirements.txt"
      accelerate config default
    outs:
      - diffusers:
          cache: false
  scrape_pokemon_images:
    cmd: python3 src/scrape_pokemon_images.py --params params.yaml
    deps:
      - src/scrape_pokemon_images.py
    outs:
      - data/external/pokemon
  download_pokemon_stats:
    cmd:
      kaggle datasets download -d brdata/complete-pokemon-dataset-gen-iiv -f
      Pokedex_Cleaned.csv -p data/external/
    outs:
      - data/external/Pokedex_Cleaned.csv
  resize_pokemon_images:
    cmd: python3 src/resize_pokemon_images.py --params params.yaml
    deps:
      - src/resize_pokemon_images.py
      - data/external/pokemon
      - data/external/Pokedex_Cleaned.csv
    outs:
      - data/processed/pokemon
    params:
      - base
      - data_etl
  train_lora:
    cmd: >
      accelerate launch --mps
      "diffusers/examples/dreambooth/train_dreambooth_lora.py"
      --pretrained_model_name_or_path=${train_lora.base_model}
      --instance_data_dir=${data_etl.train_data_path}
      --output_dir=${train_lora.lora_path} --instance_prompt='a pkmnlora
      pokemon' --resolution=512 --train_batch_size=1
      --gradient_accumulation_steps=1 --checkpointing_steps=500
      --learning_rate=${train_lora.learning_rate} --lr_scheduler='cosine'
      --lr_warmup_steps=0 --max_train_steps=${train_lora.max_train_steps}
      --seed=${train_lora.seed}
    deps:
      - diffusers
      - data/processed/pokemon
    outs:
      - models/pkmnlora
    params:
      - data_etl
      - train_lora
  generate_text_to_image:
    cmd: python3 src/generate_text_to_image.py --params params.yaml
    outs:
      - outputs
    deps:
      - src/generate_text_to_image.py
      - models/pkmnlora
    params:
      - train_lora
      - generate_text_to_image
```

Each stage declares:

- **Command (`cmd`)** – what to run
- **Dependencies (`deps`)** – inputs the stage needs
- **Outputs (`outs`)** – files the stage produces

This way, when you change a dependency (e.g., a new dataset or updated
parameter), only the affected stages re-run.

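Before rerunning anything, you can ask DVC which stages are affected. A
hypothetical `dvc status` after the processed images changed might report
something like:

```bash
$ dvc status
train_lora:
    changed deps:
        modified:           data/processed/pokemon
```
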
### Step 2: Track Parameters

Instead of hardcoding hyperparameters, keep them in a structured file like
`params.yaml`:

```yaml
base:
  train_pokemon_type: all

data_etl:
  external_data_path: 'data/external/'
  train_data_path: 'data/processed/pokemon'

train_lora:
  seed: 1337
  model_directory: 'models'
  base_model: 'runwayml/stable-diffusion-v1-5'
  lora_path: 'models/pkmnlora'
  learning_rate: 0.0001
  max_train_steps: 15000

generate_text_to_image:
  seed: 3000
  num_inference_steps: 35
  batch_size: 1
  batch_count: 20
  prompt: 'a pkmnlora pokemon'
  negative_prompt: ''
  output_directory: 'outputs'
  use_lora: True
```

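The stage scripts can then read these values instead of hardcoding them. Here
is a minimal sketch of how a script like `src/resize_pokemon_images.py` might
load its parameters; the actual scripts in the repo may differ:

```python
import argparse

import yaml  # PyYAML


def load_params(path: str) -> dict:
    """Read the pipeline parameters from params.yaml."""
    with open(path) as f:
        return yaml.safe_load(f)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--params", default="params.yaml")
    args = parser.parse_args()

    params = load_params(args.params)
    external_path = params["data_etl"]["external_data_path"]
    train_path = params["data_etl"]["train_data_path"]
    pokemon_type = params["base"]["train_pokemon_type"]
    # ... resize the images in external_path and write them to train_path ...
```
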
Now you can run controlled experiments:

```bash
$ dvc exp run -S train_lora.learning_rate=0.01
```

This will execute the pipeline with the updated parameter, track the run, and
save the results.

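If you want to sweep several values, you can also queue runs and execute them
as a batch; `dvc queue start` is how recent DVC versions process the queue:

```bash
$ dvc exp run --queue -S train_lora.learning_rate=0.0001
$ dvc exp run --queue -S train_lora.learning_rate=0.001
$ dvc queue start
```
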
### Step 3: Track Experiments

For this Pokémon project, tracking metrics is less relevant because the results
are images that are graded subjectively. But for projects where you do track
metrics, with pipelines defined and parameters externalized, you can compare
experiments systematically:

```bash
$ dvc exp show
```

Example output:

| Experiment | train.learning_rate | train.epochs | Accuracy | Loss |
| ---------- | ------------------- | ------------ | -------- | ---- |
| baseline   | 0.001               | 10           | 0.82     | 0.41 |
| exp-1234   | 0.01                | 10           | 0.85     | 0.37 |
| exp-5678   | 0.001               | 20           | 0.84     | 0.39 |

This makes it easy to see how parameter changes affect performance—without
losing reproducibility.

### Step 4: Move Toward Production

Once you’re confident in a pipeline:

1. Lock the configuration – commit your `dvc.yaml` and `params.yaml`.
2. Version your data – every dataset version is tracked (no guessing which CSV
   was used).
3. Promote a model – move the best checkpoint into a `production/` folder or a
   model registry.

Then your entire workflow can be reproduced with a single command:

```bash
$ dvc repro
```

That runs the whole pipeline—data prep, training, evaluation—with the exact same
inputs and parameters.

## Lessons Learned

- **Reproducibility = productivity**. You spend less time debugging “mystery
  results.”
- **Experiment tracking is collaborative**. Colleagues can see exactly what you
  tried, what worked, and what didn’t.
- **Pipelines scale**. What starts as a notebook prototype can evolve into a
  production-ready workflow.

## Final Thoughts

Experimentation will always be messy—but pipelines don’t have to be. By
structuring workflows into reproducible pipelines, you get the freedom to
explore while ensuring you can always reproduce and explain your results. If
you’d like to try this yourself, check out the
[example pipeline](https://github.com/RCdeWit/sd-pokemon-generator) repo and the
[docs](https://dvc.org/doc) for more info on building workflows specific to your
project.

---

📰 [Join our Newsletter](https://share.hsforms.com/1KRL5_dTbQMKfV7nDD6V-8g4sbyq)
to stay up to date with news and contributions from the Community!

content/uploads.dvc

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
 outs:
-- md5: 1ec64a68f448517d70625e2a6ed40bb4.dir
-  size: 365509886
-  nfiles: 709
+- md5: 458e6151a0b25573f28656cd0d2f4f35.dir
+  size: 365942786
+  nfiles: 720
   hash: md5
   path: uploads

package.json

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@
   },
   "homepage": "https://github.com/iterative/dvc.org#readme",
   "engines": {
-    "node": ">=18.x <=22.x"
+    "node": "22.x"
   },
   "private": true,
   "workspaces": [
