---
title:
  'Transforming a Jupyter Notebook into a Reproducible Pipeline for Experiments
  with DVC'
date: 2025-10-31
description: >
  Rob De Wit shares his Pokémon Generator project to demonstrate how you can
  move from a Jupyter Notebook prototype to a production-ready pipeline with
  DVC.
descriptionLong: >
  This blog post is an adaptation of Rob De Wit’s presentation on the subject
  using his Pokémon Generator project at PyData USA 2023. You can find [the
  video here](https://www.youtube.com/watch?v=sDhpIZQXe-w) and [the repo of the
  project here](https://github.com/RCdeWit/sd-pokemon-generator).

picture: 2025-10-31/jupyter-to-dvc-cover.png
pictureComment:
  Learn how to transform your Jupyter Notebook prototype into a production-ready
  DVC pipeline.
authors:
  - rob_dewit
tags:
  - Open Source
  - DVC
  - Tutorial
  - Guide
  - Jupyter Notebook
  - Pokémon
  - LoRA
---

When we experiment with machine learning models, it’s easy to get lost in the
cycle of trying new parameters, swapping datasets, or adjusting architectures.
That’s how progress is made, but without structure, reproducibility, and
tracking, you risk losing valuable results or being unable to explain why a
model worked (or failed).

In this post, I use a Pokémon generator I created with
[LoRA](https://huggingface.co/docs/diffusers/training/lora) (Low-Rank Adaptation
of Large Language Models) to demonstrate how I approach turning one-off
prototypes into structured, reproducible pipelines using versioned data,
parameters, and experiments.

## Why Reproducibility Matters

Reproducibility is the backbone of science, and machine learning is no
different. A reproducible experiment means:

- The same combination of **data, code, and parameters** produces the same
  result.
- You can trace back decisions: **which dataset, which hyperparameters, which
  preprocessing steps**.
- **Collaboration** can be achieved. Your colleagues (and your future self!) can
  understand and build upon your work.

When I worked on earlier projects, we often had to reconstruct models after the
fact, trying to remember what went into training them. That experience convinced
me that reproducibility should be built into every pipeline from day one. That’s
why I treat every experiment as a **deterministic combination of code + data +
parameters**, and I build pipelines that make this explicit.

## Moving from a Jupyter Notebook to a Reproducible Pipeline

Jupyter notebooks are a great tool for prototyping data science projects, but
they are not the best choice when you need to reproduce your results. Part of
what makes them great is the ability to easily change cells and re-run sections,
visualize data in-line with code, and share analysis alongside narrative text.
On the flip side, those same benefits can break down your ability to accurately
reproduce results. Notebooks are also challenging to test, to scale, and to
manage dependencies for. So how can we set up our pipeline for reproducible
success?

## Enter DVC

If you’ve ever tried to manage large files with Git, you have probably realized
that Git by itself is not sufficient. DVC operates like Git, but for large data,
models, and your machine learning experimentation process, versioning everything
along with the code. Let’s see how this works under the hood.

## How DVC Tracks Your Data

In your Git repository, you have a main branch with commits and a branch with a
dataset, in this case a dataset of Pokémon images.

As image data are large, we do not want to keep them in Git, so DVC replaces
them with a metadata file. The metadata file for the dataset contains the hash,
the size, the number of files, and other data.
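
Here is roughly what such a metadata file looks like for an image directory
tracked with `dvc add`; the hash and sizes below are illustrative rather than
taken from the actual project:

```yaml
# data/external/pokemon.dvc (illustrative values)
outs:
  - md5: 3f2a6b8c9d0e1f2a3b4c5d6e7f8a9b0c.dir # hash of the directory contents
    size: 64128504 # total size of the images in bytes
    nfiles: 1222 # number of files in the directory
    path: pokemon # the images themselves live in .dvc/cache, not in Git
```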

The hash stored in Git points to the .dvc/cache, which is where the physical
images are actually stored on your file system.

If you create another commit with a different dataset (noted by a different font
on the left in the Git repo), a new hash will point to the new dataset in the
.dvc/cache. In this case, one image was removed and one was added, while two
stayed the same.
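
Updating to a new dataset version is then just a couple of commands. This
sketch assumes the directory is tracked directly with `dvc add`; in this project
the images are actually produced by a pipeline stage, which DVC tracks the same
way through `dvc.lock`:

```bash
# Re-hash the updated image directory and rewrite the small metadata file
$ dvc add data/external/pokemon

# Commit only the metadata; the images themselves stay in the .dvc/cache
$ git add data/external/pokemon.dvc
$ git commit -m "Update Pokémon dataset"
```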

## How a DVC Pipeline Works

Below you will find the sections of the Jupyter Notebook on the left. Each of
these stages produces outputs as seen on the right.

As these outputs are now specified, they can be used as downstream dependencies
for other stages.

So if the `train_lora` stage depends on the processed images, we can ensure that
the stage only triggers once there are new images in the processed directory.
Additionally, if we make a change in the `train_lora` stage, DVC will not need
to re-run any of the earlier stages that did not change, saving you development
time.
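
Under the hood, DVC decides what to re-run by comparing the hashes of each
stage’s dependencies and outputs against what it recorded during the last run.
The commands you would typically reach for here are:

```bash
# Show which stages have out-of-date dependencies or outputs
$ dvc status

# Reproduce train_lora, re-running only the upstream stages that changed
$ dvc repro train_lora
```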

## How Experiment Tracking Works with Git and DVC

In addition to your data and pipeline, DVC can version your experiments along
with Git. We use DVC for the larger files and Git for the smaller files.

All of these things together represent an experiment and can be recorded as a
Git commit with a hash. This way, the experiment and all its modifications can
be reproduced using `git checkout` and `dvc checkout` with that hash (see the
experiment hash noted at the bottom).
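
Restoring an experiment is then a two-step checkout; the hash below is a
placeholder for whichever commit recorded the experiment you want back:

```bash
# Restore the code, parameters, and metadata files from that commit
$ git checkout <experiment-hash>

# Restore the matching data and model files from the .dvc/cache
$ dvc checkout
```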

## Converting from a Jupyter Notebook to a DVC Project

We will set up our Pokémon generator as a DVC project.

### Building the Pipeline

Here’s the approach I’ve taken to bring structure into experimentation:

1. **Start with a Base Model.** Use an off-the-shelf model as your foundation.
   Fine-tune it, adapt it, and make it your own, but always know what version
   you started from.
2. **Track Everything.** Every dataset, parameter, and code change should be
   versioned. We can use DVC for this (see the setup sketch after this list).
   Think of it like Git for your machine learning workflow: commits that point
   not just to code, but to data and model states.
3. **Modularize the Workflow.** Break experiments into stages: data prep,
   training, evaluation, etc. That way, you can rerun only what changes instead
   of starting from scratch every time.
4. **Run Reproducible Experiments.** Each experiment should be captured so you
   can roll back, compare results, and build confidence in the best-performing
   model.
5. **Move Toward Production.** Once an experiment proves itself, package it into
   a pipeline that can run with a single command. That pipeline is what bridges
   the gap between “something interesting in a notebook” and “a reliable system
   in production.”
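
The setup sketch mentioned above is a one-time step that the rest of this
walkthrough assumes has already happened: initializing DVC inside the existing
Git repository.

```bash
# One-time setup: initialize DVC alongside Git in the project repository
$ dvc init
$ git commit -m "Initialize DVC"
```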

### Step 1: Define the Pipeline

Start by breaking your workflow into stages. For example, in this project the
`dvc.yaml` looks like this:

```yaml
stages:
  set_up_diffusers:
    cmd: |
      git clone --depth 1 --branch v0.14.0 https://github.com/huggingface/diffusers.git diffusers
      pip3.10 install -r "diffusers/examples/dreambooth/requirements.txt"
      accelerate config default
    outs:
      - diffusers:
          cache: false
  scrape_pokemon_images:
    cmd: python3 src/scrape_pokemon_images.py --params params.yaml
    deps:
      - src/scrape_pokemon_images.py
    outs:
      - data/external/pokemon
  download_pokemon_stats:
    cmd:
      kaggle datasets download -d brdata/complete-pokemon-dataset-gen-iiv -f
      Pokedex_Cleaned.csv -p data/external/
    outs:
      - data/external/Pokedex_Cleaned.csv
  resize_pokemon_images:
    cmd: python3 src/resize_pokemon_images.py --params params.yaml
    deps:
      - src/resize_pokemon_images.py
      - data/external/pokemon
      - data/external/Pokedex_Cleaned.csv
    outs:
      - data/processed/pokemon
    params:
      - base
      - data_etl
  train_lora:
    cmd: >
      accelerate launch --mps
      "diffusers/examples/dreambooth/train_dreambooth_lora.py"
      --pretrained_model_name_or_path=${train_lora.base_model}
      --instance_data_dir=${data_etl.train_data_path}
      --output_dir=${train_lora.lora_path} --instance_prompt='a pkmnlora
      pokemon' --resolution=512 --train_batch_size=1
      --gradient_accumulation_steps=1 --checkpointing_steps=500
      --learning_rate=${train_lora.learning_rate} --lr_scheduler='cosine'
      --lr_warmup_steps=0 --max_train_steps=${train_lora.max_train_steps}
      --seed=${train_lora.seed}
    deps:
      - diffusers
      - data/processed/pokemon
    outs:
      - models/pkmnlora
    params:
      - data_etl
      - train_lora
  generate_text_to_image:
    cmd: python3 src/generate_text_to_image.py --params params.yaml
    outs:
      - outputs
    deps:
      - src/generate_text_to_image.py
      - models/pkmnlora
    params:
      - train_lora
      - generate_text_to_image
```

Each stage declares:

- **Command (cmd)** – what to run
- **Dependencies (deps)** – inputs the stage needs
- **Outputs (outs)** – files the stage produces

This way, when you change a dependency (e.g., a new dataset or an updated
parameter), only the affected stages re-run.

### Step 2: Track Parameters

Instead of hardcoding hyperparameters, keep them in a structured file like
`params.yaml`:

```yaml
base:
  train_pokemon_type: all

data_etl:
  external_data_path: 'data/external/'
  train_data_path: 'data/processed/pokemon'

train_lora:
  seed: 1337
  model_directory: 'models'
  base_model: 'runwayml/stable-diffusion-v1-5'
  lora_path: 'models/pkmnlora'
  learning_rate: 0.0001
  max_train_steps: 15000

generate_text_to_image:
  seed: 3000
  num_inference_steps: 35
  batch_size: 1
  batch_count: 20
  prompt: 'a pkmnlora pokemon'
  negative_prompt: ''
  output_directory: 'outputs'
  use_lora: True
```

Now you can run controlled experiments:

```bash
$ dvc exp run -S train_lora.learning_rate=0.01
```

This will execute the pipeline with the updated parameter, track the run, and
save results.
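
Each stage script then picks its values up from that file via the
`--params params.yaml` flag seen in `dvc.yaml`. A minimal sketch of what that
handling might look like inside a script such as `src/generate_text_to_image.py`,
assuming PyYAML; the actual implementation in the repo may differ:

```python
import argparse

import yaml

# Parse the --params flag that every pipeline stage receives
parser = argparse.ArgumentParser()
parser.add_argument("--params", default="params.yaml")
args = parser.parse_args()

# Load the full parameter file and keep only this stage's section
with open(args.params) as f:
    params = yaml.safe_load(f)

config = params["generate_text_to_image"]
print(config["prompt"], config["num_inference_steps"])
```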

### Step 3: Track Experiments

For this Pokémon project, this is less relevant because the results are images
that have to be graded subjectively. But for projects where you’re tracking
metrics, with the pipeline defined and parameters externalized, you can compare
experiments systematically:

```bash
$ dvc exp show
```

Example output:

| Experiment | train.learning_rate | train.epochs | Accuracy | Loss |
| ---------- | ------------------- | ------------ | -------- | ---- |
| baseline   | 0.001               | 10           | 0.82     | 0.41 |
| exp-1234   | 0.01                | 10           | 0.85     | 0.37 |
| exp-5678   | 0.001               | 20           | 0.84     | 0.39 |

This makes it easy to see how parameter changes affect performance—without
losing reproducibility.
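
When one of those rows looks like a winner, you can inspect it and bring it into
your workspace; the experiment name below refers to the example table above:

```bash
# Compare an experiment against the current workspace
$ dvc exp diff exp-1234

# Apply that experiment's code, parameters, and outputs to the workspace
$ dvc exp apply exp-1234
```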

### Step 4: Move Toward Production

Once you’re confident in a pipeline:

1. Lock the configuration – commit your `dvc.yaml` and `params.yaml`.
2. Version your data – every dataset version is tracked (no guessing which CSV
   was used).
3. Promote a model – move the best checkpoint into a `production/` folder or a
   model registry.

Then your entire workflow can be reproduced with a single command:

```bash
$ dvc repro
```

That runs the whole pipeline—data prep, training, evaluation—with the exact same
inputs and parameters.
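
The post stops at local reproduction, but in practice you would usually also
configure a DVC remote so that teammates and CI can pull the exact same data and
models; the bucket URL below is only a placeholder:

```bash
# One-time: register a default remote for the DVC cache (placeholder URL)
$ dvc remote add -d storage s3://my-bucket/dvc-store

# Upload the cached data and models referenced by the current commit
$ dvc push
```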

## Lessons Learned

- **Reproducibility = productivity**. You spend less time debugging “mystery
  results.”
- **Experiment tracking is collaborative**. Colleagues can see exactly what you
  tried, what worked, and what didn’t.
- **Pipelines scale**. What starts as a notebook prototype can evolve into a
  production-ready workflow.

## Final Thoughts

Experimentation will always be messy—but pipelines don’t have to be. By
structuring workflows into reproducible pipelines, you get the freedom to
explore while ensuring you can always reproduce and explain your results. If
you’d like to try this yourself, check out the
[example pipeline](https://github.com/RCdeWit/sd-pokemon-generator) repo and the
[docs](https://dvc.org/doc) for more info on building workflows specific to your
project.

---

📰 [Join our Newsletter](https://share.hsforms.com/1KRL5_dTbQMKfV7nDD6V-8g4sbyq)
to stay up to date with news and contributions from the Community!