Commit 575e8ea

Improve Chapter 2.5 on Modeling (#30)
* feat: improve chapter 2.5 on modeling

  This commit enhances chapter 2.5 of the MLOps Coding Course with the following improvements:

  - **Richer Content**: The explanations for pipelines, data processing, caching, hyperparameter tuning, cross-validation, and model retraining have been expanded to be more detailed and easier to understand.
  - **Improved Structure**: The content has been reorganized for a more logical flow, and emojis have been added to section titles for better visual navigation.
  - **Key Takeaways**: A new "Key Takeaways" section has been added at the end of the chapter to summarize the main points and reinforce learning.

* review

---------

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: Médéric Hurier (Fmind) <[email protected]>
1 parent 8358605 commit 575e8ea


docs/2. Prototyping/2.5. Modeling.md

Lines changed: 61 additions & 28 deletions
@@ -2,15 +2,15 @@
description: Learn how to build, refine, and compare machine learning models directly within notebooks, covering everything from initial prototypes to model selection and hyperparameter tuning.
---

-# 2.5. Modeling
+# 💃 2.5. Modeling

-## What are pipelines?
+## 🤔 What are pipelines?

[Pipelines in machine learning](https://scikit-learn.org/stable/modules/compose.html#pipeline) provide a streamlined way to organize sequences of data preprocessing and modeling steps. They encapsulate a series of data transformations followed by the application of a model, facilitating both simplicity and efficiency in the development process. Pipelines can be broadly categorized as follows:

-- **Model Pipeline**: Focuses specifically on sequences related to preparing data for machine learning models and applying these models. For instance, scikit-learn's [`Pipeline`](https://scikit-learn.org/stable/modules/compose.html#pipeline) class allows for chaining [preprocessors](https://scikit-learn.org/stable/modules/preprocessing.html) and [estimators](https://scikit-learn.org/stable/developers/develop.html).
-- **Data Pipeline**: Encompasses a wider scope, including steps for data gathering, cleaning, and transformation. Tools such as [Prefect](https://www.prefect.io/) and [ZenML](https://docs.zenml.io/user-guide/starter-guide/create-an-ml-pipeline) offer capabilities for building comprehensive data pipelines.
-- **Orchestration Pipeline**: Targets the automation of a series of tasks, including data and model pipelines, ensuring they execute in an orderly fashion or under specific conditions. Examples include [Apache Airflow](https://airflow.apache.org/) for creating directed acyclic graphs (DAGs) and [Vertex AI for managing ML workflows](https://cloud.google.com/vertex-ai/docs/pipelines/introduction).
+- **Model Pipeline**: Focuses on preparing data for and applying machine learning models. A typical example is a `scikit-learn` pipeline that chains preprocessing steps (like scaling and encoding) with a final estimator (like a classifier or regressor).
+- **Data Pipeline**: Covers a broader range of data-related tasks, including extraction, transformation, and loading (ETL). These pipelines gather data from various sources, clean and process it, and load it into a destination like a data warehouse. Tools like `Prefect` and `ZenML` are designed for building robust data pipelines.
+- **Orchestration Pipeline**: Manages and automates a series of tasks, which can include both data and model pipelines. Orchestration tools ensure that tasks run in the correct order, handle dependencies, and manage resources. `Apache Airflow` and `Vertex AI` are popular examples for creating and managing complex workflows.

For the purposes of this discussion, we'll focus on **model pipelines**, which are crucial for efficiently prototyping machine learning solutions. The code examples are based on the [scikit-learn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), as this toolkit is simple to understand and its concepts can be generalized to other types of pipelines like [Dagster](https://dagster.io/), [Prefect](https://www.prefect.io/), or [Metaflow](https://metaflow.org/).

@@ -40,25 +40,27 @@ draft = pipeline.Pipeline(

![Model pipeline](../img/models/pipeline.png)

-## Why do you need to use a pipeline?
+## 🚀 Why do you need to use a pipeline?

Implementing pipelines in your machine learning projects offers several key advantages:

-- **Prevents Data Leakage during preprocessing**: By ensuring data preprocessing steps are applied correctly during model training and validation, pipelines help maintain the integrity of your data.
-- **Simplifies Cross-Validation and Hyperparameter Tuning**: Pipelines facilitate the application of transformations to data subsets appropriately during procedures like cross-validation, ensuring accurate and reliable model evaluation.
-- **Ensures Consistency**: Pipelines guarantee that the same preprocessing steps are executed in both the model training and inference phases, promoting consistency and reliability in your ML workflow.
+- **Prevents Data Leakage**: Pipelines ensure that preprocessing steps are fitted only on the training set within each fold of cross-validation and then applied to the validation set. This prevents information from the validation set from "leaking" into the training process, which can lead to overly optimistic performance estimates.
+- **Simplifies Cross-Validation**: With a pipeline, you can treat a sequence of preprocessing steps and a model as a single object. This makes it much easier to perform cross-validation, as you only need to call `fit` and `predict` on the pipeline itself, rather than on each individual component.
+- **Ensures Consistency**: A pipeline guarantees that the same preprocessing steps are applied to both your training data and any new data you want to make predictions on. This consistency is crucial for ensuring that your model behaves as expected in production.
+- **Improves Reproducibility**: By encapsulating the entire workflow, pipelines make it easier for others (and your future self) to reproduce your results.

-Pipelines thus represent an essential tool in the machine learning toolkit, streamlining the model development process and enhancing model performance and evaluation.
+In short, pipelines are a fundamental tool for building robust and reliable machine learning models.

-## Why do you need to process inputs by type?
+## 🎨 Why do you need to process inputs by type?

-Different data types typically require distinct preprocessing steps to prepare them effectively for machine learning models:
+Different data types require distinct preprocessing steps to be effectively used by machine learning models. Here’s a breakdown of why and how:

-- **Numerical Features** may benefit from scaling or normalization to ensure that they're on a similar scale.
-- **Categorical Features** often require encoding (e.g., OneHotEncoding) to transform them into a numerical format that models can understand.
-- **Datetime Features** might be broken down into more granular components (e.g., year, month, day) to capture temporal patterns more effectively.
+- **Numerical Features**: These features often need to be scaled to a common range (e.g., 0 to 1) or standardized to have a mean of 0 and a standard deviation of 1. This is important for algorithms that are sensitive to the scale of the input features, such as Support Vector Machines (SVMs) and k-Nearest Neighbors (k-NN).
+- **Categorical Features**: Most machine learning models work only with numerical data, so categorical features (like `red`, `green`, `blue`) must be converted into a numerical format. Common techniques include one-hot encoding, which creates a new binary feature for each category, and ordinal encoding, which assigns a unique integer to each category.
+- **Datetime Features**: These features are rich in information but need to be broken down into more granular components that models can understand. For example, you can extract the year, month, day of the week, or even the hour of the day to capture temporal patterns.
+- **Text Features**: Text data needs to be converted into a numerical representation, often using techniques like TF-IDF or word embeddings (e.g., Word2Vec, GloVe).

-Utilizing [scikit-learn's `ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html), you can specify different preprocessing steps for different columns of your data, ensuring that each type is handled appropriately.
+The [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) in `scikit-learn` is a powerful tool that allows you to apply different preprocessing steps to different columns of your data. This ensures that each feature type is handled appropriately, which is crucial for building accurate and reliable models.

Example of [selecting features by type from a Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html):

@@ -70,9 +72,16 @@ num_features = X_train.select_dtypes(include=['number']).columns.tolist()
cat_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
```
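
To make the `ColumnTransformer` approach described above concrete, here is a minimal sketch that wires the selected feature lists into a preprocessing and modeling pipeline. The step names (`preprocessor`, `regressor`), the choice of `RandomForestRegressor`, and the `y_train` target are illustrative assumptions rather than the chapter's own example:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Apply a different transformer to each group of columns selected above.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),  # scale numerical features
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),  # encode categorical features
    ],
    remainder="drop",  # drop any column not listed above (e.g., free text or raw datetimes)
)

# Chain the preprocessing and the model into a single object.
draft = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("regressor", RandomForestRegressor(random_state=42)),
    ]
)
draft.fit(X_train, y_train)  # X_train as in the selection example above; y_train is an assumed target
```

Because the transformers are fitted inside the pipeline, they only ever see the training folds during cross-validation, which is exactly the data-leakage protection discussed earlier.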

-## What is the benefit of using a memory cache?
+## ⚡️ What is the benefit of using a memory cache?

-[Employing a memory cache with pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), such as the `memory` attribute in scikit-learn's `Pipeline`, offers significant performance benefits by caching the results of transformation steps. This approach is particularly advantageous during operations like grid search, where certain preprocessing steps are repeatedly executed across different parameter combinations. Caching can dramatically reduce computation time by avoiding redundant processing.
+[Employing a memory cache with pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), such as the `memory` attribute in `scikit-learn`'s `Pipeline`, can dramatically speed up your workflow. Caching stores the results of a transformation step, so that it doesn't have to be recomputed every time the pipeline is run.
+
+This is especially useful in scenarios like:
+
+- **Grid Search**: When performing a grid search, you are fitting the same pipeline multiple times with different hyperparameters. If your preprocessing steps are computationally expensive, caching them can save a significant amount of time.
+- **Iterative Development**: When you are experimenting with different models or hyperparameters, you often need to re-run your pipeline multiple times. Caching the initial transformation steps means you only pay the computational cost once.
+
+By caching the results of your transformers, you can make your development process faster and more efficient, allowing you to iterate more quickly and focus on the modeling aspect of your project.

Example of utilizing a memory cache with a pipeline:

@@ -99,9 +108,13 @@ draft = pipeline.Pipeline(

Even if you don't plan on using [scikit-learn pipeline abstraction](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), you can implement the same concept in your code base to obtain the same benefits.

-## How can you change the pipeline hyper-parameters?
+## 🛠️ How can you change the pipeline hyper-parameters?

-Adjusting hyper-parameters within a [scikit-learn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) can be achieved using the `set_params` method or by directly accessing parameters via the double underscore (`__`) notation. This flexibility allows you to fine-tune your model directly within the pipeline structure.
+Adjusting hyperparameters within a [scikit-learn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) is a common task, and `scikit-learn` provides a convenient way to do this using the `set_params` method. This method allows you to change the hyperparameters of any step in the pipeline, whether it's a transformer or a model.
+
+The key to `set_params` is the double underscore (`__`) notation. You use it to specify the name of the step, followed by the name of the hyperparameter you want to change. For example, `regressor__n_estimators` refers to the `n_estimators` hyperparameter of the `regressor` step.
+
+This is particularly useful when you want to programmatically change hyperparameters, for example, in a loop or as part of a grid search. It allows you to fine-tune your model directly within the pipeline structure, without having to manually recreate the pipeline every time you want to try a new set of hyperparameters.

Example of setting pipeline hyper-parameters:

@@ -119,9 +132,13 @@ pipeline = Pipeline([
pipeline.set_params(regressor__n_estimators=100, regressor__max_depth=10)
```

-## Why do you need to perform a grid search with your pipeline?
+## 🔍 Why do you need to perform a grid search with your pipeline?
+
+Conducting a [grid search over a pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) is a powerful technique for finding the best combination of hyperparameters for your model. It works by exhaustively searching through a specified set of hyperparameter values and evaluating each combination using cross-validation.

-Conducting [a grid search over a pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) is crucial for identifying the optimal combination of model hyper-parameters. This exhaustive search evaluates various parameter combinations across your dataset, using cross-validation to ensure robust assessment of model performance.
+When you perform a grid search on a pipeline, you can tune the hyperparameters of both the preprocessing steps and the model itself. This is important because the optimal hyperparameters for your model may depend on the preprocessing steps you apply.
+
+For example, you could search over different encoding strategies for your categorical features, different scaling methods for your numerical features, and different hyperparameters for your model, all at the same time. This allows you to find the best possible combination of preprocessing and modeling choices for your specific problem.

Example of performing grid search with a pipeline:

@@ -144,9 +161,13 @@ search = GridSearchCV(
search.fit(inputs_train, targets_train)
```
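
To illustrate the point above about searching over preprocessing choices and model hyperparameters at the same time, here is a minimal sketch. It reuses the illustrative `draft` pipeline (with `preprocessor` and `regressor` steps) and the `X_train`/`y_train` variables from the earlier sketch, so the names are assumptions rather than the chapter's own code:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# The grid can swap whole preprocessing components as well as tune model hyperparameters.
param_grid = {
    "preprocessor__num": [StandardScaler(), MinMaxScaler()],  # try two scalers for numerical features
    "regressor__n_estimators": [50, 100, 200],
    "regressor__max_depth": [5, 10, None],
}

search = GridSearchCV(draft, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)  # best combination of preprocessing and model settings
```

Every candidate in the grid is evaluated with cross-validation, so the comparison between preprocessing variants benefits from the same leakage protection as the model tuning itself.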

-## Why do you need to perform cross-validation with your pipeline?
+## 📊 Why do you need to perform cross-validation with your pipeline?
+
+[Cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) is a powerful technique for assessing how well your model will generalize to unseen data. It works by splitting your training data into a number of "folds," and then training and evaluating your model multiple times, using a different fold as the validation set each time.
+
+When you use cross-validation with a pipeline, you ensure that the entire workflow, including preprocessing, is evaluated within each fold. This gives you a more reliable estimate of your model's performance than a simple train-test split.

-[Cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) is a fundamental technique in the validation process of machine learning models, enabling you to assess how well your model is likely to perform on unseen data. By integrating cross-validation into your pipeline, you can ensure a thorough evaluation of your model's performance, mitigating the risk of overfitting and underfitting.
+`scikit-learn`'s `GridSearchCV` automatically performs cross-validation for you, but you can also use other cross-validation strategies, such as `TimeSeriesSplit` for time-series data, or `StratifiedKFold` for classification problems with imbalanced classes. The `cv` parameter in `GridSearchCV` allows you to specify the cross-validation strategy that is most appropriate for your problem.

When utilizing [`GridSearchCV` from scikit-learn for hyperparameter tuning](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), the `cv` parameter plays a crucial role in defining the cross-validation splitting strategy. This flexibility allows you to tailor the cross-validation process to the specific needs of your dataset and problem domain, ensuring that the model evaluation is both thorough and relevant.

@@ -160,11 +181,13 @@ Here’s a breakdown of how you can control the cross-validation behavior throug

- **Iterable**: An iterable yielding train/test splits as arrays of indices directly specifies the data partitions for each fold. This option offers maximum flexibility, allowing for completely custom splits based on external logic or considerations (e.g., predefined groups or stratifications not captured by the standard splitters).
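
To illustrate the `TimeSeriesSplit` and `StratifiedKFold` strategies mentioned above, here is a minimal sketch of passing a custom splitter through the `cv` parameter. The `draft` pipeline, `param_grid`, and training variables are the illustrative objects from the earlier sketches, not the chapter's own code:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, TimeSeriesSplit

# For ordered data, each validation fold comes strictly after its training folds.
time_cv = TimeSeriesSplit(n_splits=5)

# For imbalanced classification problems, every fold keeps the original class proportions.
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Any splitter object (or a plain integer) can be passed through the `cv` parameter.
search = GridSearchCV(draft, param_grid, cv=time_cv)  # stratified_cv would suit a classification pipeline
search.fit(X_train, y_train)
```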

-## Do you need to retrain your pipeline? Should you use the full dataset?
+## 🔄 Do you need to retrain your pipeline? Should you use the full dataset?

-After identifying the best model and hyper-parameters through grid search and cross-validation, it's common practice to retrain your model on the entire dataset. This approach allows you to leverage all available data, maximizing the model's learning and potentially enhancing its performance when making predictions on new, unseen data.
+After identifying the best model and hyperparameters through grid search and cross-validation, it is a common practice to retrain the model on the entire dataset. This is because the more data a model is trained on, the better it is likely to perform.

-Retraining your model on the full dataset takes advantage of the insights gained during the model selection process, ensuring that the final model is as robust and well-tuned as possible.
+By default, `scikit-learn`'s `GridSearchCV` automatically refits the best model on all of the data passed to its `fit` method once the search is complete. This means that the `best_estimator_` attribute of the `GridSearchCV` object is a model that has been trained on all of the available training data.
+
+However, it's important to remember that once you've retrained your model on the full dataset, you no longer have a separate validation set to evaluate its performance. This is why it's so important to have a robust cross-validation strategy in the first place. The cross-validation results give you a good estimate of how well your final model will perform on unseen data.

Example of retraining your pipeline on the full dataset:

@@ -185,7 +208,17 @@ In this way, the final model embodies the culmination of your exploratory work,

It's important to note, however, that while retraining on the full dataset can improve performance, it also eliminates the possibility of evaluating the model on unseen data unless additional, separate validation data is available. Therefore, the decision to retrain should be made with consideration of how model performance will be assessed and validated post-retraining.

-## Modeling additional resources
+## 🔑 Key Takeaways
+
+- **Pipelines are Essential**: They streamline the workflow, prevent data leakage, and simplify complex sequences of transformations and modeling.
+- **Process by Type**: Always preprocess features based on their data type (numerical, categorical, etc.) to ensure models can interpret them correctly.
+- **Cache for Speed**: Use memory caching in pipelines to avoid redundant computations, especially during hyperparameter tuning, which significantly speeds up the process.
+- **Tune Hyperparameters**: Use `set_params` and grid search to systematically find the best hyperparameters for your pipeline.
+- **Cross-Validate Thoroughly**: Employ cross-validation to get a reliable estimate of your model's performance on unseen data.
+- **Retrain on Full Data**: After finding the best model, retrain it on the entire dataset to maximize its predictive power.
+- **Scikit-learn is Powerful**: `scikit-learn` provides a comprehensive toolkit for building, evaluating, and tuning machine learning pipelines.
+
+## 📚 Additional resources

- **[Example from the MLOps Python Package](https://github.com/fmind/mlops-python-package/blob/main/notebooks/prototype.ipynb)**
- [Supervised learning](https://scikit-learn.org/stable/supervised_learning.html)

0 commit comments
