- 1. Introduction
- 2. Description
- 3. Training data
- 4. Components
- 5. IAM
- 6. Quickstart
- 7. Call for contributions
- 8. License
## 1. Introduction

An Azure Machine Learning (AML) pipeline is an independently executable workflow of a complete machine learning task. An AML pipeline helps standardize best practices for producing a machine learning model, enables the team to run at scale, and improves model-building efficiency.

The core of a machine learning pipeline is breaking an entire machine learning task into a multi-step workflow. Each step is a manageable component that can be individually developed, optimized, configured, and automated. The steps are connected through well-defined interfaces, and the AML pipeline service automatically organizes all dependencies between them. This modular approach brings two key benefits:

- It standardizes the practice of machine learning operations (MLOps) and supports scalable team collaboration.
- It improves training efficiency and reduces cost.
## 2. Description

In this use case, we provide an easy-to-adapt and scalable approach to implementing a tabular ML solution.

The following diagram shows the detailed structure of a simple ML workflow:

Several components need to be discussed in order to keep track of every step of the workflow:
- Code is stored in a repository, and unit tests should be performed every time a new change occurs.
- Data is stored in a container, and every time the training pipeline is triggered, it is fetched by a component to create an `MLTable` checkpoint (see next section).
- Model training and evaluation are performed using a stratified K-Fold strategy, inside a Bayesian hyperparameter optimisation procedure.
- Loss function, evaluation metrics and results are monitored using MLflow. Instead of the hyperparameter tuning functionality built into AzureML, a lower-level solution based on scikit-optimize is recommended.
- A checkpoint is created in the model registry with the best hyperparameter configuration, ready to be deployed.
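The stratified K-Fold idea used in the training step can be sketched as follows. This is a minimal, standard-library illustration of the splitting strategy; a real pipeline would use scikit-learn's `StratifiedKFold`:

```python
# Minimal sketch of stratified K-Fold splitting: samples of each class are
# distributed round-robin across folds so every fold keeps roughly the same
# class proportions as the full dataset.
from collections import defaultdict

def stratified_kfold(labels, k):
    """Yield k (train_idx, val_idx) splits preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)  # round-robin per class
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds)
                       if j != i for idx in fold)
        yield train, val
```

Each of the `k` validation folds can then be scored inside the Bayesian optimisation loop, with metrics logged to MLflow.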
## 3. Training data

Data Assets are an ideal resource for storing, managing and distributing data, because they create an object that is robust to scenarios such as the following:

- A breaking change to the data (for example, a renamed column, a type change, added or removed columns, or an encoding change) would otherwise force every consumer of the data to independently update their Python code.
- If the data size increases and becomes too large for Pandas to process, all consumers of the data would otherwise need to switch to a more scalable library (PySpark/Dask).
Azure Machine Learning Tables allow the creator of a data asset to define the materialization blueprint in a single file. Consumers can then easily materialize the data into a data frame, without writing their own Python parsing logic. The creator of the data asset defines an `MLTable` file.

We recommend co-locating the `MLTable` file with the underlying data, for example:
```
├── <container-name>
│   ├── MLTable
│   ├── file_1.csv
│   ├── ...
│   └── file_n.csv
```
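As an illustration, a minimal `MLTable` file for the layout above might look like the following; the delimiter and header settings are assumptions about the CSV files:

```yaml
# MLTable file co-located with the CSVs; paths are relative to this file.
paths:
  - pattern: ./*.csv
transformations:
  - read_delimited:
      delimiter: ','
      encoding: utf8
      header: all_files_same_headers
```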
Co-locating the `MLTable` file with the data ensures a self-contained artifact: all needed resources are stored in that one folder, regardless of whether the folder lives on your local drive, in your cloud store, or on a public HTTP server. Moreover, since real-life environments are not static, this practice ensures that data checkpoints are always up to date.

Since the `MLTable` file is co-located with the data, the paths defined inside it should be relative to its own location. In the `./setup` folder, you will find the `smoke_detection_iot.csv` file, which has been used for testing purposes.
There are multiple scenarios where data access is restricted and additional permissions must be granted. To showcase how an end-to-end pipeline would take data decryption into account, we introduce two elements:

- A script named `data_encrypt.py` in the `scripts` section.
- A component named `decrypt_data` in the `components` section.

The former makes use of a public Key Vault, whereas the latter requires a private one.
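As a sketch, the command-line interface of such a decryption component might look like this; all argument names here are hypothetical, not the actual signature of `decrypt_data`:

```python
# Hypothetical CLI for a decryption component: the component receives the
# encrypted input folder, the output folder, and the Key Vault coordinates
# of the decryption secret. Argument names are assumptions for illustration.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Fetch a decryption secret from a Key Vault and decrypt raw data"
    )
    parser.add_argument("--input_path", required=True, help="Folder with encrypted files")
    parser.add_argument("--output_path", required=True, help="Folder for decrypted output")
    parser.add_argument("--keyvault_name", required=True, help="Key Vault holding the secret")
    parser.add_argument("--secret_name", required=True, help="Name of the decryption secret")
    return parser
```

In a real component, the secret would be fetched with the cluster's managed identity, which is why the Key Vault access described in the IAM section is required.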
This use case is prepared for supervised learning problems, where the following assumptions are made about the data:

- The dataset only contains numerical features; categorical ones must be preprocessed in advance (see the `call-for-contributions` section).
- The preprocessing applied to the features, and to the target variable in regression scenarios, is standardisation. This might change depending on your specific domain; for example, Tweedie or binary cross-entropy losses can be better suited to highly concentrated distributions.
## 4. Components

Each pipeline component is executed within an AML compute cluster. These clusters are determined by the virtual machines that host them, whose nomenclature has the following meaning:

- `STANDARD`: Tier recommended for VM availability purposes.
- `D`: VMs for any purpose.
- `L`: Optimized in terms of storage.
- `S`: Provides premium storage and offers a local SSD for cache.
- `M`: Optimized in terms of memory.
- `G`: Optimized in terms of memory and storage.
- `E`: Optimized for multi-thread operations in memory.
- `V`: Optimized for intensive graphics work and remote viewing workloads.
- `C`: Optimized for high performance computing and ML workloads.
- `N`: Available GPU(s).
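As a toy illustration of this nomenclature, the following hypothetical helper decodes simple VM size names; the parsing rules are an assumption and ignore lowercase suffixes such as `ads`:

```python
# Hypothetical decoder for simple Azure VM size names such as
# STANDARD_DS11_v2, based on the letter meanings listed above.
import re

FEATURES = {
    "D": "VMs for any purpose",
    "L": "storage optimized",
    "S": "premium storage with local SSD cache",
    "M": "memory optimized",
    "G": "memory and storage optimized",
    "E": "multi-thread operations in memory",
    "V": "graphics and remote viewing workloads",
    "C": "high performance computing and ML workloads",
    "N": "GPU(s) available",
}

def decode_vm_size(name):
    tier, spec, *rest = name.split("_")
    letters = re.match(r"[A-Za-z]+", spec).group(0).upper()
    return {
        "tier": tier,
        "features": [FEATURES.get(ch, "unknown") for ch in letters],
        "version": rest[-1] if rest else "v1",
    }
```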
In order to optimize the resources used in each module of the process, the following virtual machines are recommended:

- `cpu-cluster`: The default host is `STANDARD_DS11_v2`. If four processing cores are needed, consider using `STANDARD_DS3_v2`; for CPU-intensive machine learning training, `Standard_DC16ds_v3` is recommended.
- `gpu-cluster`:
  - For training on a single GPU when testing solutions and proofs of concept, it is recommended to use `Standard_NC24ads_A100_v4` (with a small batch size) because of its low price.
  - For single-GPU training on stable solutions and validated pipelines, it is preferable to use one like `Standard_NC16as_T4_v3` if the model does not require large memory capacity, or `Standard_NC24ads_A100_v4` for its processing power.
  - For training on multiple GPUs when testing solutions and proofs of concept, it is recommended to use `Standard_NC12` (with a small batch size) due to its low price, which makes Tesla K80 GPUs available.
  - For training on multiple GPUs in stable solutions and validated pipelines, it is preferable to use one like `Standard_NC64as_T4_v3` (4 GPUs, Tesla T4) if the model does not require large memory capacity, `Standard_NC48ads_A100_v4` (2 GPUs, Tesla A100) for its good price-performance ratio, or `Standard_ND96amsr_A100_v4` (8 GPUs, Tesla A100) for its extensive processing power.
In the `./setup` folder, you will find a shell script (together with configuration files) to create CPU and GPU clusters.
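For reference, a compute cluster equivalent to the ones created by that script could be described with an AML v2 YAML file like the sketch below (the cluster name and instance limits are assumptions) and created with `az ml compute create -f cpu-cluster.yaml`:

```yaml
# Hypothetical compute cluster definition for the CPU cluster.
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cpu-cluster
type: amlcompute
size: STANDARD_DS11_v2
min_instances: 0
max_instances: 4
idle_time_before_scale_down: 1800
```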
Pipelines, and in particular AML components, are based on those of Kubeflow, so readers who have previously used Kubeflow will be familiar with the structure of these projects. In order to maintain good practices and avoid possible errors, we recommend keeping the standardised format followed by the components of this project: freely modify the part of each component where your code goes, and adjust the rest of the files accordingly.
For components that require only CPU units, we follow a structure like this:

```
├── <component-name>              # Where the component lies
│   ├── .amlignore                # Data not to be included in AML component creation
│   ├── docker
│   │   └── Dockerfile            # Environment and dependencies definition
│   ├── <component-name>.yaml     # Component configuration file
│   └── main.py                   # Entrypoint
```
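For illustration, a `<component-name>.yaml` following this structure might look like the sketch below; the concrete inputs, outputs and command are assumptions, not the actual contents of this repository's components:

```yaml
# Hypothetical command component definition (AML v2 schema).
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: decrypt_data
display_name: Decrypt data
version: 1
type: command
inputs:
  input_path:
    type: uri_folder
outputs:
  output_path:
    type: uri_folder
code: .
environment:
  build:
    path: ./docker
command: >-
  python main.py
  --input_path ${{inputs.input_path}}
  --output_path ${{outputs.output_path}}
```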
Hosting your code in a repository is the best way to track changes and serialise releases. However, the AML ecosystem offers additional functionality that lets users plug and play with pipelines. By opening a terminal and running the shell script contained in the `./setup` folder, with `<workspace-name>` and `<resource-group-name>` as parameters, an instance of your custom components will be created, so that you can prepare, run and monitor pipeline workflows using only the UI (in the `Pipelines` section).
## 5. IAM

AML compute clusters use a service account to which we must assign a series of roles in order to execute these processes successfully:
- Storage Blob Data Contributor (in storage account resource)
- Storage Queue Data Contributor (in storage account resource)
- AzureML Data Scientist (in AML resource)
- Access to KeyVaults
## 6. Quickstart

Once the environment has been created, permissions for the service account have been granted, and you have filled the configuration file with your own data, the fastest way to run AML pipelines is to open a terminal and launch the `azureml_pipeline.py` script, passing the configuration file path as a parameter.
## 7. Call for contributions

Despite including an end-to-end solution for model design in AML, the following additional features are expected to be developed:

- Adapt the `build_MLTable` function in the pipeline job script to replicate the Data Asset config file (done by @demstalferez).
- Monolithic code snippet using MLTable.
- Inference endpoint creation.
- Automatic provisioning and register of compute clusters and environments.
- User-friendly app development.
## 8. License

Released under MIT by @hedrergudene.