Create an ML pipeline
Start with the basics of steps and pipelines.
In the quest for production-ready ML models, workflows can quickly become complex. Decoupling and standardizing stages such as data ingestion, preprocessing, and model evaluation allows for more manageable, reusable, and scalable processes. ZenML pipelines facilitate this by enabling each stage—represented as Steps—to be modularly developed and then integrated smoothly into an end-to-end Pipeline.
Leveraging ZenML, you can create and manage robust, scalable machine learning (ML) pipelines. Whether for data preparation, model training, or deploying predictions, ZenML standardizes and streamlines the process, ensuring reproducibility and efficiency.
Before starting this guide, make sure you have installed ZenML, for example with pip install "zenml[server]".
Let's jump into an example that demonstrates how a simple pipeline can be set up in ZenML, featuring actual ML components to give you a better sense of its application.
@step is a decorator that converts its function into a step that can be used within a pipeline.
@pipeline defines a function as a pipeline; within this function, the steps are called and their outputs link them together.
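A minimal sketch of such a pipeline could look like the following. The step names load_data and train_model, the toy data, and the mock training logic are illustrative assumptions rather than part of ZenML, and the try/except fallback is there only so the sketch also runs as plain Python where ZenML is not installed.

```python
try:
    from zenml import pipeline, step
except ImportError:
    # Assumption for this sketch only: without ZenML installed, fall back to
    # no-op decorators so the functions below still run as plain Python.
    def step(func):
        return func

    pipeline = step


@step
def load_data() -> dict:
    """Simulate loading training data and labels (toy values)."""
    training_data = [[1, 2], [3, 4], [5, 6]]
    labels = [0, 1, 0]
    return {"features": training_data, "labels": labels}


@step
def train_model(data: dict) -> None:
    """Mock 'training' that only summarizes the input data."""
    total_features = sum(map(sum, data["features"]))
    total_labels = sum(data["labels"])
    print(
        f"Trained model using {len(data['features'])} data points. "
        f"Feature sum is {total_features}, label sum is {total_labels}"
    )


@pipeline
def simple_ml_pipeline():
    """Connect the steps: the output of load_data feeds train_model."""
    dataset = load_data()
    train_model(dataset)


if __name__ == "__main__":
    simple_ml_pipeline()
```

With ZenML installed, calling simple_ml_pipeline() runs both steps as a tracked pipeline run. With the fallback decorators, it simply calls the two functions in order.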
Copy this code into a new file and name it run.py. Then run it from your command line with python run.py.
Once the pipeline has finished its execution, use the zenml login --local command to view the results in the ZenML Dashboard. Using that command will open up the browser automatically.
Usually, the dashboard is accessible at http://127.0.0.1:8237/. Log in with the default username "default" (password not required) and see your recently run pipeline. Browse through the pipeline components, such as the execution history and artifacts produced by your steps. Use the DAG visualization to understand the flow of data and to ensure all steps are completed successfully.
For further insights, explore the logging and artifact information associated with each step, which can reveal details about the data and intermediate results.
If you have closed the browser tab with the ZenML dashboard, you can always reopen it by running zenml show in your terminal.
When you ran the pipeline, each individual function that ran is shown in the DAG visualization as a step and is marked with the function name. Steps are connected with artifacts, which are simply the objects that are returned by these functions and input into downstream functions. This simple logic lets us break down our entire machine learning code into a sequence of tasks that pass data between each other.
The artifacts produced by your steps are automatically stored and versioned by ZenML. The code that produced these artifacts is also automatically tracked, and the parameters and all other configuration are automatically captured as well.
As you can see, by simply structuring your code within some functions and adding some decorators, we are one step closer to having a more tracked and reproducible codebase!
With the fundamentals in hand, let's extend our simple pipeline to a complete ML workflow. For this task, we will use the well-known Iris dataset to train a Support Vector Classifier (SVC).
Let's start with the imports.
Make sure to install the requirements as well, for example by running zenml integration install sklearn.
In this case, ZenML has an integration with sklearn so you can use the ZenML CLI to install the right version directly.
The zenml integration install sklearn command is simply doing a pip install of sklearn behind the scenes. If something goes wrong, you can always use zenml integration requirements sklearn to see which requirements are compatible and install them using pip (or any other tool) directly. (If no specific requirements are mentioned for an integration, this means we support using all possible versions of that integration/package.)
A typical start of an ML pipeline is usually loading data from some source. This step will sometimes have multiple outputs. To define such a step, use a Tuple type annotation. Additionally, you can use the Annotated annotation to assign custom output names. Here we load an open-source dataset and split it into a train and a test dataset.
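The annotation pattern can be sketched as follows. To keep the snippet free of extra dependencies, it swaps the Iris dataset for a tiny hand-written one and uses a fixed, unshuffled 80/20 split. The step name training_data_loader, the values, and the output names are illustrative assumptions, and the try/except fallback only keeps the sketch runnable without ZenML.

```python
from typing import Annotated, Tuple

try:
    from zenml import step
except ImportError:
    # Assumption for this sketch only: no-op decorator when ZenML is absent.
    def step(func):
        return func


@step
def training_data_loader() -> Tuple[
    Annotated[list, "X_train"],
    Annotated[list, "X_test"],
    Annotated[list, "y_train"],
    Annotated[list, "y_test"],
]:
    """Load a tiny hand-written dataset and split it into train and test parts."""
    # Stand-in data (not Iris): two features per sample, binary labels.
    features = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.0], [6.7, 3.1]]
    labels = [0, 0, 1, 1, 1]

    # Fixed 80/20 split without shuffling, so the result is deterministic.
    split = len(features) * 4 // 5
    return features[:split], features[split:], labels[:split], labels[split:]
```

Each element of the Tuple is one output of the step, and the string inside Annotated is the name given to that output.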
ZenML records the root Python logging handler's output into the artifact store as a side effect of running a step. Therefore, when writing steps, use the logging module to record logs, so that they show up in the ZenML dashboard.
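As a sketch of that advice, the function below logs through the standard logging module instead of print. The function name count_rows and its messages are hypothetical. In a pipeline it would carry the @step decorator, which is left out here so the snippet runs without ZenML.

```python
import logging

# Module-level logger; its records propagate up to the root logger.
logger = logging.getLogger(__name__)


def count_rows(rows: list) -> int:
    """Hypothetical step body that reports progress through logging."""
    logger.info("Received %d rows", len(rows))
    if not rows:
        logger.warning("Input is empty; downstream steps may fail")
    return len(rows)


if __name__ == "__main__":
    # Outside ZenML, configure the root logger so INFO messages are printed.
    logging.basicConfig(level=logging.INFO)
    count_rows([[5.1, 3.5], [4.9, 3.0]])
```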
Here we are creating a training step for a support vector machine classifier with sklearn. As we might want to adjust the hyperparameter gamma later on, we define it as an input value to the step as well.
If you want to run just a single step on your ZenML stack, all you need to do is call the step function outside of a ZenML pipeline, passing it the inputs it expects.
Next, we will combine our two steps into a pipeline and run it. As you can see, the parameter gamma is configurable as a pipeline input as well.
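The wiring can be sketched as below. So that the snippet needs nothing beyond the standard library, it swaps the sklearn SVC for a small RBF-weighted vote classifier that also takes a gamma parameter, and it uses a tiny hand-written dataset instead of Iris. The name rbf_vote_trainer, the data, and the gamma values are illustrative assumptions, and the try/except fallback only keeps the sketch runnable without ZenML.

```python
import math
from typing import Annotated, Tuple

try:
    from zenml import pipeline, step
except ImportError:
    # Assumption for this sketch only: no-op decorators when ZenML is absent.
    def step(func):
        return func

    pipeline = step


@step
def training_data_loader() -> Tuple[
    Annotated[list, "X_train"],
    Annotated[list, "y_train"],
]:
    """Return a tiny hand-written training set (stand-in for Iris)."""
    X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
    y_train = [0, 0, 1, 1]
    return X_train, y_train


@step
def rbf_vote_trainer(X_train: list, y_train: list, gamma: float = 0.001) -> float:
    """Stand-in for the SVC step: training accuracy of an RBF-weighted vote."""
    correct = 0
    for x, y in zip(X_train, y_train):
        votes = {}
        for other, label in zip(X_train, y_train):
            dist_sq = sum((a - b) ** 2 for a, b in zip(x, other))
            # Closer points get a larger weight; gamma controls the falloff.
            votes[label] = votes.get(label, 0.0) + math.exp(-gamma * dist_sq)
        if max(votes, key=votes.get) == y:
            correct += 1
    return correct / len(y_train)


@pipeline
def training_pipeline(gamma: float = 0.002):
    """Pass the pipeline-level gamma down to the training step."""
    X_train, y_train = training_data_loader()
    rbf_vote_trainer(X_train=X_train, y_train=y_train, gamma=gamma)


if __name__ == "__main__":
    # The pipeline only runs when this file is executed directly.
    training_pipeline(gamma=0.0015)
```

The value passed to training_pipeline overrides the default declared on the pipeline, which in turn overrides the default declared on the step.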
Best Practice: Always nest the actual execution of the pipeline inside an if __name__ == "__main__" condition. This ensures that loading the pipeline from elsewhere does not also run it.
Running python run.py should print the progress of the pipeline run to the terminal as each step executes.
In the dashboard, you should now be able to see this new run, along with its runtime configuration and a visualization of the training data.
Instead of configuring your pipeline runs in code, you can also do so from a YAML file. This is best when we do not want to make unnecessary changes to the code; in production this is usually the case.
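To do this, reference the file through the pipeline's with_options method, passing the path to the YAML file as its config_path argument, and then call the configured pipeline.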
The reference to a local file will change depending on where you are executing the pipeline and code from, so please bear this in mind. It is best practice to put all config files in a configs directory at the root of your repository and check them into git history.
A simple version of such a YAML file could be:
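For the gamma parameter used in this guide, a minimal configuration might contain only a parameters block. The value 0.01 is an arbitrary illustration.

```yaml
# Pipeline-level parameters, keyed by the pipeline function's argument names.
parameters:
  gamma: 0.01
```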
Please note that this would take precedence over any parameters passed in the code.
If you are unsure how to format this config file, you can generate a template config file from a pipeline.
See the ZenML documentation on configuration for advanced options.
All the code from this section can be combined into a single script so that it is easy to run in one go.