Create an ML pipeline
Start with the basics of steps and pipelines.
In the quest for production-ready ML models, workflows can quickly become complex. Decoupling and standardizing stages such as data ingestion, preprocessing, and model evaluation allows for more manageable, reusable, and scalable processes. ZenML pipelines facilitate this by enabling each stage—represented as Steps—to be modularly developed and then integrated smoothly into an end-to-end Pipeline.
Leveraging ZenML, you can create and manage robust, scalable machine learning (ML) pipelines. Whether for data preparation, model training, or deploying predictions, ZenML standardizes and streamlines the process, ensuring reproducibility and efficiency.
Start with a simple ML pipeline
Let's jump into an example that demonstrates how a simple pipeline can be set up in ZenML, featuring actual ML components to give you a better sense of its application.
Copy this code into a new file and name it run.py. Then run it with your command line:
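The listing that belongs here did not survive in this copy of the page, so what follows is a minimal sketch of such a pipeline, not necessarily the original code. It assumes the `step` and `pipeline` decorators imported from `zenml`; the step names, the toy data, and the no-op fallback decorators are illustrative assumptions (the fallback only exists so the sketch also runs where ZenML is not installed).

```python
# Fallback so the sketch also runs without ZenML installed. These two no-op
# decorators are NOT part of ZenML; drop them in a real project.
try:
    from zenml import pipeline, step
except ImportError:
    def step(func):
        return func

    def pipeline(func):
        return func


@step
def load_data() -> dict:
    """Simulate loading training data and labels (toy values)."""
    features = [[1, 2], [3, 4], [5, 6]]
    labels = [0, 1, 0]
    return {"features": features, "labels": labels}


@step
def train_model(data: dict) -> None:
    """Stand-in for training: only summarises the data it received."""
    total_features = sum(map(sum, data["features"]))
    total_labels = sum(data["labels"])
    print(
        f"Trained model using {len(data['features'])} data points. "
        f"Feature sum is {total_features}, label sum is {total_labels}."
    )


@pipeline
def simple_ml_pipeline():
    """Connect the steps: the output of load_data feeds train_model."""
    dataset = load_data()
    train_model(dataset)


if __name__ == "__main__":
    simple_ml_pipeline()
```

Saved as run.py, a script like this would be started with `python run.py`.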
Explore the dashboard
Once the pipeline has finished its execution, use the zenml login --local command to view the results in the ZenML Dashboard. Using that command will open up the browser automatically.
Usually, the dashboard is accessible at http://127.0.0.1:8237/. Log in with the default username "default" (no password required) to see your most recent pipeline run. Browse the pipeline components, such as the execution history and the artifacts produced by your steps. ZenML offers two visualization modes: the DAG view shows pipeline structure and dependencies, so you can follow the flow of data and confirm that every step completed successfully, while the Timeline view is meant for analyzing execution performance. For pipelines with many steps, the Timeline view provides a cleaner interface for performance optimization.
For further insights, explore the logging and artifact information associated with each step, which can reveal details about the data and intermediate results.
If you have closed the browser tab with the ZenML dashboard, you can always reopen it by running zenml show in your terminal.
Understanding steps and artifacts
When you ran the pipeline, each individual function that ran is shown in the run view (DAG or Timeline) as a step and is marked with the function name. Steps are connected with artifacts, which are simply the objects that are returned by these functions and input into downstream functions. This simple logic lets us break down our entire machine learning code into a sequence of tasks that pass data between each other.
The artifacts produced by your steps are automatically stored and versioned by ZenML. The code that produced these artifacts is tracked automatically as well, as are the parameters and all other configuration.
As you can see, simply structuring your code into functions and adding a few decorators brings us one step closer to a tracked and reproducible codebase!
Expanding to a Full Machine Learning Workflow
With the fundamentals in hand, let's extend our simple pipeline into a complete ML workflow. For this task, we will use the well-known Iris dataset to train a Support Vector Classifier (SVC).
Let's start with the imports.
Make sure to install the requirements as well:
In this case, ZenML has an integration with sklearn so you can use the ZenML CLI to install the right version directly.
Define a data loader with multiple outputs
An ML pipeline typically starts by loading data from some source, and this step will sometimes have multiple outputs. To define such a step, use a Tuple type annotation for the return value. Additionally, you can wrap each element in Annotated to assign custom output names. Here we load an open-source dataset and split it into a train and a test dataset.
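As a rough illustration of the annotation pattern, the sketch below returns four named outputs from one step. To stay dependency-free it uses a handful of hypothetical hard-coded rows and plain lists instead of the Iris dataset, and a fixed split instead of a random one; the step name, the output names, and the no-op fallback decorator are assumptions made for this example.

```python
from typing import Annotated, List, Tuple

# Fallback so the sketch also runs without ZenML installed (not part of ZenML).
try:
    from zenml import step
except ImportError:
    def step(func):
        return func

# Hypothetical stand-in rows of (features, label); the text uses the Iris dataset.
ROWS = [
    ([5.1, 3.5], 0),
    ([4.9, 3.0], 0),
    ([6.2, 2.9], 1),
    ([5.9, 3.0], 1),
    ([4.7, 3.2], 0),
    ([6.7, 3.1], 1),
]


@step
def training_data_loader() -> Tuple[
    Annotated[List[List[float]], "X_train"],
    Annotated[List[List[float]], "X_test"],
    Annotated[List[int], "y_train"],
    Annotated[List[int], "y_test"],
]:
    """Split the rows into train and test parts and return four named outputs."""
    split = 4  # first four rows for training, the remaining rows for testing
    train, test = ROWS[:split], ROWS[split:]
    X_train = [features for features, _ in train]
    X_test = [features for features, _ in test]
    y_train = [label for _, label in train]
    y_test = [label for _, label in test]
    return X_train, X_test, y_train, y_test
```

The string inside each `Annotated[...]` is the custom output name; the Tuple wrapper tells the step that it produces several separate outputs rather than a single tuple-valued one.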
Create a parameterized training step
Here we are creating a training step for a support vector machine classifier with sklearn. As we might want to adjust the hyperparameter gamma later on, we define it as an input value to the step as well.
Next, we will combine our two steps into a pipeline and run it. As you can see, the parameter gamma is configurable as a pipeline input as well.
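The sketch below shows only the parameter plumbing described above: `gamma` is declared on the pipeline, handed to the training step, and overridden at call time. To keep it runnable without extra packages, a tiny RBF similarity classifier stands in for sklearn's SVC and a toy data loader stands in for the Iris loader; both, along with all names and the no-op fallback decorators, are assumptions for illustration and not the original code.

```python
import math
from typing import Annotated, List, Tuple

# Fallback so the sketch also runs without ZenML installed (not part of ZenML).
try:
    from zenml import pipeline, step
except ImportError:
    def step(func):
        return func

    def pipeline(func):
        return func


@step
def toy_data_loader() -> Tuple[
    Annotated[List[List[float]], "X_train"],
    Annotated[List[int], "y_train"],
]:
    """Hypothetical stand-in data: two well-separated classes."""
    X_train = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]]
    y_train = [0, 0, 1, 1]
    return X_train, y_train


@step
def kernel_trainer(
    X_train: List[List[float]],
    y_train: List[int],
    gamma: float = 0.001,
) -> Annotated[float, "train_acc"]:
    """Stand-in for an SVC: pick the class with the largest summed RBF similarity."""

    def predict(x: List[float]) -> int:
        scores = {}
        for xi, yi in zip(X_train, y_train):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
            scores[yi] = scores.get(yi, 0.0) + math.exp(-gamma * sq_dist)
        return max(scores, key=scores.get)

    correct = sum(predict(x) == y for x, y in zip(X_train, y_train))
    train_acc = correct / len(y_train)
    print(f"Train accuracy with gamma={gamma}: {train_acc}")
    return train_acc


@pipeline
def training_pipeline(gamma: float = 0.002):
    """The pipeline parameter gamma is forwarded to the training step."""
    X_train, y_train = toy_data_loader()
    kernel_trainer(X_train=X_train, y_train=y_train, gamma=gamma)


if __name__ == "__main__":
    # Override the pipeline default for this particular run.
    training_pipeline(gamma=0.0073)
```

Declaring `gamma` at both levels keeps the step reusable on its own while letting each pipeline run choose its own value.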
Running python run.py should look somewhat like this in the terminal:
In the dashboard, you should now be able to see this new run, along with its runtime configuration and a visualization of the training data.
Configure with a YAML file
Instead of configuring your pipeline runs in code, you can also do so from a YAML file. This is best when we do not want to make unnecessary changes to the code; in production this is usually the case.
To do this, simply reference the file like this:
The reference to a local file will change depending on where you are executing the pipeline and code from, so please bear this in mind. It is best practice to put all config files in a configs directory at the root of your repository and check them into git history.
A simple version of such a YAML file could be:
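The YAML listing is missing from this copy of the page. As a hedged sketch, the helper below writes a minimal config into a `configs` directory and returns an absolute path, which avoids the working-directory problem mentioned above. The top-level `parameters` key, the value `0.01`, the file name, and the helper itself are assumptions for illustration; check the exact schema against your ZenML version.

```python
import tempfile
from pathlib import Path

# Assumed minimal run configuration: pipeline parameters under `parameters`.
# The value 0.01 is only an example.
CONFIG_TEXT = "parameters:\n  gamma: 0.01\n"


def write_config(repo_root: Path, name: str = "training_config.yaml") -> Path:
    """Write the config into <repo_root>/configs and return its absolute path."""
    config_dir = repo_root / "configs"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / name
    config_path.write_text(CONFIG_TEXT, encoding="utf-8")
    return config_path.resolve()


if __name__ == "__main__":
    # A temporary directory stands in for the repository root in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_config(Path(tmp))
        print(path.read_text(encoding="utf-8"))
        # With ZenML installed, the file would then be referenced roughly as:
        # training_pipeline.with_options(config_path=str(path))()
```

In a real repository you would create the file by hand and commit it; the helper only makes the file layout and contents explicit.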
Please note that values set in the YAML file take precedence over any parameters passed in the code.
If you are unsure how to format this config file, you can generate a template config file from a pipeline.
Check out this section for advanced configuration options.
Full Code Example
The following combines all the code from this page into a single script that you can run directly: