Steps & Pipelines
Steps and Pipelines are the fundamental building blocks of ZenML. A Step is a reusable unit of computation, and a Pipeline is a directed acyclic graph (DAG) composed of steps. Together, they allow you to define, version, and execute machine learning workflows.
The Relationship Between Steps and Pipelines
In ZenML, steps and pipelines work together in a clear hierarchy:
Steps are individual functions that perform specific tasks, like loading data, processing it, or training models
Pipelines orchestrate these steps, connecting them in a defined sequence where outputs from one step can flow as inputs to others
Each step produces artifacts that are tracked, versioned, and can be reused across pipeline runs
Think of a step as a single LEGO brick, and a pipeline as the complete structure you build by connecting many bricks together.
Basic Steps
Creating a Simple Step
A step is created by applying the `@step` decorator to a Python function:
Step Inputs and Outputs
Steps can take inputs and produce outputs. These can be simple types, complex data structures, or custom objects.
In this example:
The step takes a `dict` as input containing features and labels
It processes the features and computes some statistics
It returns a new `dict` as output with the processed data and additional information
Custom Output Names
You can name your step outputs using the `Annotated` type:
By default, step outputs are named `output` for single-output steps and `output_0`, `output_1`, etc. for steps with multiple outputs.
Basic Pipelines
Creating a Simple Pipeline
A pipeline is created by applying the `@pipeline` decorator to a Python function that composes steps together:
Running Pipelines
You can run a pipeline by simply calling the function:
The run is automatically logged to the ZenML dashboard where you can view the DAG and associated metadata.
End-to-End Example
Here's a simple end-to-end example that demonstrates the basic workflow:
Parameters and Artifacts
Understanding the Difference
ZenML distinguishes between two types of inputs to steps:
Artifacts: Outputs from other steps in the same pipeline
These are tracked, versioned, and stored in the artifact store
They are passed between steps and represent data flowing through your pipeline
Examples: datasets, trained models, evaluation metrics
Parameters: Direct values provided when invoking a step
These are typically simple configuration values passed directly to the step
They're not tracked as separate artifacts but are recorded with the pipeline run
Examples: learning rates, batch sizes, model hyperparameters
This example demonstrates the difference:
Parameter Types
Parameters can be:
Primitive types: `int`, `float`, `str`, `bool`
Container types: `list`, `dict`, `tuple` (containing primitives)
Custom types: as long as they can be serialized to JSON using Pydantic
Parameters that cannot be serialized to JSON should be passed as artifacts rather than parameters.
Parameterizing Workflows
Step Parameterization
Steps can take parameters like regular Python functions:
Pipeline Parameterization
Pipelines can also be parameterized, allowing values to be passed down to steps:
You can then run the pipeline with specific parameters:
Step Type Handling & Output Management
Type Annotations
While optional, type annotations are highly recommended and provide several benefits:
Type validation: ZenML validates inputs against type annotations at runtime to catch errors early.
Code documentation: Types make your code more self-documenting and easier to understand.
When you specify a return type like `-> float` or `-> Tuple[int, int]`, ZenML uses this information to determine how to store the step's output in the artifact store. For instance, a step returning a pandas DataFrame with the annotation `-> pd.DataFrame` will use the pandas-specific materializer for efficient storage.
Multiple Return Values
Steps can return multiple artifacts:
ZenML uses the following convention to differentiate between a single output of type `Tuple` and multiple outputs:
When the `return` statement is followed by a tuple literal (e.g., `return 1, 2` or `return (value_1, value_2)`), it's treated as a step with multiple outputs
All other cases are treated as a step with a single output of type `Tuple`
Conclusion
Steps and Pipelines provide a flexible, powerful way to build machine learning workflows in ZenML. This guide covered the basic concepts of creating steps and pipelines, managing inputs and outputs, and working with parameters.