Step Parameterization and Caching
Iteration is native to ZenML.
Machine learning pipelines are rerun many times throughout their development lifecycle. To iterate quickly, you must be able to tweak pipeline runs by changing various parameters of the steps within a pipeline.
You can configure your pipelines at runtime in the following ways:
- `BaseParameters`: Runtime configuration passed down as a parameter to step functions.
- `BaseSettings`: Runtime settings passed down to stack components and pipelines.
You can parameterize a step by creating a subclass of `BaseParameters`. When such a config object is passed to a step, it is not treated like other artifacts. Instead, it gets passed into the step when the pipeline is instantiated.
```python
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.svm import SVC
from zenml.steps import step, BaseParameters

class SVCTrainerParams(BaseParameters):
    gamma: float = 0.001

@step
def svc_trainer(
    params: SVCTrainerParams,
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a sklearn SVC classifier."""
    model = SVC(gamma=params.gamma)
    model.fit(X_train, y_train)
    return model
```
The default value for the `gamma` parameter is set to `0.001`. However, when the pipeline is instantiated you can override the default like this:
```python
# Override the default gamma when instantiating the pipeline.
first_pipeline_instance = first_pipeline(
    step_1=digits_data_loader(),
    step_2=svc_trainer(SVCTrainerParams(gamma=0.01)),
)
first_pipeline_instance.run()
```
Try running the above pipeline, changing the parameter `gamma` across many runs. In essence, each pipeline can be viewed as an experiment, and each run as a trial of that experiment, defined by its `BaseParameters`. You can always retrieve the parameters when you fetch pipeline runs, to compare the various runs.
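To make the experiment/trial analogy concrete, here is a small, purely illustrative Python sketch. The run records and accuracy values below are invented for illustration and are not fetched through ZenML's API; in practice you would pull the parameters from your fetched pipeline runs.

```python
# Hypothetical run records: each run stores the parameters it was
# configured with and a resulting metric, mimicking what you might
# assemble after fetching pipeline runs.
runs = [
    {"params": {"gamma": 0.001}, "accuracy": 0.92},
    {"params": {"gamma": 0.01}, "accuracy": 0.95},
    {"params": {"gamma": 0.1}, "accuracy": 0.89},
]

# Compare the trials and find the best-performing parameter set.
best = max(runs, key=lambda run: run["accuracy"])
print(best["params"])  # → {'gamma': 0.01}
```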
When you tweaked the `gamma` variable above, you may have noticed that the `digits_data_loader` step does not re-execute on subsequent runs. This is because ZenML understands that nothing has changed between runs, so it reuses the output of the last run (the outputs are persisted in the artifact store). This behavior is known as caching.
ZenML comes with caching enabled by default. Since ZenML automatically tracks and versions all inputs, outputs, and parameters of steps and pipelines, ZenML will not re-execute steps within the same pipeline on subsequent pipeline runs as long as there is no change in these three.
Although caching is desirable in many circumstances, you might want to disable it in certain instances: for example, when you are rapidly prototyping with changing step definitions, or when a step depends on external state (such as an API) whose changes ZenML cannot detect.
There are multiple ways to take control of when and where caching is used:
On the pipeline level, the caching policy can be set as a parameter within the decorator:

```python
from zenml.pipelines import pipeline

@pipeline(enable_cache=False)
def first_pipeline(step_1, step_2):
    """Pipeline with cache disabled."""
    # Connect your steps here as usual.
    ...
```
Caching can also be explicitly turned off at the step level. You might want to turn off caching for steps that take external input (like fetching data from an API or file I/O):

```python
from zenml.steps import step

@step(enable_cache=False)
def import_data_from_api() -> dict:
    """Import most up-to-date data from public API."""
    # Fetch fresh data on every run; details omitted.
    ...
```
Sometimes you want control over caching at runtime instead of defaulting to the baked-in configuration of your pipeline and its steps. ZenML offers a way to override all caching settings of the pipeline at runtime.
The following example shows caching in action with the code example from the previous section.