Cache previous executions
Iterating quickly with ZenML through caching.
Developing machine learning pipelines is iterative by nature. ZenML speeds up this workflow with step caching.
In the logs of your previous runs, you might have noticed that rerunning the pipeline a second time uses caching for the first step:
Step training_data_loader has started.
Using cached version of training_data_loader.
Step svc_trainer has started.
Train accuracy: 0.3416666666666667
Step svc_trainer has finished in 0.932s.

ZenML understands that nothing has changed between subsequent runs, so it re-uses the output of the previous run (the outputs are persisted in the artifact store). This behavior is known as caching.
In ZenML, caching is enabled by default. Since ZenML automatically tracks and versions all inputs, outputs, and parameters of steps and pipelines, steps will not be re-executed within the same pipeline on subsequent pipeline runs as long as there is no change in the inputs, parameters, or code of a step.
If you run a pipeline without a schedule, ZenML will be able to compute the cached steps on your client machine. This means that these steps don't have to be executed by your orchestrator, which can save time and money when you're executing your pipelines remotely. If you always want your orchestrator to compute cached steps dynamically, you can set the `ZENML_PREVENT_CLIENT_SIDE_CACHING` environment variable to `True`.
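For example (a minimal sketch; the variable only needs to be set in the environment before the pipeline run is triggered):

import os

# Make the orchestrator compute cached steps itself instead of
# resolving them on the client machine.
os.environ["ZENML_PREVENT_CLIENT_SIDE_CACHING"] = "True"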
Caching does not automatically detect changes within the file system or in external APIs. Make sure to manually set `enable_cache=False` on steps that depend on external inputs or file-system changes, or that should run regardless of caching.
from zenml import step

@step(enable_cache=False)
def load_data_from_external_system(...) -> ...:
    # This step will always be run
    ...
Enabling and disabling the caching behavior of your pipelines
With caching as the default behavior, there will be times when you need to disable it. There are multiple levels at which you can control when and where caching is used.
Caching at the pipeline level
On a pipeline level, the caching policy can be set as a parameter within the `@pipeline` decorator as shown below:
from zenml import pipeline

@pipeline(enable_cache=False)
def first_pipeline(...):
    """Pipeline with cache disabled."""
The setting above will disable caching for all steps in the pipeline unless a step explicitly sets `enable_cache=True` (see below).
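For example, a sketch of a pipeline with caching disabled globally while a single step opts back in (the step and pipeline names are illustrative):

from zenml import pipeline, step

@step(enable_cache=True)  # opts back into caching despite the pipeline setting
def deterministic_step():
    ...

@step  # inherits enable_cache=False from the pipeline
def volatile_step():
    ...

@pipeline(enable_cache=False)
def first_pipeline():
    deterministic_step()
    volatile_step()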
Dynamically configuring caching for a pipeline run
Sometimes you want to have control over caching at runtime instead of defaulting to the hard-coded pipeline and step decorator settings. ZenML offers a way to override all caching settings at runtime:
first_pipeline = first_pipeline.with_options(enable_cache=False)
The code above disables caching for all steps of your pipeline, no matter what you have configured in the `@step` or `@pipeline` decorators.
The `with_options` function allows you to configure all sorts of things this way. We will learn more about it in the coming chapters!
Caching at a step level
Caching can also be explicitly configured at a step level via a parameter of the `@step` decorator:
from zenml import step

@step(enable_cache=False)
def import_data_from_api(...):
    """Import the most up-to-date data from a public API."""
    ...
The code above turns caching off for this step only.
You can also use `with_options` with the step, just as with the pipeline:
import_data_from_api = import_data_from_api.with_options(enable_cache=False)
# use in your pipeline directly
Fine-tuning caching with cache policies
ZenML offers fine-grained control over caching behavior through cache policies. A cache policy determines what factors are considered when generating the cache key for a step. By default, ZenML uses all available information, but you can customize this to optimize caching for your specific use case.
Understanding cache keys
ZenML generates a unique cache key for each step execution based on various factors:

* Step code: The actual implementation of your step function
* Step parameters: Configuration parameters passed to the step
* Input artifact values or IDs: The content/data of input artifacts or their IDs

When any of these factors change, the cache key changes, and the step will be re-executed.
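For example, changing a step parameter between runs produces a new cache key and forces re-execution (a minimal sketch with an illustrative `gamma` parameter):

from zenml import pipeline, step

@step
def svc_trainer(gamma: float = 0.001):
    ...

@pipeline
def training_pipeline(gamma: float = 0.001):
    svc_trainer(gamma=gamma)

training_pipeline(gamma=0.001)  # executes the step
training_pipeline(gamma=0.001)  # same inputs, parameters, and code: cached
training_pipeline(gamma=0.01)   # parameter changed: the step re-executes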
Configuring cache policies
You can configure cache policies at both the step and pipeline level using the `CachePolicy` class. Similar to enabling and disabling the cache above, you can define this cache policy on both pipelines and steps, either via the decorator or the `with_options(...)` method. Configuring a cache policy for a pipeline will configure it for all its steps.
from zenml import step, pipeline
from zenml.config import CachePolicy

custom_cache_policy = CachePolicy(include_step_code=False)

@step(cache_policy=custom_cache_policy)
def my_step():
    ...

# or
my_step = my_step.with_options(cache_policy=custom_cache_policy)

@pipeline(cache_policy=custom_cache_policy)
def my_pipeline():
    ...

# or
my_pipeline = my_pipeline.with_options(cache_policy=custom_cache_policy)
Cache policy options
Each cache policy option controls a different aspect of caching:
* `include_step_code` (default: `True`): Controls whether changes to your step implementation invalidate the cache. Setting `include_step_code=False` can lead to unexpected behavior if you modify your step logic but expect the changes to take effect.
* `include_step_parameters` (default: `True`): Controls whether step parameter changes invalidate the cache.
* `include_artifact_values` (default: `True`): Whether to include the artifact values in the cache key. If the materializer for an artifact doesn't support generating a content hash, the artifact ID will be used as a fallback if enabled.
* `include_artifact_ids` (default: `True`): Whether to include the artifact IDs in the cache key.
* `ignored_inputs`: Allows you to exclude specific step inputs from the cache key calculation (see the sketch below).
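For instance, a sketch that excludes one input from the cache key calculation, assuming `ignored_inputs` takes a list of input names (the `request_time` input is illustrative):

from zenml import step
from zenml.config import CachePolicy

@step(cache_policy=CachePolicy(ignored_inputs=["request_time"]))
def process_data(data: dict, request_time: str) -> dict:
    # A changed `request_time` value alone will not invalidate the cache.
    ...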
Code Example
The following script combines all the code from this section into one simple example that you can run to see caching in action.
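A minimal sketch of such a script is below, assuming the iris-based `training_data_loader` and `svc_trainer` steps referenced in the log output above (the exact signatures and parameter values are illustrative):

from typing import Annotated, Tuple

import pandas as pd
from sklearn.base import ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

from zenml import pipeline, step


@step
def training_data_loader() -> Tuple[
    Annotated[pd.DataFrame, "X_train"],
    Annotated[pd.DataFrame, "X_test"],
    Annotated[pd.Series, "y_train"],
    Annotated[pd.Series, "y_test"],
]:
    """Load the iris dataset as a train/test split of pandas objects."""
    iris = load_iris(as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, shuffle=True, random_state=42
    )
    return X_train, X_test, y_train, y_test


@step
def svc_trainer(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    gamma: float = 0.001,
) -> Annotated[ClassifierMixin, "trained_model"]:
    """Train an SVC classifier and print its training accuracy."""
    model = SVC(gamma=gamma)
    model.fit(X_train.to_numpy(), y_train.to_numpy())
    print(f"Train accuracy: {model.score(X_train.to_numpy(), y_train.to_numpy())}")
    return model


@pipeline
def training_pipeline(gamma: float = 0.001):
    X_train, X_test, y_train, y_test = training_data_loader()
    svc_trainer(X_train=X_train, y_train=y_train, gamma=gamma)


if __name__ == "__main__":
    # First run: both steps are executed.
    training_pipeline()

    # Second run with identical code and parameters: both steps are cached.
    training_pipeline()

    # Disable caching at runtime: every step is re-executed.
    training_pipeline.with_options(enable_cache=False)()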
