Artifacts
Learn how ZenML manages data artifacts, tracks versioning and lineage, and enables effective data flow between steps.
Artifacts are a cornerstone of ZenML's ML pipeline management system. This guide explains what artifacts are, how they work, and how to use them effectively in your pipelines.
Artifacts in the Pipeline Workflow
Here's how artifacts fit into the ZenML pipeline workflow:
A step produces data as output
ZenML automatically stores this output as an artifact
Other steps can use this artifact as input
ZenML tracks the relationships between artifacts and steps
This system creates a complete data lineage for every artifact in your ML workflows, enabling reproducibility and traceability.
Basic Artifact Usage
Creating Artifacts (Step Outputs)
Any value returned from a step becomes an artifact:
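A minimal sketch (the step name and data are illustrative):

```python
from zenml import step


@step
def load_data() -> dict:
    """Any value returned from a step is stored as an artifact."""
    return {"features": [[1.0, 2.0], [3.0, 4.0]], "labels": [0, 1]}
```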
Consuming Artifacts (Step Inputs)
You can use artifacts by receiving them as inputs to other steps:
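For example, a hypothetical training step can consume the output of the load_data sketch above simply by declaring a typed input:

```python
from zenml import pipeline, step


@step
def train_model(data: dict) -> None:
    """Receives the artifact produced by an upstream step as input."""
    print(f"Training on {len(data['labels'])} labeled examples")


@pipeline
def simple_pipeline():
    data = load_data()  # the return value is an artifact...
    train_model(data)   # ...that downstream steps consume as input
```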
Artifacts vs. Parameters
When calling a step, inputs can be either artifacts or parameters:
Artifacts are outputs from other steps in the pipeline. They are tracked, versioned, and stored in the artifact store.
Parameters are literal values provided directly to the step. They aren't stored as artifacts but are recorded with the pipeline run.
Parameters are limited to JSON-serializable values (numbers, strings, lists, dictionaries, etc.). More complex objects should be passed as artifacts.
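A sketch contrasting the two (step names and values are illustrative):

```python
from zenml import pipeline, step


@step
def trainer(data: dict, learning_rate: float) -> None:
    ...


@pipeline
def training_pipeline():
    data = load_data()                      # artifact: output of another step
    trainer(data=data, learning_rate=0.01)  # parameter: JSON-serializable literal
```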
Accessing Artifacts After Pipeline Runs
You can access artifacts from completed runs using the ZenML Client:
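A sketch using the Client, assuming the simple_pipeline example above has been run (exact accessors can vary slightly between ZenML versions):

```python
from zenml.client import Client

client = Client()

# Fetch the most recent run of a pipeline by name
run = client.get_pipeline("simple_pipeline").last_run

# For a step with a single output, load the artifact back into memory
data = run.steps["load_data"].output.load()
```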
Working with Artifact Types
Type Annotations
Type annotations are important when working with artifacts as they:
Help ZenML select the appropriate materializer for storage
Validate inputs and outputs at runtime
Document the data flow of your pipeline
ZenML supports many common data types out of the box:
Primitive types (`int`, `float`, `str`, `bool`)
Container types (`dict`, `list`, `tuple`)
NumPy arrays
Pandas DataFrames
Many ML model formats (through integrations)
Returning Multiple Outputs
Steps can return multiple artifacts using tuples:
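A sketch, using a Tuple annotation so ZenML treats each element as its own artifact:

```python
from typing import Tuple

from zenml import step


@step
def split_data(data: dict) -> Tuple[dict, dict]:
    """Returns two artifacts: a train split and a test split."""
    labels = data["labels"]
    return {"labels": labels[:-10]}, {"labels": labels[-10:]}
```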
ZenML differentiates between:
A step with multiple outputs: `return a, b` or `return (a, b)`
A step with a single tuple output: `return some_tuple`
Naming Your Artifacts
By default, artifacts are named based on their position or variable name:
Single outputs are named `output`
Multiple outputs are named `output_0`, `output_1`, etc.
You can give your artifacts more meaningful names using the `Annotated` type:
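A sketch naming the two outputs of the split step above (the names are illustrative):

```python
from typing import Tuple

from typing_extensions import Annotated

from zenml import step


@step
def split_data(data: dict) -> Tuple[
    Annotated[dict, "train_set"],
    Annotated[dict, "test_set"],
]:
    labels = data["labels"]
    return {"labels": labels[:-10]}, {"labels": labels[-10:]}
```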
You can even use dynamic naming with placeholders:
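For instance, a sketch using ArtifactConfig (the name pattern is illustrative):

```python
from typing_extensions import Annotated

from zenml import ArtifactConfig, step


@step
def extract_data() -> Annotated[dict, ArtifactConfig(name="raw_data_{date}_{time}")]:
    return {"records": [1, 2, 3]}
```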
ZenML supports these placeholders:
`{date}`: Current date (e.g., "2023_06_15")
`{time}`: Current time (e.g., "14_30_45_123456")
Custom placeholders can be defined using `substitutions`, as shown in the sketch after this list
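As a sketch, a custom placeholder might be supplied via the step's substitutions argument (the placeholder name and value are illustrative):

```python
from typing_extensions import Annotated

from zenml import step


@step(substitutions={"stage": "dev"})
def produce_report() -> Annotated[str, "report_{stage}_{date}"]:
    return "all metrics within tolerance"
```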
How Artifacts Work Under the Hood
Materializers: How Data Gets Stored
Materializers are a key concept in ZenML's artifact system. They handle:
Serializing data when saving artifacts to storage
Deserializing data when loading artifacts from storage
Generating visualizations for the dashboard
Extracting metadata for tracking and searching
When a step produces an output, ZenML automatically selects the appropriate materializer based on the data type (using type annotations). ZenML includes built-in materializers for common data types like:
Primitive types (`int`, `float`, `str`, `bool`)
Container types (`dict`, `list`, `tuple`)
NumPy arrays, Pandas DataFrames, and many other ML-related formats (through integrations)
Here's how materializers work in practice:
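A sketch of a custom materializer for a toy class, loosely following the BaseMaterializer interface (method signatures and helper attributes can differ across ZenML versions):

```python
import json
import os
from typing import Type

from zenml.enums import ArtifactType
from zenml.materializers.base_materializer import BaseMaterializer


class MyObj:
    def __init__(self, name: str):
        self.name = name


class MyObjMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (MyObj,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def load(self, data_type: Type[MyObj]) -> MyObj:
        # Deserialize the artifact from the artifact store
        with self.artifact_store.open(os.path.join(self.uri, "data.json"), "r") as f:
            return MyObj(name=json.load(f)["name"])

    def save(self, my_obj: MyObj) -> None:
        # Serialize the artifact to the artifact store
        with self.artifact_store.open(os.path.join(self.uri, "data.json"), "w") as f:
            json.dump({"name": my_obj.name}, f)
```

With this class defined, ZenML can pick the materializer automatically for any step annotated to return MyObj.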
Lineage and Caching
ZenML automatically tracks the complete lineage of each artifact:
Which step produced it
Which pipeline run it belongs to
Which other artifacts it depends on
Which steps have consumed it
This lineage tracking enables powerful caching capabilities. When you run a pipeline, ZenML checks if any steps have been run before with the same inputs, code, and configuration. If so, it reuses the cached outputs instead of rerunning the step:
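Caching is on by default and can be controlled per step; a sketch (step bodies are illustrative):

```python
from zenml import pipeline, step


@step  # cached by default: reruns only when inputs, code, or config change
def preprocess(data: dict) -> dict:
    return {"labels": sorted(data["labels"])}


@step(enable_cache=False)  # always re-executes, e.g. for non-deterministic work
def train(data: dict) -> None:
    ...


@pipeline
def cached_pipeline():
    train(preprocess(load_data()))
```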
Advanced Artifact Usage
Accessing Artifacts from Previous Runs
You can access artifacts from any previous run by name or ID:
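A sketch (artifact names and versions are illustrative):

```python
from zenml.client import Client

client = Client()

# By name: returns the latest version by default
latest = client.get_artifact_version("train_set")

# Pin an exact version explicitly
pinned = client.get_artifact_version("train_set", "2")

data = latest.load()
```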
You can also access artifacts within steps:
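For example, a step might fetch a previously produced artifact through the Client (note that data fetched this way bypasses ZenML's input tracking, so prefer regular step inputs where possible):

```python
from zenml import step
from zenml.client import Client


@step
def compare_with_baseline(new_data: dict) -> None:
    # The artifact name is illustrative
    baseline = Client().get_artifact_version("train_set").load()
    print(len(baseline["labels"]), len(new_data["labels"]))
```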
Cross-Pipeline Artifact Usage
You can use artifacts produced by one pipeline in another pipeline:
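A sketch of an inference pipeline consuming a model trained by a separate pipeline (names and types are illustrative):

```python
from zenml import pipeline, step
from zenml.client import Client


@step
def predict(model: dict, data: dict) -> None:
    ...


@pipeline
def inference_pipeline():
    # Fetch an artifact produced by a different (training) pipeline
    model = Client().get_artifact_version("trained_model")
    predict(model=model, data=load_data())
```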
This allows you to build modular pipelines that can work together as part of a larger ML system.
Visualizing Artifacts
ZenML automatically generates visualizations for many types of artifacts, viewable in the dashboard:
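For example, in a notebook you can render an artifact's visualization inline (assuming the artifact type has a visualization; the name is illustrative):

```python
from zenml.client import Client

artifact = Client().get_artifact_version("test_set")
artifact.visualize()
```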
Managing Artifacts
Individual artifacts cannot be deleted directly (to prevent broken references). However, you can clean up unused artifacts:
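Using the ZenML CLI:

```shell
zenml artifact prune
```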
This deletes artifacts that are no longer referenced by any pipeline run. You can control this behavior with flags:
`--only-artifact`: Only delete the physical files, keep database entries
`--only-metadata`: Only delete database entries, keep files
`--ignore-errors`: Continue pruning even if some artifacts can't be deleted
Registering Existing Data as Artifacts
Sometimes, you may have data created externally (outside of ZenML pipelines) that you want to use within your ZenML workflows. Instead of reading and materializing this data within a step, you can register existing files or folders as ZenML artifacts directly.
Register an Existing Folder
To register a folder as a ZenML artifact:
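A sketch, assuming the folder already exists on disk (the path and artifact name are illustrative):

```python
from zenml import register_artifact

register_artifact(
    folder_or_file_uri="/path/to/checkpoints",
    name="training_checkpoints",
)
```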
Register an Existing File
Similarly, you can register individual files:
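Again a sketch, with an illustrative path and name:

```python
from zenml import register_artifact

register_artifact(
    folder_or_file_uri="/path/to/model.pt",
    name="final_model_weights",
)
```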
This approach is particularly useful for:
Integrating with external ML frameworks that save their own data
Working with pre-existing datasets
Registering model checkpoints created during training
When you load these artifacts, you'll receive a `pathlib.Path` pointing to a temporary location in your executing environment, ready for use as a normal local path.
Register Framework Checkpoints
A common use case is registering model checkpoints from training frameworks like PyTorch Lightning:
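A sketch, assuming `model` is an existing LightningModule and the checkpoint directory is illustrative:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

from zenml import register_artifact

# Let Lightning write checkpoints to a local directory during training
checkpoint_callback = ModelCheckpoint(dirpath="./checkpoints", save_top_k=1)
trainer = Trainer(max_epochs=3, callbacks=[checkpoint_callback])
trainer.fit(model)  # `model` is your LightningModule

# Register the checkpoint directory as a ZenML artifact afterwards
register_artifact(
    folder_or_file_uri=checkpoint_callback.dirpath,
    name="lightning_checkpoints",
)
```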
You can also extend the `ModelCheckpoint` callback to register each checkpoint as a separate artifact version during training. This approach enables better version control of intermediate checkpoints.
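A heavily simplified sketch of that idea (the hook choice and registration timing are assumptions, not ZenML's official integration):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

from zenml import register_artifact


class RegisteringModelCheckpoint(ModelCheckpoint):
    """Registers a new artifact version whenever a checkpoint is saved (sketch)."""

    def on_train_epoch_end(self, trainer, pl_module):
        super().on_train_epoch_end(trainer, pl_module)
        if self.best_model_path:
            # Reusing the same artifact name creates a new version each time
            register_artifact(
                folder_or_file_uri=self.best_model_path,
                name="epoch_checkpoints",
            )
```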
Conclusion
Artifacts are a central part of ZenML's approach to ML pipelines. They provide:
Automatic versioning and lineage tracking
Efficient storage and caching
Type-safe data handling
Visualization capabilities
Cross-pipeline data sharing
By understanding how artifacts work, you can build more effective, maintainable, and reproducible ML pipelines.
For more information on specific aspects of artifacts, see the dedicated artifact and data management pages in the ZenML documentation.