Artifacts
Learn how ZenML manages data artifacts, tracks versioning and lineage, and enables effective data flow between steps.
Artifacts are a cornerstone of ZenML's ML pipeline management system. This guide explains what artifacts are, how they work, and how to use them effectively in your pipelines.
Artifacts in the Pipeline Workflow
Here's how artifacts fit into the ZenML pipeline workflow:
A step produces data as output
ZenML automatically stores this output as an artifact
Other steps can use this artifact as input
ZenML tracks the relationships between artifacts and steps
This system creates a complete data lineage for every artifact in your ML workflows, enabling reproducibility and traceability.
Basic Artifact Usage
Creating Artifacts (Step Outputs)
Any value returned from a step becomes an artifact:
from zenml import pipeline, step
import pandas as pd

@step
def create_data() -> pd.DataFrame:
    """Creates a dataframe that becomes an artifact."""
    return pd.DataFrame({
        "feature_1": [1, 2, 3],
        "feature_2": [4, 5, 6],
        "target": [10, 20, 30]
    })
Consuming Artifacts (Step Inputs)
You can use artifacts by receiving them as inputs to other steps:
@step
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    """Takes an artifact as input and returns a new artifact."""
    df["feature_3"] = df["feature_1"] * df["feature_2"]
    return df

@pipeline
def simple_pipeline():
    """Pipeline that creates and processes artifacts."""
    data = create_data()  # Produces an artifact
    processed_data = process_data(data)  # Uses and produces artifacts
Artifacts vs. Parameters
When calling a step, inputs can be either artifacts or parameters:
Artifacts are outputs from other steps in the pipeline. They are tracked, versioned, and stored in the artifact store.
Parameters are literal values provided directly to the step. They aren't stored as artifacts but are recorded with the pipeline run.
import pandas as pd
from zenml import step, pipeline

@step
def train_model(data: pd.DataFrame, learning_rate: float) -> object:
    """Step with both artifact and parameter inputs."""
    # data is an artifact (output from another step)
    # learning_rate is a parameter (literal value)
    # Note: create_model would be your own model creation function
    model = create_model(learning_rate)
    model.fit(data)
    return model

@pipeline
def training_pipeline():
    # data is an artifact
    data = create_data()
    # data is passed as an artifact, learning_rate as a parameter
    model = train_model(data=data, learning_rate=0.01)
Parameters are limited to JSON-serializable values (numbers, strings, lists, dictionaries, etc.). More complex objects should be passed as artifacts.
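For example, JSON-serializable containers such as dictionaries can be passed directly as parameters. A minimal sketch, reusing the hypothetical create_model factory from above:
from typing import Dict
import pandas as pd
from zenml import step, pipeline

@step
def tune_model(data: pd.DataFrame, hyperparams: Dict[str, float]) -> object:
    """'data' is an artifact input; 'hyperparams' is recorded as a run parameter."""
    # hyperparams is JSON-serializable, so it is stored with the run metadata,
    # not in the artifact store
    model = create_model(**hyperparams)  # create_model is your own factory function
    model.fit(data)
    return model

@pipeline
def tuning_pipeline():
    data = create_data()
    tune_model(data=data, hyperparams={"learning_rate": 0.01, "momentum": 0.9})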
Accessing Artifacts After Pipeline Runs
You can access artifacts from completed runs using the ZenML Client:
from zenml.client import Client
# Get a specific run
client = Client()
pipeline_run = client.get_pipeline_run("<PIPELINE_RUN_ID>")
# Get an artifact from a specific step
train_data = pipeline_run.steps["split_data"].outputs["train_data"].load()
# Use the artifact
print(train_data.shape)
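If you don't have a run ID at hand, you can also start from the pipeline itself and take its most recent run. A minimal sketch, assuming the simple_pipeline and create_data step defined earlier (and that create_data has a single output):
from zenml.client import Client

client = Client()

# Fetch the pipeline by name and grab its most recent run
last_run = client.get_pipeline("simple_pipeline").last_run

# Load the single output of the create_data step
data = last_run.steps["create_data"].output.load()
print(data.shape)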
Working with Artifact Types
Type Annotations
Type annotations are important when working with artifacts as they:
Help ZenML select the appropriate materializer for storage
Validate inputs and outputs at runtime
Document the data flow of your pipeline
from typing import Tuple
import numpy as np
import pandas as pd
from zenml import step

@step
def preprocess_data(df: pd.DataFrame) -> np.ndarray:
    """Type annotation tells ZenML this returns a numpy array."""
    return df.values

@step
def split_data(data: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Type annotation tells ZenML this returns a tuple of numpy arrays."""
    split_point = len(data) // 2
    return data[:split_point], data[split_point:]
ZenML supports many common data types out of the box:
Primitive types (int, float, str, bool)
Container types (dict, list, tuple), as in the sketch below
NumPy arrays
Pandas DataFrames
Many ML model formats (through integrations)
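For instance, a step can return a plain dictionary of metrics and ZenML will handle storage with its built-in materializers; a minimal sketch:
from typing import Dict
from zenml import step

@step
def compute_metrics(accuracy: float, loss: float) -> Dict[str, float]:
    """Built-in container types are handled by ZenML's default materializers."""
    return {"accuracy": accuracy, "loss": loss}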
Returning Multiple Outputs
Steps can return multiple artifacts using tuples:
from typing import Tuple, Annotated
import numpy as np
from zenml import step

@step
def split_data(
    data: np.ndarray,
    target: np.ndarray
) -> Tuple[
    Annotated[np.ndarray, "X_train"],
    Annotated[np.ndarray, "X_test"],
    Annotated[np.ndarray, "y_train"],
    Annotated[np.ndarray, "y_test"]
]:
    """Split data into training and testing sets."""
    # Implement split logic
    X_train, X_test = data[:80], data[80:]
    y_train, y_test = target[:80], target[80:]
    return X_train, X_test, y_train, y_test
ZenML differentiates between:
A step with multiple outputs: return a, b or return (a, b)
A step with a single tuple output: return some_tuple (both cases are sketched below)
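A minimal sketch of the distinction; whether the final return statement is a tuple literal is what determines how many artifacts are produced (the step names are illustrative):
from typing import Tuple
import numpy as np
from zenml import step

@step
def two_outputs(data: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """The return statement is a tuple literal, so ZenML stores two artifacts."""
    return data[:10], data[10:]

@step
def one_tuple_output(data: np.ndarray) -> Tuple[int, int]:
    """The return statement is a single variable, so ZenML stores one tuple artifact."""
    shape = (len(data), 1)
    return shape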
Naming Your Artifacts
By default, artifacts are named based on their position or variable name:
Single outputs are named output
Multiple outputs are named output_0, output_1, etc.
You can give your artifacts more meaningful names using the Annotated type:
from typing import Tuple, Annotated
import pandas as pd
from zenml import step

@step
def split_dataset(
    df: pd.DataFrame
) -> Tuple[
    Annotated[pd.DataFrame, "train_data"],
    Annotated[pd.DataFrame, "test_data"]
]:
    """Split a dataframe into training and testing sets."""
    train = df.sample(frac=0.8, random_state=42)
    test = df.drop(train.index)
    return train, test
You can even use dynamic naming with placeholders:
from typing import Annotated
import pandas as pd
from zenml import step, pipeline

@step
def extract_data(source: str) -> Annotated[pd.DataFrame, "{dataset_type}_data"]:
    """Extract data with a dynamically named output."""
    # Implementation...
    data = pd.DataFrame()  # Your data extraction logic here
    return data

@pipeline
def data_pipeline():
    # These will create artifacts named "train_data" and "test_data"
    train_df = extract_data.with_options(
        substitutions={"dataset_type": "train"}
    )(source="train_source")
    test_df = extract_data.with_options(
        substitutions={"dataset_type": "test"}
    )(source="test_source")
ZenML supports these placeholders:
{date}: Current date (e.g., "2023_06_15")
{time}: Current time (e.g., "14_30_45_123456")
Custom placeholders can be defined using substitutions (see the sketch below for combining them)
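For example, the built-in placeholders can be combined with a custom one; a minimal sketch (the step and artifact names are illustrative):
from typing import Annotated
import pandas as pd
from zenml import step

@step
def snapshot_data(source: str) -> Annotated[pd.DataFrame, "{dataset_type}_snapshot_{date}_{time}"]:
    """The artifact name resolves using the current date and time plus the custom dataset_type substitution."""
    return pd.DataFrame()  # Your data loading logic here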
How Artifacts Work Under the Hood
Materializers: How Data Gets Stored
Materializers are a key concept in ZenML's artifact system. They handle:
Serializing data when saving artifacts to storage
Deserializing data when loading artifacts from storage
Generating visualizations for the dashboard
Extracting metadata for tracking and searching
When a step produces an output, ZenML automatically selects the appropriate materializer based on the data type (using type annotations). ZenML includes built-in materializers for common data types like:
Primitive types (int, float, str, bool)
Container types (dict, list, tuple)
NumPy arrays, Pandas DataFrames, and many other ML-related formats (through integrations)
Here's how materializers work in practice:
from zenml import step
from sklearn.linear_model import LinearRegression

@step
def train_model(X_train, y_train) -> LinearRegression:
    """Train a model and return it as an artifact."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model  # ZenML uses a specific materializer for scikit-learn models
For custom data types, you can create your own materializers. See the Materializers guide for details.
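As a rough sketch of what a custom materializer can look like, assuming a toy custom MyData class (the exact base-class API may differ between ZenML versions, so treat the Materializers guide as the authoritative reference):
import json
import os
from typing import Type

from zenml.enums import ArtifactType
from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer

class MyData:
    """A toy custom data type used only for illustration."""
    def __init__(self, values: list):
        self.values = values

class MyDataMaterializer(BaseMaterializer):
    """Stores MyData objects as a JSON file inside the artifact URI."""
    ASSOCIATED_TYPES = (MyData,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def load(self, data_type: Type[MyData]) -> MyData:
        with fileio.open(os.path.join(self.uri, "data.json"), "r") as f:
            return MyData(json.load(f))

    def save(self, data: MyData) -> None:
        with fileio.open(os.path.join(self.uri, "data.json"), "w") as f:
            json.dump(data.values, f)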
Lineage and Caching
ZenML automatically tracks the complete lineage of each artifact:
Which step produced it
Which pipeline run it belongs to
Which other artifacts it depends on
Which steps have consumed it
This lineage tracking enables powerful caching capabilities. When you run a pipeline, ZenML checks if any steps have been run before with the same inputs, code, and configuration. If so, it reuses the cached outputs instead of rerunning the step:
@pipeline
def cached_pipeline():
    # If create_data has been run before with the same code and inputs,
    # the cached artifact will be used
    data = create_data()

    # If process_data has been run before with the same code and inputs
    # (including the exact same data artifact), the cached output will be used
    processed_data = process_data(data)
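Caching can also be switched off where it isn't wanted, for example in steps with external side effects; a minimal sketch using the enable_cache option on steps and pipelines:
from zenml import step, pipeline

@step(enable_cache=False)
def fetch_latest_data() -> dict:
    """Always re-executes, even if code and inputs are unchanged."""
    return {"rows": 100}  # e.g. pull fresh data from an external source

@pipeline(enable_cache=False)
def always_fresh_pipeline():
    # Disables caching for every step in this pipeline run
    fetch_latest_data()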
Advanced Artifact Usage
Accessing Artifacts from Previous Runs
You can access artifacts from any previous run by name or ID:
from zenml.client import Client
# Get a specific artifact version
artifact = Client().get_artifact_version("my_model", "1.0")
# Get the latest version of an artifact
latest_artifact = Client().get_artifact_version("my_model")
# Load it into memory
model = latest_artifact.load()
You can also access artifacts within steps:
from zenml.client import Client
from zenml import step
@step
def evaluate_against_previous(model, X_test, y_test) -> float:
    """Compare current model with the previous best model."""
    client = Client()

    # Get the previous best model and load it into memory
    best_model = client.get_artifact_version("best_model").load()

    # Use it for comparison
    previous_accuracy = best_model.score(X_test, y_test)
    current_accuracy = model.score(X_test, y_test)

    return current_accuracy - previous_accuracy
Cross-Pipeline Artifact Usage
You can use artifacts produced by one pipeline in another pipeline:
import pandas as pd
from zenml.client import Client
from zenml import step, pipeline

@step
def use_trained_model(data: pd.DataFrame, model) -> pd.Series:
    """Use a model loaded from a previous pipeline run."""
    return pd.Series(model.predict(data))

@pipeline
def inference_pipeline():
    # Load data (load_data would be another step defined in your codebase)
    data = load_data()

    # Get the latest model from another pipeline
    model = Client().get_artifact_version("trained_model")

    # Use it for predictions
    predictions = use_trained_model(data=data, model=model)
This allows you to build modular pipelines that can work together as part of a larger ML system.
Visualizing Artifacts
ZenML automatically generates visualizations for many types of artifacts, viewable in the dashboard:
# You can also view visualizations in notebooks
from zenml.client import Client
artifact = Client().get_artifact_version("<ARTIFACT_NAME>")
artifact.visualize()
For detailed information on visualizations, see Visualizations.
Managing Artifacts
Individual artifacts cannot be deleted directly (to prevent broken references). However, you can clean up unused artifacts:
zenml artifact prune
This deletes artifacts that are no longer referenced by any pipeline run. You can control this behavior with flags:
--only-artifact: Only delete the physical files, keep database entries
--only-metadata: Only delete database entries, keep files
--ignore-errors: Continue pruning even if some artifacts can't be deleted (an example invocation follows)
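For example, to remove only the stored files while keeping the database records:
zenml artifact prune --only-artifact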
Registering Existing Data as Artifacts
Sometimes, you may have data created externally (outside of ZenML pipelines) that you want to use within your ZenML workflows. Instead of reading and materializing this data within a step, you can register existing files or folders as ZenML artifacts directly.
Register an Existing Folder
To register a folder as a ZenML artifact:
from zenml.client import Client
from zenml import register_artifact
import os
from pathlib import Path
# Path to an existing folder in your artifact store
prefix = Client().active_stack.artifact_store.path
existing_folder = os.path.join(prefix, "my_folder")
# Register it as a ZenML artifact
register_artifact(
    folder_or_file_uri=existing_folder,
    name="my_folder_artifact"
)
# Later, load the artifact
folder_path = Client().get_artifact_version("my_folder_artifact").load()
assert isinstance(folder_path, Path)
assert os.path.isdir(folder_path)
Register an Existing File
Similarly, you can register individual files:
from zenml.client import Client
from zenml import register_artifact
import os
from pathlib import Path
# Path to an existing file in your artifact store
prefix = Client().active_stack.artifact_store.path
existing_file = os.path.join(prefix, "my_folder/model.pkl")
# Register it as a ZenML artifact
register_artifact(
    folder_or_file_uri=existing_file,
    name="my_model_artifact"
)
# Later, load the artifact
file_path = Client().get_artifact_version("my_model_artifact").load()
assert isinstance(file_path, Path)
assert not os.path.isdir(file_path)
This approach is particularly useful for:
Integrating with external ML frameworks that save their own data
Working with pre-existing datasets
Registering model checkpoints created during training
When you load these artifacts, you'll receive a pathlib.Path pointing to a temporary location in your executing environment, ready for use as a normal local path.
Register Framework Checkpoints
A common use case is registering model checkpoints from training frameworks like PyTorch Lightning:
import os
from uuid import uuid4
from zenml.client import Client
from zenml import register_artifact
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Define checkpoint location in your artifact store
prefix = Client().active_stack.artifact_store.path
checkpoint_dir = os.path.join(prefix, uuid4().hex)

# Configure PyTorch Lightning trainer with checkpointing
model = YourLightningModel()  # YourLightningModel is your own LightningModule subclass
trainer = Trainer(
    default_root_dir=checkpoint_dir,
    callbacks=[
        ModelCheckpoint(
            every_n_epochs=1,
            save_top_k=-1,  # Keep all checkpoints
            filename="checkpoint-{epoch:02d}"
        )
    ],
)

# Train the model
trainer.fit(model)

# Register all checkpoints as a ZenML artifact
register_artifact(
    folder_or_file_uri=checkpoint_dir,
    name="lightning_checkpoints"
)

# Later, you can load the checkpoint folder
checkpoint_path = Client().get_artifact_version("lightning_checkpoints").load()
You can also extend the ModelCheckpoint callback to register each checkpoint as a separate artifact version during training. This approach enables better version control of intermediate checkpoints.
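A rough sketch of that idea, assuming a custom callback that re-registers the checkpoint directory after every training epoch (the hook used and the registration granularity are illustrative, and Lightning's callback API can vary between versions):
import os
from pytorch_lightning.callbacks import ModelCheckpoint
from zenml import register_artifact

class ZenMLModelCheckpoint(ModelCheckpoint):
    """Illustrative callback: registers a new artifact version after each epoch."""

    def on_train_epoch_end(self, trainer, pl_module):
        super().on_train_epoch_end(trainer, pl_module)
        # self.dirpath is where ModelCheckpoint writes its files; registering the
        # same artifact name again creates a new version of it
        if self.dirpath and os.path.isdir(self.dirpath):
            register_artifact(
                folder_or_file_uri=self.dirpath,
                name="lightning_checkpoints",
            )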
Conclusion
Artifacts are a central part of ZenML's approach to ML pipelines. They provide:
Automatic versioning and lineage tracking
Efficient storage and caching
Type-safe data handling
Visualization capabilities
Cross-pipeline data sharing
By understanding how artifacts work, you can build more effective, maintainable, and reproducible ML pipelines.
For more information on specific aspects of artifacts, see:
Materializers: Creating custom serializers for your data types
Visualizations: Customizing artifact visualizations