Manage artifacts
Understand and adjust how ZenML versions your data.
This is an older version of the ZenML documentation. To read and view the latest version please visit this up-to-date URL.
Data artifact management with ZenML
Data sits at the heart of every machine learning workflow. Managing and versioning this data correctly is essential for reproducibility and traceability within your ML pipelines. ZenML takes a proactive approach to data versioning, ensuring that every artifact—be it data, models, or evaluations—is automatically tracked and versioned upon pipeline execution.
This guide will delve into artifact versioning and management, showing you how to efficiently name, organize, and utilize your data with the ZenML framework.
Managing artifacts produced by ZenML pipelines
Artifacts, the outputs of your steps and pipelines, are automatically versioned and stored in the artifact store. Configuring these artifacts is pivotal for transparent and efficient pipeline development.
Giving names to your artifacts
Assigning custom names to your artifacts can greatly enhance their discoverability and manageability. As a best practice, use the Annotated object within your steps to give precise, human-readable names to outputs:
Unspecified artifact outputs default to a naming pattern of {pipeline_name}::{step_name}::output. For visual exploration in the ZenML dashboard, it's best practice to give significant outputs clear custom names.
Artifacts named iris_dataset can then be found swiftly using various ZenML interfaces:
To list artifacts: zenml artifacts list
Versioning artifacts manually
ZenML automatically versions all created artifacts using auto-incremented numbering: if you have defined a step creating an artifact named iris_dataset as shown above, the first execution of the step will create an artifact with this name and version "1", the second execution will create version "2", and so on.
While ZenML handles artifact versioning automatically, you have the option to specify custom versions using the ArtifactConfig. This may come into play during critical runs like production releases.
The next execution of this step will then create an artifact with the name iris_dataset and version raw_2023. This is primarily useful if you are making a particularly important pipeline run (such as a release) whose artifacts you want to distinguish at a glance later.
Since custom versions cannot be duplicated, the above step can only be run once successfully. To avoid altering your code frequently, consider using a YAML config for artifact versioning.
After execution, iris_dataset and its version raw_2023 can be seen using:
To list versions: zenml artifacts versions list
Consuming external artifacts within a pipeline
While most pipelines start with a step that produces an artifact, it is often the case that you want to consume artifacts external to the pipeline. The ExternalArtifact class can be used to initialize an artifact within ZenML with any arbitrary data type.
For example, let's say we have a Snowflake query that produces a dataframe, or a CSV file that we need to read. External artifacts can be used for this, to pass values to steps that are neither JSON serializable nor produced by an upstream step:
Optionally, you can configure the ExternalArtifact to use a custom materializer for your data or disable artifact metadata and visualizations. Check out the SDK docs for all available options.
Consuming artifacts produced by other pipelines
It is also common to consume an artifact downstream after producing it in an upstream pipeline or step. As we have learned in the previous section, the Client can be used to fetch artifacts directly. However, in ZenML the best practice is not to use the Client for this use case, but rather to use the ExternalArtifact to pass existing artifacts from other pipeline runs into your steps. This is a more convenient interface:
Using an ExternalArtifact as input data for your step automatically disables caching for that step.
Managing artifacts not produced by ZenML pipelines
Sometimes, artifacts can be produced completely outside of ZenML. A good example of this is the predictions produced by a deployed model.
You can also load any artifact stored within ZenML using the load_artifact method:
load_artifact is simply shorthand for the following Client call:
Even if an artifact is created externally, it can be treated like any other artifact produced by ZenML steps - with all the functionalities described above!
It is also possible to use these functions inside your ZenML steps. However, it is usually cleaner to return the artifacts as outputs of your step to save them, or to use External Artifacts to load them instead.
Logging metadata for an artifact
One of the most useful ways of interacting with artifacts in ZenML is the ability to associate metadata with them. As mentioned before, artifact metadata is an arbitrary dictionary of key-value pairs that are useful for understanding the nature of the data.
As an example, one can associate the results of a model training alongside a model artifact, the shape of a table alongside a pandas dataframe, or the size of an image alongside a PNG file.
For some artifacts, ZenML automatically logs metadata. As an example, for pandas.Series and pandas.DataFrame objects, ZenML logs the shape and size of the objects:
A user can also add metadata to an artifact within a step directly using the log_artifact_metadata method:
For further depth, there is an advanced metadata logging guide that goes more into detail about logging metadata in ZenML.
Additionally, there is a lot more to learn about artifacts within ZenML. Please read the dedicated data management guide for more information.
Code example
This section combines all the code from this guide into one simple script that you can run easily: