Materializers

Understanding and creating materializers to handle custom data types in ZenML pipelines

Materializers are a core concept in ZenML that enable the serialization, storage, and retrieval of artifacts in your ML pipelines. This guide explains how materializers work and how to create custom materializers for your specific data types.

What Are Materializers?

A materializer is a class that defines how a particular data type is:

  • Serialized: Converted from Python objects to a storable format

  • Saved: Written to the artifact store

  • Loaded: Read from the artifact store

  • Deserialized: Converted back to Python objects

  • Visualized: Displayed in the ZenML dashboard

  • Analyzed: Inspected to extract metadata for tracking and search

Materializers act as the bridge between your Python code and the underlying storage system, ensuring that any artifact can be saved, loaded, and visualized correctly, regardless of the data type.

Built-In Materializers

ZenML includes built-in materializers for many common data types:

Core Materializers

| Handled Data Types | Storage Format |
|---|---|
| bool, float, int, str, None | .json |
| dict, list, set, tuple | Directory |
| np.ndarray | .npy |
| pd.DataFrame, pd.Series | .csv (or .gzip if parquet is installed) |
| pydantic.BaseModel | .json |
| zenml.services.service.BaseService | .json |
| zenml.types.CSVString, zenml.types.HTMLString, zenml.types.MarkdownString | .csv / .html / .md (depending on type) |

ZenML also provides a CloudpickleMaterializer that can handle any object by saving it with cloudpickle. However, this is not production-ready because the resulting artifacts cannot be loaded when running with a different Python version. For production use, you should implement a custom materializer for your specific data types.

Integration-Specific Materializers

When you install ZenML integrations, additional materializers become available:

| Integration | Handled Data Types | Storage Format |
|---|---|---|
| bentoml | bentoml.Bento | .bento |
| deepchecks | deepchecks.CheckResult, deepchecks.SuiteResult | .json |
| evidently | evidently.Profile | .json |
| great_expectations | great_expectations.ExpectationSuite, great_expectations.CheckpointResult | .json |
| huggingface | datasets.Dataset, datasets.DatasetDict | Directory |
| huggingface | transformers.PreTrainedModel | Directory |
| huggingface | transformers.TFPreTrainedModel | Directory |
| huggingface | transformers.PreTrainedTokenizerBase | Directory |
| lightgbm | lgbm.Booster | .txt |
| lightgbm | lgbm.Dataset | .binary |
| neural_prophet | NeuralProphet | .pt |
| pillow | Pillow.Image | .PNG |
| polars | pl.DataFrame, pl.Series | .parquet |
| pycaret | Any sklearn, xgboost, lightgbm or catboost model | .pkl |
| pytorch | torch.Dataset, torch.DataLoader | .pt |
| pytorch | torch.Module | .pt |
| scipy | scipy.spmatrix | .npz |
| spark | pyspark.DataFrame | .parquet |
| spark | pyspark.Transformer, pyspark.Estimator | |
| tensorflow | tf.keras.Model | Directory |
| tensorflow | tf.Dataset | Directory |
| whylogs | whylogs.DatasetProfileView | .pb |
| xgboost | xgb.Booster | .json |
| xgboost | xgb.DMatrix | .binary |
| jax | jax.Array | .npy |
| mlx | mlx.core.array | .npy |

Note: When using Docker-based orchestrators, you must specify the appropriate integrations in your DockerSettings to ensure the materializers are available inside the container.
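
For example, a minimal sketch of the corresponding settings, assuming a pipeline that relies on the huggingface materializers:

```python
from zenml import pipeline
from zenml.config import DockerSettings

# Ensure the huggingface integration (and its materializers) is installed
# inside the container image built for this pipeline.
docker_settings = DockerSettings(required_integrations=["huggingface"])


@pipeline(settings={"docker": docker_settings})
def my_pipeline():
    ...
```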

Creating Custom Materializers

When working with custom data types, you'll need to create materializers to handle them. Here's how:

1. Define Your Materializer Class

Create a new class that inherits from BaseMaterializer:
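
A minimal sketch, using a hypothetical MyObj class (the class, the materializer name, and the data.txt file name are illustrative):

```python
import os
from typing import Type

from zenml.enums import ArtifactType
from zenml.materializers.base_materializer import BaseMaterializer


class MyObj:
    """Hypothetical custom data type."""

    def __init__(self, name: str):
        self.name = name


class MyMaterializer(BaseMaterializer):
    # Tell ZenML which data types this materializer handles and how the
    # resulting artifact should be categorized.
    ASSOCIATED_TYPES = (MyObj,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def load(self, data_type: Type[MyObj]) -> MyObj:
        """Read the artifact back from the artifact store."""
        with self.artifact_store.open(os.path.join(self.uri, "data.txt"), "r") as f:
            return MyObj(name=f.read())

    def save(self, my_obj: MyObj) -> None:
        """Write the artifact to the artifact store."""
        with self.artifact_store.open(os.path.join(self.uri, "data.txt"), "w") as f:
            f.write(my_obj.name)
```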

2. Using Your Custom Materializer

Once you've defined the materializer, you can use it in your pipeline:
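
Continuing the MyObj / MyMaterializer sketch from step 1, the materializer can be assigned to a step's output via output_materializers:

```python
from zenml import pipeline, step


@step(output_materializers=MyMaterializer)
def create_my_obj() -> MyObj:
    return MyObj(name="my_object")


@step
def consume_my_obj(my_obj: MyObj) -> str:
    # The artifact is loaded back through MyMaterializer.load() here.
    return my_obj.name


@pipeline
def my_pipeline():
    consume_my_obj(create_my_obj())


if __name__ == "__main__":
    my_pipeline()
```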

3. Multiple Outputs with Different Materializers

When a step has multiple outputs that need different materializers:
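
You can pass a dictionary that maps output names to materializers. A sketch, continuing the example above and using ZenML's built-in PandasMaterializer (the import path may vary between ZenML versions):

```python
from typing import Tuple

import pandas as pd
from typing_extensions import Annotated

from zenml import step
from zenml.materializers.pandas_materializer import PandasMaterializer


@step(
    output_materializers={"my_obj": MyMaterializer, "my_frame": PandasMaterializer}
)
def produce_two_outputs() -> Tuple[
    Annotated[MyObj, "my_obj"],
    Annotated[pd.DataFrame, "my_frame"],
]:
    # Each named output is stored with its own materializer.
    return MyObj(name="example"), pd.DataFrame({"a": [1, 2, 3]})
```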

4. Registering a Materializer Globally

You can register a materializer globally to override the default materializer for a specific type:
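
A sketch, again with the hypothetical MyObj and MyMaterializer:

```python
from zenml.materializers.materializer_registry import materializer_registry

# From now on, MyMaterializer is used whenever a step returns a MyObj,
# overriding whatever materializer ZenML would otherwise select for this type.
materializer_registry.register_and_overwrite_type(key=MyObj, type_=MyMaterializer)
```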

Materializer Implementation Details

When implementing a custom materializer, consider these aspects:

Handling Storage

The self.uri property contains the path to the directory where your artifact should be stored. Use this path to create files or subdirectories for your data.

When reading or writing files, always use self.artifact_store.open() rather than direct file I/O to ensure compatibility with different artifact stores (local filesystem, cloud storage, etc.).
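
A sketch of what this looks like inside a materializer, continuing the MyObj example (the data.json file name is illustrative; ASSOCIATED_TYPES and the other methods are as in the earlier sketch):

```python
import json
import os
from typing import Type

from zenml.materializers.base_materializer import BaseMaterializer


class JSONBackedMaterializer(BaseMaterializer):
    # ... ASSOCIATED_TYPES / ASSOCIATED_ARTIFACT_TYPE as sketched earlier ...

    def save(self, my_obj: "MyObj") -> None:
        # self.uri is this artifact's directory in the artifact store.
        path = os.path.join(self.uri, "data.json")
        # artifact_store.open() works the same way for local disk, S3, GCS, ...
        with self.artifact_store.open(path, "w") as f:
            json.dump({"name": my_obj.name}, f)

    def load(self, data_type: Type["MyObj"]) -> "MyObj":
        path = os.path.join(self.uri, "data.json")
        with self.artifact_store.open(path, "r") as f:
            return data_type(**json.load(f))
```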

Visualization Support

The save_visualizations() method allows you to create visualizations that will be shown in the ZenML dashboard. You can return multiple visualizations of different types (a sketch follows the list below):

  • VisualizationType.HTML: Embedded HTML content

  • VisualizationType.MARKDOWN: Markdown content

  • VisualizationType.IMAGE: Image files

  • VisualizationType.CSV: CSV tables
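
A minimal sketch, continuing the MyObj example, that writes an HTML preview next to the artifact and returns it keyed by its storage path:

```python
import os
from typing import Dict

from zenml.enums import VisualizationType
from zenml.materializers.base_materializer import BaseMaterializer


class MyMaterializer(BaseMaterializer):
    # ... load()/save() as sketched earlier ...

    def save_visualizations(self, my_obj: "MyObj") -> Dict[str, VisualizationType]:
        # Each entry maps a file written to the artifact store to how the
        # dashboard should render it.
        html_path = os.path.join(self.uri, "preview.html")
        with self.artifact_store.open(html_path, "w") as f:
            f.write(f"<h1>{my_obj.name}</h1>")
        return {html_path: VisualizationType.HTML}
```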

Configuring Visualizations

Some materializers support configuration via environment variables to customize their visualization behavior. For example:

  • ZENML_PANDAS_SAMPLE_ROWS: Controls the number of rows shown in sample visualizations created by the PandasMaterializer. Default is 10 rows.

Metadata Extraction

The extract_metadata() method allows you to extract key information about your artifact for indexing and searching. This metadata will be displayed alongside the artifact in the dashboard.
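
A sketch, again for the MyObj example (the metadata keys are illustrative):

```python
from typing import Dict

from zenml.materializers.base_materializer import BaseMaterializer
from zenml.metadata.metadata_types import MetadataType


class MyMaterializer(BaseMaterializer):
    # ... load()/save() as sketched earlier ...

    def extract_metadata(self, my_obj: "MyObj") -> Dict[str, MetadataType]:
        # These key/value pairs are stored with the artifact and shown in the
        # dashboard, where they can be used for search and filtering.
        return {"name": my_obj.name, "name_length": len(my_obj.name)}
```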

Temporary Files

If you need a temporary directory while processing artifacts, use the get_temporary_directory() helper:
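
A sketch of how this might look inside save(); the delete_at_exit argument and the local-only dump() call are assumptions for illustration:

```python
import os

from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer


class MyModelMaterializer(BaseMaterializer):
    # ... other methods omitted ...

    def save(self, model: "SomeLocalOnlyModel") -> None:
        # Stage files in a local temp directory first, then copy them into
        # the artifact store.
        with self.get_temporary_directory(delete_at_exit=True) as temp_dir:
            local_path = os.path.join(temp_dir, "model.bin")
            model.dump(local_path)  # hypothetical call that needs a local path
            fileio.copy(local_path, os.path.join(self.uri, "model.bin"))
```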

Example: A Complete Materializer

Here's a complete example of a custom materializer for a simple class:
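
One possible version, for a hypothetical SentimentResult class (all names are illustrative):

```python
import json
import os
from typing import Dict, Type

from zenml import pipeline, step
from zenml.enums import ArtifactType, VisualizationType
from zenml.materializers.base_materializer import BaseMaterializer
from zenml.metadata.metadata_types import MetadataType


class SentimentResult:
    """Hypothetical custom data type."""

    def __init__(self, score: float, label: str):
        self.score = score
        self.label = label


class SentimentResultMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (SentimentResult,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def save(self, result: SentimentResult) -> None:
        # Serialize to a structured format rather than pickle.
        with self.artifact_store.open(os.path.join(self.uri, "result.json"), "w") as f:
            json.dump({"score": result.score, "label": result.label}, f)

    def load(self, data_type: Type[SentimentResult]) -> SentimentResult:
        with self.artifact_store.open(os.path.join(self.uri, "result.json"), "r") as f:
            data = json.load(f)
        return data_type(score=data["score"], label=data["label"])

    def save_visualizations(self, result: SentimentResult) -> Dict[str, VisualizationType]:
        # A small Markdown preview for the dashboard.
        path = os.path.join(self.uri, "result.md")
        with self.artifact_store.open(path, "w") as f:
            f.write(f"**{result.label}** (score: {result.score:.2f})")
        return {path: VisualizationType.MARKDOWN}

    def extract_metadata(self, result: SentimentResult) -> Dict[str, MetadataType]:
        return {"score": result.score, "label": result.label}


@step(output_materializers=SentimentResultMaterializer)
def analyze() -> SentimentResult:
    return SentimentResult(score=0.92, label="positive")


@pipeline
def sentiment_pipeline():
    analyze()


if __name__ == "__main__":
    sentiment_pipeline()
```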

Unmaterialized artifacts

Whenever you pass artifacts as outputs from one pipeline step to other steps as inputs, the corresponding materializer for the respective data type defines how this artifact is first serialized and written to the artifact store, and then deserialized and read in the next step. However, there are instances where you might not want to materialize an artifact in a step, but rather use a reference to it instead. This is where skipping materialization comes in.

How to skip materialization

While materializers should in most cases be used to control how artifacts are returned and consumed from pipeline steps, you might sometimes need a completely unmaterialized artifact in a step, e.g., if you need to know the exact path where your artifact is stored.

An unmaterialized artifact is a zenml.materializers.UnmaterializedArtifact. Among other things, it has a uri property that points to the unique path in the artifact store where the artifact is persisted. You can use an unmaterialized artifact by specifying UnmaterializedArtifact as the type annotation in the step:
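
For example (the import path follows the reference above and may differ between ZenML versions):

```python
from zenml import step
from zenml.materializers import UnmaterializedArtifact  # path may vary by version


@step
def print_artifact_uri(artifact: UnmaterializedArtifact) -> None:
    # No materializer runs here; the step only receives a reference.
    print(f"The artifact is stored at: {artifact.uri}")
```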

The following shows an example of how unmaterialized artifacts can be used in the steps of a pipeline; a sketch of the pipeline appears after the next paragraph.

s1 and s2 produce identical artifacts; however, s3 consumes materialized artifacts while s4 consumes unmaterialized artifacts. s4 can therefore use the dict_.uri and list_.uri paths directly rather than their materialized counterparts.
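
A sketch of such a pipeline, assuming s1 feeds s3 and s2 feeds s4 (step and output names are illustrative):

```python
from typing import Dict, List, Tuple

from typing_extensions import Annotated

from zenml import pipeline, step
from zenml.materializers import UnmaterializedArtifact  # path may vary by version


@step
def s1() -> Tuple[Annotated[Dict[str, str], "dict_"], Annotated[List[str], "list_"]]:
    return {"some": "data"}, []


@step
def s2() -> Tuple[Annotated[Dict[str, str], "dict_"], Annotated[List[str], "list_"]]:
    return {"some": "data"}, []


@step
def s3(dict_: Dict[str, str], list_: List[str]) -> None:
    # Materialized: the actual Python objects are loaded from the artifact store.
    assert isinstance(dict_, dict)
    assert isinstance(list_, list)


@step
def s4(dict_: UnmaterializedArtifact, list_: UnmaterializedArtifact) -> None:
    # Unmaterialized: only references are passed; work with the storage paths.
    print(dict_.uri)
    print(list_.uri)


@pipeline
def example_pipeline():
    s3(*s1())
    s4(*s2())


if __name__ == "__main__":
    example_pipeline()
```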

You can see another example of using an UnmaterializedArtifact when triggering one pipeline from another.

Best Practices

When working with materializers:

  1. Prefer structured formats over pickle or other binary formats for better cross-environment compatibility.

  2. Test your materializer with different artifact stores (local, S3, etc.) to ensure it works consistently.

  3. Consider versioning if your data structure might change over time.

  4. Create visualizations to help users understand your artifacts in the dashboard.

  5. Extract useful metadata to make artifacts easier to find and understand.

  6. Be explicit about materializer assignments for clarity, even if ZenML can detect them automatically.

  7. Avoid using the CloudpickleMaterializer in production as it's not reliable across different Python versions.

Conclusion

Materializers are a powerful part of ZenML's artifact system, enabling proper storage and handling of any data type. By creating custom materializers for your specific data structures, you ensure that your ML pipelines are robust, efficient, and can handle any data type required by your workflows.
