# Materializers

Materializers are a core concept in ZenML that enable the serialization, storage, and retrieval of artifacts in your ML pipelines. This guide explains how materializers work and how to create custom materializers for your specific data types.

## What Are Materializers?

A materializer is a class that defines how a particular data type is:

* **Serialized**: Converted from Python objects to a storable format
* **Saved**: Written to the artifact store
* **Loaded**: Read from the artifact store
* **Deserialized**: Converted back to Python objects
* **Visualized**: Displayed in the ZenML dashboard
* **Analyzed**: Metadata extraction for tracking and search

Materializers act as the bridge between your Python code and the underlying storage system, ensuring that any artifact can be saved, loaded, and visualized correctly, regardless of the data type.

## Built-In Materializers

ZenML includes built-in materializers for many common data types:

### Core Materializers

| Materializer | Handled Data Types | Storage Format |
| --- | --- | --- |
| BuiltInMaterializer | `bool`, `float`, `int`, `str`, `None` | `.json` |
| BytesMaterializer | `bytes` | `.txt` |
| BuiltInContainerMaterializer | `dict`, `list`, `set`, `tuple` | Directory |
| NumpyMaterializer | `np.ndarray` | `.npy` |
| PandasMaterializer | `pd.DataFrame`, `pd.Series` | `.csv` (or `.gzip` if `parquet` is installed) |
| PydanticMaterializer | `pydantic.BaseModel` | `.json` |
| DataclassMaterializer | JSON-serializable Python `dataclass` types | `.json` |
| ServiceMaterializer | `zenml.services.service.BaseService` | `.json` |
| StructuredStringMaterializer | `zenml.types.CSVString`, `zenml.types.HTMLString`, `zenml.types.MarkdownString` | `.csv` / `.html` / `.md` (depending on type) |
| PathMaterializer | `pathlib.Path` | `.tar.gz` (directories) or direct copy (files) |

ZenML also provides a `CloudpickleMaterializer` that can handle any object by saving it with [cloudpickle](https://github.com/cloudpipe/cloudpickle). However, this is not production-ready because the resulting artifacts cannot be loaded when running with a different Python version. For production use, you should implement a custom materializer for your specific data types.

{% hint style="info" %}
Pydantic artifacts created by current ZenML versions are stored in `data_v2.json`. ZenML can still load older Pydantic artifacts stored as `data.json` by ZenML `<= 0.94.2`, so existing runs remain readable after an upgrade.
{% endhint %}

### Dataclass artifacts

The `DataclassMaterializer` handles JSON-serializable Python dataclasses without requiring you to write a custom materializer.

```python
from dataclasses import dataclass

from zenml import step


@dataclass
class TrainingConfig:
    learning_rate: float
    epochs: int


@step
def make_config() -> TrainingConfig:
    return TrainingConfig(learning_rate=0.01, epochs=10)
```

This works for dataclasses that Pydantic can serialize to JSON. If your dataclass contains objects such as open file handles, live model objects, database connections, or other arbitrary Python objects, use a custom materializer instead.

### Passing Files and Directories Between Steps

The `PathMaterializer` lets you pass `pathlib.Path` objects between steps. This is especially useful when working with files or directories that need to be shared across steps, for example dataset directories, exported model files, or any file-based artifacts.

When a step returns a `Path`:

* **Directories** are compressed into a `.tar.gz` archive and uploaded to the artifact store
* **Single files** are copied directly to the artifact store

When a downstream step receives the `Path`, the materializer downloads the contents to a local temporary directory and returns a `Path` pointing to it.
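The directory round trip described above can be pictured with the standard library alone. The sketch below is not ZenML's `PathMaterializer` implementation: `pack_directory` and `unpack_directory` are hypothetical helper names, the archive name `data.tar.gz` is an assumption, and a local folder stands in for the artifact store.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(source_dir: Path, store_dir: Path) -> Path:
    """Compress source_dir into a .tar.gz archive inside store_dir."""
    archive_path = store_dir / "data.tar.gz"  # hypothetical archive name
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname="." stores member paths relative to the directory root
        archive.add(source_dir, arcname=".")
    return archive_path


def unpack_directory(archive_path: Path) -> Path:
    """Extract the archive into a fresh temporary directory and return it."""
    target_dir = Path(tempfile.mkdtemp())
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
    return target_dir


# Simulate an upstream step that produces a directory
workspace = Path(tempfile.mkdtemp())
dataset_dir = workspace / "training_data"
dataset_dir.mkdir()
(dataset_dir / "features.csv").write_text("feature1,feature2\n1.0,2.0\n")

# A local folder stands in for the artifact store in this sketch
store_dir = workspace / "artifact_store"
store_dir.mkdir()

# Simulate the downstream step receiving a new local path with the same contents
restored_dir = unpack_directory(pack_directory(dataset_dir, store_dir))
restored_text = (restored_dir / "features.csv").read_text()
```

The downstream side works on a different local path than the one the upstream side wrote to, which is why steps should treat the received `Path` as a fresh copy rather than the original location. In ZenML itself, the same flow is expressed with ordinary steps: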
```python
from pathlib import Path
from typing import Annotated

from zenml import step, pipeline


@step
def prepare_dataset(num_samples: int = 100) -> Annotated[Path, "dataset_dir"]:
    """Prepare a dataset directory with training files."""
    output_dir = Path("training_data")
    output_dir.mkdir(exist_ok=True)

    # Write training files into the directory
    (output_dir / "features.csv").write_text("feature1,feature2\n1.0,2.0\n")
    (output_dir / "labels.csv").write_text("label\n1\n")

    # ZenML will tar.gz this directory and upload it to the artifact store
    return output_dir


@step
def train_model(dataset_dir: Path) -> None:
    """Train a model using the dataset directory."""
    # dataset_dir points to a local temp directory with the extracted contents
    features = (dataset_dir / "features.csv").read_text()
    labels = (dataset_dir / "labels.csv").read_text()
    print(f"Training with features: {features}")


@pipeline
def training_pipeline():
    dataset = prepare_dataset()
    train_model(dataset)
```

This works transparently with remote orchestrators (Kubernetes, Vertex AI, etc.) where each step runs on a different pod: the artifact store acts as the shared transport layer.

{% hint style="info" %}
If you prefer the previous behavior where `Path` objects were serialized with `cloudpickle` (which only preserves the path string, not the file contents), you can disable the `PathMaterializer` by setting the environment variable `ZENML_DISABLE_PATH_MATERIALIZER=true`.
{% endhint %}

### Integration-Specific Materializers

When you install ZenML integrations, additional materializers become available:

| Integration | Materializer | Handled Data Types | Storage Format |
| --- | --- | --- | --- |
| bentoml | BentoMaterializer | `bentoml.Bento` | `.bento` |
| deepchecks | DeepchecksResultMaterializer | `deepchecks.CheckResult`, `deepchecks.SuiteResult` | `.json` |
| evidently | EvidentlyProfileMaterializer | `evidently.Profile` | `.json` |
| great_expectations | GreatExpectationsMaterializer | `great_expectations.ExpectationSuite`, `great_expectations.CheckpointResult` | `.json` |
| huggingface | HFDatasetMaterializer | `datasets.Dataset`, `datasets.DatasetDict` | Directory |
| huggingface | HFPTModelMaterializer | `transformers.PreTrainedModel` | Directory |
| huggingface | HFTFModelMaterializer | `transformers.TFPreTrainedModel` | Directory |
| huggingface | HFTokenizerMaterializer | `transformers.PreTrainedTokenizerBase` | Directory |
| lightgbm | LightGBMBoosterMaterializer | `lgbm.Booster` | `.txt` |
| lightgbm | LightGBMDatasetMaterializer | `lgbm.Dataset` | `.binary` |
| neural_prophet | NeuralProphetMaterializer | `NeuralProphet` | `.pt` |
| pillow | PillowImageMaterializer | `Pillow.Image` | `.PNG` |
| polars | PolarsMaterializer | `pl.DataFrame`, `pl.Series` | `.parquet` |
| pycaret | PyCaretMaterializer | Any `sklearn`, `xgboost`, `lightgbm` or `catboost` model | `.pkl` |
| pytorch | PyTorchDataLoaderMaterializer | `torch.Dataset`, `torch.DataLoader` | `.pt` |
| pytorch | PyTorchModuleMaterializer | `torch.Module` | `.pt` |
| scipy | SparseMaterializer | `scipy.spmatrix` | `.npz` |
| spark | SparkDataFrameMaterializer | `pyspark.DataFrame` | `.parquet` |
| spark | SparkModelMaterializer | `pyspark.Transformer`, `pyspark.Estimator` | |
| tensorflow | KerasMaterializer | `tf.keras.Model` | Directory |
| tensorflow | TensorflowDatasetMaterializer | `tf.Dataset` | Directory |
| whylogs | WhylogsMaterializer | `whylogs.DatasetProfileView` | `.pb` |
| xgboost | XgboostBoosterMaterializer | `xgb.Booster` | `.json` |
| xgboost | XgboostDMatrixMaterializer | `xgb.DMatrix` | `.binary` |
| jax | JAXArrayMaterializer | `jax.Array` | `.npy` |
| mlx | MLXArrayMaterializer | `mlx.core.array` | `.npy` |

> **Note**: When using Docker-based orchestrators, you must specify the appropriate integrations in your `DockerSettings` to ensure the materializers are available inside the container.

## Creating Custom Materializers

When working with custom data types, you'll need to create materializers to handle them. Here's how:

### 1. Define Your Materializer Class

Create a new class that inherits from `BaseMaterializer`:

```python
import os
import json
from typing import Type, Any, Dict

from zenml.materializers.base_materializer import BaseMaterializer
from zenml.enums import ArtifactType, VisualizationType
from zenml.metadata.metadata_types import MetadataType

# Assume MyClass is your custom class defined elsewhere
# from mymodule import MyClass


class MyClassMaterializer(BaseMaterializer):
    """Materializer for MyClass objects."""

    # List the data types this materializer can handle
    ASSOCIATED_TYPES = (MyClass,)

    # Define what type of artifact this is (usually DATA or MODEL)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def load(self, data_type: Type[Any]) -> MyClass:
        """Load MyClass from storage."""
        # Read the JSON file written by save()
        filepath = os.path.join(self.uri, "data.json")
        with self.artifact_store.open(filepath, "r") as f:
            data = json.load(f)

        # Create and return an instance of MyClass
        return MyClass(**data)

    def save(self, data: MyClass) -> None:
        """Save MyClass to storage."""
        # Write the object as JSON inside the artifact directory
        filepath = os.path.join(self.uri, "data.json")
        with self.artifact_store.open(filepath, "w") as f:
            json.dump(data.to_dict(), f)

    def save_visualizations(self, data: MyClass) -> Dict[str, VisualizationType]:
        """Generate visualizations for the dashboard."""
        # Optional - generate visualizations
        vis_path = os.path.join(self.uri, "visualization.html")
        with self.artifact_store.open(vis_path, "w") as f:
            f.write(data.to_html())
        return {vis_path: VisualizationType.HTML}

    def extract_metadata(self, data: MyClass) -> Dict[str, MetadataType]:
        """Extract metadata for tracking."""
        # Optional - extract metadata
        return {
            "name": data.name,
            "created_at": data.created_at,
            "num_records": len(data.records),
        }
```

### 2. Using Your Custom Materializer

Once you've defined the materializer, you can use it in your pipeline:

```python
from zenml import step, pipeline

# from mymodule import MyClass, MyClassMaterializer


@step(output_materializers=MyClassMaterializer)
def create_my_class() -> MyClass:
    """Create an instance of MyClass."""
    return MyClass(name="test", records=[1, 2, 3])


@step
def use_my_class(my_obj: MyClass) -> None:
    """Use the MyClass instance."""
    print(f"Name: {my_obj.name}, Records: {my_obj.records}")


@pipeline
def custom_pipeline():
    data = create_my_class()
    use_my_class(data)
```

### 3. Multiple Outputs with Different Materializers

When a step has multiple outputs that need different materializers:

```python
from typing import Tuple, Annotated

from zenml import step


# Keys of output_materializers match the output names given via Annotated
@step(output_materializers={
    "obj1": MyClass1Materializer,
    "obj2": MyClass2Materializer,
})
def create_objects() -> Tuple[
    Annotated[MyClass1, "obj1"],
    Annotated[MyClass2, "obj2"],
]:
    """Create instances of different classes."""
    return MyClass1(), MyClass2()
```

### 4. Registering a Materializer Globally

You can register a materializer globally to override the default materializer for a specific type:

```python
from zenml.materializers.materializer_registry import materializer_registry
from zenml.materializers.base_materializer import BaseMaterializer
import pandas as pd


# Create a custom pandas materializer
class FastPandasMaterializer(BaseMaterializer):
    # Implementation here
    ...


# Register it for pandas DataFrames globally
materializer_registry.register_and_overwrite_type(
    key=pd.DataFrame, type_=FastPandasMaterializer
)
```

## Materializer Implementation Details

When implementing a custom materializer, consider these aspects:

### Handling Storage

The `self.uri` property contains the path to the directory where your artifact should be stored. Use this path to create files or subdirectories for your data.
When reading or writing files, always use `self.artifact_store.open()` rather than direct file I/O to ensure compatibility with different artifact stores (local filesystem, cloud storage, etc.).

### Visualization Support

The `save_visualizations()` method allows you to create visualizations that will be shown in the ZenML dashboard. You can return multiple visualizations of different types:

* `VisualizationType.HTML`: Embedded HTML content
* `VisualizationType.MARKDOWN`: Markdown content
* `VisualizationType.IMAGE`: Image files
* `VisualizationType.CSV`: CSV tables

**Configuring Visualizations**

Some materializers support configuration via environment variables to customize their visualization behavior. For example:

* `ZENML_PANDAS_SAMPLE_ROWS`: Controls the number of rows shown in sample visualizations created by the `PandasMaterializer`. Default is 10 rows.

### Metadata Extraction

The `extract_metadata()` method allows you to extract key information about your artifact for indexing and searching. This metadata will be displayed alongside the artifact in the dashboard.
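As a rough illustration of the kind of dictionary `extract_metadata()` might return, the sketch below computes a few summary values using only the standard library. The `RecordSet` class, the `summarize` function, the `MetadataValue` alias, and the chosen keys are all assumptions made for this example; in a real materializer the same logic would live inside the `extract_metadata()` method.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Simple JSON-friendly values; an assumption for this sketch, not ZenML's MetadataType
MetadataValue = Union[str, int, float, bool]


@dataclass
class RecordSet:
    """Hypothetical custom data type used only for this example."""

    name: str
    records: List[float] = field(default_factory=list)


def summarize(data: RecordSet) -> Dict[str, MetadataValue]:
    """Return small, searchable facts about the artifact rather than the data itself."""
    metadata: Dict[str, MetadataValue] = {
        "name": data.name,
        "num_records": len(data.records),
    }
    # Only report value ranges when there is something to summarize
    if data.records:
        metadata["min_value"] = min(data.records)
        metadata["max_value"] = max(data.records)
    return metadata


summary = summarize(RecordSet(name="test", records=[1.0, 2.0, 3.0]))
print(summary)
```

Keeping the values small and scalar, as in this sketch, is one reasonable way to keep the metadata cheap to store and easy to scan next to the artifact.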
### Temporary Files

If you need a temporary directory while processing artifacts, use the `get_temporary_directory()` helper:

```python
with self.get_temporary_directory() as temp_dir:
    # Process files in the temporary directory
    # Files will be automatically cleaned up
    ...
```

### Example: A Complete Materializer

Here's a complete example of a custom materializer for a simple class:

```python
import os
import json
from typing import Type, Any, Dict

from zenml import pipeline, step
from zenml.materializers.base_materializer import BaseMaterializer
from zenml.enums import ArtifactType


class MyObj:
    def __init__(self, name: str):
        self.name = name

    def to_dict(self):
        return {"name": self.name}

    @classmethod
    def from_dict(cls, data):
        return cls(name=data["name"])


class MyMaterializer(BaseMaterializer):
    """Materializer for MyObj objects."""

    ASSOCIATED_TYPES = (MyObj,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def load(self, data_type: Type[Any]) -> MyObj:
        """Load MyObj from storage."""
        filepath = os.path.join(self.uri, "data.json")
        with self.artifact_store.open(filepath, "r") as f:
            data = json.load(f)
        return MyObj.from_dict(data)

    def save(self, data: MyObj) -> None:
        """Save MyObj to storage."""
        filepath = os.path.join(self.uri, "data.json")
        with self.artifact_store.open(filepath, "w") as f:
            json.dump(data.to_dict(), f)


# Usage in a pipeline
@step(output_materializers=MyMaterializer)
def create_my_obj() -> MyObj:
    return MyObj(name="my_object")


@step
def use_my_obj(my_obj: MyObj) -> None:
    print(f"Object name: {my_obj.name}")


@pipeline
def my_pipeline():
    obj = create_my_obj()
    use_my_obj(obj)
```

## Unmaterialized artifacts

Whenever you pass artifacts as outputs from one pipeline step to other steps as inputs, the corresponding materializer for the respective data type defines how this artifact is first serialized and written to the artifact store, and then deserialized and read in the next step.
However, there are instances where you might **not** want to materialize an artifact in a step, but rather use a reference to it instead. This is where skipping materialization comes in.

{% hint style="warning" %}
Skipping materialization might have unintended consequences for downstream tasks that rely on materialized artifacts. Only skip materialization if there is no other way to do what you want to do.
{% endhint %}

#### How to skip materialization

While materializers should in most cases be used to control how artifacts are returned and consumed from pipeline steps, you might sometimes need to have a completely unmaterialized artifact in a step, e.g., if you need to know the exact path to where your artifact is stored.

An unmaterialized artifact is a [`zenml.artifacts.unmaterialized_artifact.UnmaterializedArtifact`](https://sdkdocs.zenml.io/latest/core_code_docs/core-artifacts.html#zenml.artifacts.unmaterialized_artifact). Among other things, it has a property `uri` that points to the unique path in the artifact store where the artifact is persisted. You can use an unmaterialized artifact by specifying `UnmaterializedArtifact` as the type in the step:

```python
from zenml.artifacts.unmaterialized_artifact import UnmaterializedArtifact
from zenml import step


@step
def my_step(my_artifact: UnmaterializedArtifact):  # rather than pd.DataFrame
    pass
```

The following shows an example of how unmaterialized artifacts can be used in the steps of a pipeline. The pipeline we define will look like this:

```shell
s1 -> s3
s2 -> s4
```

`s1` and `s2` produce identical artifacts; however, `s3` consumes materialized artifacts while `s4` consumes unmaterialized artifacts. `s4` can now use the `dict_.uri` and `list_.uri` paths directly rather than their materialized counterparts.
```python
from typing import Annotated, Dict, List, Tuple

from zenml.artifacts.unmaterialized_artifact import UnmaterializedArtifact
from zenml import pipeline, step


@step
def step_1() -> Tuple[
    Annotated[Dict[str, str], "dict_"],
    Annotated[List[str], "list_"],
]:
    return {"some": "data"}, []


@step
def step_2() -> Tuple[
    Annotated[Dict[str, str], "dict_"],
    Annotated[List[str], "list_"],
]:
    return {"some": "data"}, []


@step
def step_3(dict_: Dict, list_: List) -> None:
    # Regular type hints: inputs arrive as materialized Python objects
    assert isinstance(dict_, dict)
    assert isinstance(list_, list)


@step
def step_4(
    dict_: UnmaterializedArtifact,
    list_: UnmaterializedArtifact,
) -> None:
    # UnmaterializedArtifact hints: inputs arrive as references with a uri
    print(dict_.uri)
    print(list_.uri)


@pipeline
def example_pipeline():
    step_3(*step_1())
    step_4(*step_2())


example_pipeline()
```

You can see another example of using an `UnmaterializedArtifact` when triggering a [pipeline from another](/concepts/snapshots.md#advanced-usage-running-snapshots-from-other-pipelines).

## Best Practices

When working with materializers:

1. **Prefer structured formats** over pickle or other binary formats for better cross-environment compatibility.
2. **Test your materializer** with different artifact stores (local, S3, etc.) to ensure it works consistently.
3. **Consider versioning** if your data structure might change over time.
4. **Create visualizations** to help users understand your artifacts in the dashboard.
5. **Extract useful metadata** to make artifacts easier to find and understand.
6. **Be explicit** about materializer assignments for clarity, even if ZenML can detect them automatically.
7. **Avoid using the CloudpickleMaterializer** in production as it's not reliable across different Python versions.

## Conclusion

Materializers are a powerful part of ZenML's artifact system, enabling proper storage and handling of any data type.
By creating custom materializers for your specific data structures, you ensure that your ML pipelines are robust, efficient, and can handle any data type required by your workflows.
