The Preprocesser defines how data is transformed before being sent to the Trainer for actual training.

Standard Preprocesser

ZenML comes equipped with a standard preprocesser that exposes an interface to standard preprocessing operations.

The Standard Preprocesser uses TensorFlow Transform and the TFX Transform component under the hood. It therefore supports everything TensorFlow Transform enables: most simple TensorFlow operations, plus the TensorFlow Transform helper functions for common preprocessing methods such as normalization. The TensorFlow Transform documentation lists everything that comes out of the box.

Using TensorFlow Transform also brings the advantage of scale, because it uses Apache Beam under the hood to distribute the preprocessing.


To enable distributed computing during a pipeline run, a user must have a Beam-compatible Preprocessing Backend such as Google Dataflow.


Coming soon.

Create custom preprocesser

If the built-in options above are not what you are looking for, you can also implement your own.

For this, ZenML provides the BasePreprocesserStep interface, which you can subclass in a standard object-oriented manner to define your own custom preprocessing logic.

from zenml.core.steps.preprocesser.base_preprocesser import BasePreprocesserStep

class MyCustomPreprocesser(BasePreprocesserStep):

    def preprocessing_fn(self, inputs: dict):
        outputs = {}
        # your preprocessing logic goes here
        return outputs

If you are familiar with TensorFlow Transform, this is the same preprocessing_fn function that you write when creating TensorFlow Transform pipelines, because TensorFlow Transform is exactly what is used under the hood.

The inputs parameter of preprocessing_fn is a dict where keys are feature names and values are TensorFlow tensors that represent the values of the features. outputs is a dict where keys are transformed feature names and values are tensors holding the transformed values of the features.

The conversion of inputs to outputs can be performed by applying any TensorFlow or TensorFlow Transform based method to the values of inputs and then populating outputs with the results.
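To make the inputs-to-outputs contract concrete, here is a minimal sketch of a preprocessing_fn that normalizes every feature to zero mean and unit variance. It is only an illustration: plain Python lists stand in for TensorFlow tensors so that the snippet runs without TensorFlow installed, and the feature name "age" and the "_scaled" suffix are made up for this example. In a real step you would apply a TensorFlow or TensorFlow Transform method (for instance tft.scale_to_z_score) to each tensor instead of computing the statistics by hand.

```python
import statistics
from typing import Dict, List


def preprocessing_fn(inputs: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Illustrative stand-in: lists play the role of tensors."""
    outputs = {}
    for name, values in inputs.items():
        # Statistics are computed over the whole feature column first.
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        # Then each value is transformed; constant features map to 0.0
        # to avoid dividing by zero.
        outputs[name + "_scaled"] = [
            (v - mean) / std if std else 0.0 for v in values
        ]
    return outputs


# Hypothetical feature column, used only to show the shape of the result.
result = preprocessing_fn({"age": [20.0, 30.0, 40.0]})
print(result)
```

The function never modifies inputs; it builds a new dict keyed by the transformed feature names, which is the same shape of contract described above.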


Currently, the Standard Preprocesser is closely tied to TensorFlow Transform and serves as a simple abstraction over it. In future releases, the two will be decoupled. Note that non-TensorFlow Trainers can still consume artifacts produced by this Step. The PyTorch trainer example illustrates this well.


Coming soon. For now, please refer to the extensive TensorFlow Transform documentation available online.