Processing Backends

Some pipelines just need more - processing power, parallelism, permissions, you name it.

A common scenario for large datasets is distributed processing, e.g. via Apache Beam, Google Dataflow, Apache Spark, or other frameworks. In line with our integration-driven design philosophy, ZenML makes it easy to distribute individual Steps of a pipeline (e.g. when large datasets are involved). Every Step in a pipeline accepts a ProcessingBackend as input.

Overview

The pattern to add a backend to a step is always the same:

backend = ...  # define the backend you want to use
pipeline.add_step(
    Step(...).with_backend(backend)
)
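
For example, the backend above could simply be the default Beam-based ProcessingBaseBackend, a minimal sketch of which is shown here; the import path is an assumption and may differ across ZenML versions:

# Minimal sketch: attach the default Beam-based backend to a step.
# The import path is an assumption and may differ across ZenML versions.
from zenml.core.backends.processing.processing_base_backend import \
    ProcessingBaseBackend

backend = ProcessingBaseBackend()
pipeline.add_step(
    Step(...).with_backend(backend)
)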

Supported Processing Backends

ZenML is built on Apache Beam. You can simply use the ProcessingBaseBackend, or extend ZenML with your own custom backend.
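
If none of the built-in backends fits your infrastructure, a custom backend can be written by subclassing ProcessingBaseBackend. The sketch below is only illustrative: the class name, import path, and the get_beam_args hook are assumptions about the extension point and may not match your installed ZenML version.

# Hypothetical sketch of a custom processing backend targeting the
# Apache Beam Spark runner. The import path, class name, and overridden
# method are assumptions -- check the ProcessingBaseBackend source for
# the actual extension points in your ZenML version.
from zenml.core.backends.processing.processing_base_backend import \
    ProcessingBaseBackend


class SparkProcessingBackend(ProcessingBaseBackend):
    """Runs distributed Steps via the Apache Beam Spark runner."""

    def get_beam_args(self, pipeline_name=None, pipeline_root=None):
        # Assumed hook: return the Beam pipeline options that Steps
        # using this backend should be executed with.
        return [
            '--runner=SparkRunner',
            '--spark_master_url=spark://my-spark-master:7077',  # placeholder
        ]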

For convenience, ZenML supports a steadily growing number of processing backends out of the box:

Google Dataflow

As ZenML is built on Apache Beam, Google Cloud Dataflow is supported out of the box.

Prerequisites:

Usage:

import os
# NOTE: the import path may differ across ZenML versions.
from zenml.core.backends.processing.processing_dataflow_backend import \
    ProcessingDataFlowBackend

# Define the processing backend (GCP_PROJECT / GCP_BUCKET: your project id and GCS bucket)
processing_backend = ProcessingDataFlowBackend(
    project=GCP_PROJECT,
    staging_location=os.path.join(GCP_BUCKET, 'dataflow_processing/staging'),
)

# Reference the processing backend in steps
# Add a split
training_pipeline.add_split(
    RandomSplit(...).with_backend(processing_backend)
)

# Add a preprocessing unit
training_pipeline.add_preprocesser(
    StandardPreprocesser(...).with_backend(processing_backend)
)
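
Once the backend is attached to the relevant steps, the pipeline is run as usual; the sketch below assumes the standard training_pipeline.run() entry point.

# Run the pipeline; steps with the attached ProcessingDataFlowBackend
# are executed on Google Cloud Dataflow.
training_pipeline.run()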