Some pipelines just need more: processing power, parallelism, permissions, you name it.

A common scenario on large datasets is distributed processing, e.g. via Apache Beam, Google Cloud Dataflow, Apache Spark, or other frameworks. In line with our integration-driven design philosophy, ZenML makes it easy to distribute certain Steps in a pipeline (e.g. in cases where large datasets are involved). All Steps within a pipeline take as input a processing backend, which defines where and how the step's logic is executed.
The pattern to add a backend to a step is always the same:

```python
backend = ...  # define the backend you want to use

pipeline.add_step(
    Step(...).with_backend(backend)
)
```
Supported Processing Backends
ZenML is built on Apache Beam. You can simply use the default ProcessingBaseBackend, or extend ZenML with your own custom backend.
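If you do go the custom route, the extension point is the base class itself. Here is a minimal sketch; the import path and the `get_beam_args()` hook are assumptions for illustration and may differ across ZenML versions, so check the `ProcessingBaseBackend` source for the actual interface:

```python
from typing import List, Optional

# NOTE: this import path is an assumption; it may differ in your ZenML version.
from zenml.core.backends.processing.processing_base_backend import (
    ProcessingBaseBackend,
)


class MultiProcessBeamBackend(ProcessingBaseBackend):
    """Illustrative custom backend that tweaks Apache Beam's arguments.

    `get_beam_args()` is assumed here to be the hook through which a
    processing backend hands pipeline options to Beam -- verify this
    against the base class before relying on it.
    """

    def get_beam_args(self,
                      pipeline_name: Optional[str] = None,
                      pipeline_root: Optional[str] = None) -> List[str]:
        # Run Beam locally, but with several worker processes.
        return [
            '--runner=DirectRunner',
            '--direct_num_workers=4',
        ]
```

Attaching it to a step then follows the same pattern as above, e.g. `Step(...).with_backend(MultiProcessBeamBackend())`.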
For convenience, ZenML supports a steadily growing number of processing backends out of the box.

Google Cloud Dataflow

ZenML natively supports Google Cloud Dataflow, as it's built on Apache Beam.
To use it, make sure the following requirements are met:

- Enable billing in your Google Cloud Platform project.
- Make sure you have permissions to launch Dataflow jobs, whether through a service account or default credentials (see the sketch below).
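For the credentials requirement, a common approach is Google's application default credentials. A minimal sketch, assuming you authenticate with a service account key file (the path below is a placeholder):

```python
import os

# Point Google client libraries at a service account key file.
# Alternatively, run `gcloud auth application-default login` once and
# rely on your default credentials instead of setting this variable.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account-key.json'
```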
Once that is set up, define the backend and attach it to the Steps that should run on Dataflow:

```python
import os

# GCP_PROJECT and GCP_BUCKET are placeholders for your GCP project ID
# and a GCS bucket path, e.g. 'gs://my-bucket'.

# Define the processing backend
processing_backend = ProcessingDataFlowBackend(
    project=GCP_PROJECT,
    staging_location=os.path.join(GCP_BUCKET, 'dataflow_processing/staging'),
)

# Reference the processing backend in steps
# Add a split
training_pipeline.add_split(
    RandomSplit(...).with_backend(processing_backend)
)

# Add a preprocessing unit
training_pipeline.add_preprocesser(
    StandardPreprocesser(...).with_backend(processing_backend)
)
```
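Because backends are attached per Step, only the Steps configured with `with_backend(processing_backend)` are shipped off to Dataflow; any other Steps in the pipeline keep using whatever backend they were given (the local ProcessingBaseBackend by default).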