Run Steps on Specialized Hardware
Execute individual steps in specialized environments.
The step operator defers the execution of individual pipeline steps to specialized runtime environments that are optimized for Machine Learning workloads. This is helpful when different steps require different backends, for example powerful GPU instances for training jobs or distributed compute for ingestion streams.
While an orchestrator defines how and where your entire pipeline runs, a step operator defines how and where an individual step runs.
An operator can be registered as follows:
zenml step-operator register OPERATOR_NAME \
    --flavor=OPERATOR_TYPE \
    ...
The registered operator can then be assigned to individual steps via the step decorator:

from zenml.steps import step

@step(custom_step_operator=OPERATOR_NAME)
def trainer(...) -> ...:
    """Train a model."""
    # This step will run in the environment specified by the operator.
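For illustration, here is a minimal sketch of how such a step fits into a pipeline, with only the decorated step deferred to the operator (the step, pipeline, and operator names below are made up for this example):

from zenml.pipelines import pipeline
from zenml.steps import step


@step
def importer() -> float:
    """Runs wherever the orchestrator places it."""
    return 0.001


@step(custom_step_operator="my_operator")  # hypothetical operator name
def trainer(lr: float) -> float:
    """Runs in the environment backing `my_operator`."""
    # ... train a model with the given learning rate ...
    return lr


@pipeline
def train_pipeline(importer, trainer):
    trainer(lr=importer())


# Only `trainer` is deferred to the step operator; `importer` runs as usual.
train_pipeline(importer=importer(), trainer=trainer()).run()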

Pre-built Step Operators

ZenML has support for some pre-built step operators, namely:
  • AWS Sagemaker
  • AzureML
  • Vertex AI
All of these operators are geared towards the use case of running a training job on specialized cloud backends.

AzureML

  • First, you require a Machine Learning resource on Azure. If you don't already have one, you can create it through the All resources page of the Azure portal.
  • Once your resource is created, you can head over to the Azure Machine Learning Studio and create a compute cluster to run your pipelines.
  • Next, you will need an environment for your pipelines. You can simply create one following the guide here.
  • Finally, you have to set up your artifact store. In order to do this, you need to create a blob container on Azure.
Optionally, you can also create a Service Principal for authentication. This is especially useful if you are planning to orchestrate your pipelines in a non-local setup such as Kubeflow, where your local authentication won't be accessible.
The command to register the stack component would look like the following. More details about the parameters that you can configure can be found in the class definition of Azure Step Operator in the API docs (https://apidocs.zenml.io/).
zenml step-operator register azureml \
    --flavor=azureml \
    --subscription_id=<AZURE_SUBSCRIPTION_ID> \
    --resource_group=<AZURE_RESOURCE_GROUP> \
    --workspace_name=<AZURE_WORKSPACE_NAME> \
    --compute_target_name=<AZURE_COMPUTE_TARGET_NAME> \
    --environment_name=<AZURE_ENVIRONMENT_NAME>
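Before registering, you can optionally sanity-check these values with the azureml-core Python SDK. This is only a sketch, assuming azureml-core is installed and you are authenticated against Azure:

from azureml.core import Workspace

# Fails early if the subscription, resource group or workspace is wrong.
ws = Workspace.get(
    name="<AZURE_WORKSPACE_NAME>",
    subscription_id="<AZURE_SUBSCRIPTION_ID>",
    resource_group="<AZURE_RESOURCE_GROUP>",
)

# The compute target that the operator submits jobs to must already exist.
print(ws.compute_targets["<AZURE_COMPUTE_TARGET_NAME>"])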

Amazon SageMaker

  • First, you need to create a role in the IAM console that you want the jobs running in Sagemaker to assume. This role should at least have the AmazonS3FullAccess and AmazonSageMakerFullAccess policies applied. Check this link to learn how to create a role.
  • Next, you need to choose which instance type should be used to run your jobs. You can get the list here.
  • Optionally, you can choose an S3 bucket to which Sagemaker should output any artifacts from your training run.
  • You can also supply an experiment name if you have one created already. Check this guide to learn how. If not provided, the job runs will be independent.
  • You can also choose a custom Docker image that you want ZenML to use as a base image for creating an environment to run your jobs in Sagemaker.
  • You need to have the AWS CLI set up with the right credentials. Make sure you have the permissions to create and manage Sagemaker runs.
  • A container registry has to be configured in the stack. This registry will be used by ZenML to push your job images that Sagemaker will run.
Once you have all these values handy, you can proceed to setting up the components required for your stack.
The command to register the stack component would look like the following. More details about the parameters that you can configure can be found in the class definition of Sagemaker Step Operator in the API docs (https://apidocs.zenml.io/).
zenml step-operator register sagemaker \
    --flavor=sagemaker \
    --role=<SAGEMAKER_ROLE> \
    --instance_type=<SAGEMAKER_INSTANCE_TYPE> \
    --base_image=<CUSTOM_BASE_IMAGE> \
    --bucket_name=<S3_BUCKET_NAME> \
    --experiment_name=<SAGEMAKER_EXPERIMENT_NAME>
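You can verify your credentials and the role from the list above with boto3 before registering the operator. A minimal sketch, assuming boto3 is installed, your default AWS profile is configured, and the role name placeholder is illustrative:

import boto3

# Confirms that your AWS credentials are picked up correctly.
print(boto3.client("sts").get_caller_identity()["Arn"])

# Confirms that the execution role for Sagemaker jobs exists.
# Note: get_role expects the role *name*, not the full ARN.
role = boto3.client("iam").get_role(RoleName="<SAGEMAKER_ROLE_NAME>")
print(role["Role"]["Arn"])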

GCP Vertex AI

  • You need to have the gcloud CLI set up with the right credentials. Make sure you have the permissions to create and manage Vertex AI custom jobs. Preferably, you should create a service account with the right permissions to create Vertex AI jobs (roles/aiplatform.admin) and push to the Artifact/Container registry (roles/storage.admin). Then set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the service account file.
  • Next, you need to choose which instance type should be used to run your jobs. You can get the list here.
  • You can choose a GCP bucket to which Vertex should output any artifacts from your training run.
  • You can also choose a custom Docker image that you want ZenML to use as a base image for creating an environment to run your jobs on Vertex AI.
  • A container registry has to be configured in the stack. This registry will be used by ZenML to push your job images that Vertex will use. Check out the cloud guide to learn how you can set up a GCP container registry.
Once you have all these values handy, you can proceed to setting up the components required for your stack.
The command to register the stack component would look like the following. More details about the parameters that you can configure can be found in the class definition of Vertex Step Operator in the API docs (https://apidocs.zenml.io/).
zenml step-operator register vertex \
    --flavor=vertex \
    --project=zenml-core \
    --service_account_path=... \
    --region=europe-west1 \
    --machine_type=n1-standard-4 \
    --base_image=<CUSTOM_BASE_IMAGE> \
    --accelerator_type=...
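To check that GOOGLE_APPLICATION_CREDENTIALS points at a usable service account file, you can resolve the credentials explicitly. A sketch, assuming the google-auth package is installed:

import google.auth

# Resolves credentials from GOOGLE_APPLICATION_CREDENTIALS (among other
# sources) and raises if the service account file is missing or invalid.
credentials, project_id = google.auth.default()
print(f"Authenticated for project: {project_id}")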
A concrete example of using these step operators can be found here.

Building your own StepOperator

To have ZenML run your steps in your own backend, all you need to do is implement the BaseStepOperator class with the code that sets up your environment and submits the ZenML entrypoint command to it.
from abc import ABC, abstractmethod
from typing import List

from zenml.stack import StackComponent


class BaseStepOperator(StackComponent, ABC):
    """Base class for all ZenML step operators."""

    ...
    ...

    @abstractmethod
    def launch(
        self,
        pipeline_name: str,
        run_name: str,
        requirements: List[str],
        entrypoint_command: List[str],
    ) -> None:
        """Abstract method to execute a step.

        Concrete step operator subclasses must implement the following
        functionality in this method:
        - Prepare the execution environment and install all the necessary
          `requirements`
        - Launch a **synchronous** job that executes the `entrypoint_command`

        Args:
            pipeline_name: Name of the pipeline which the step to be executed
                is part of.
            run_name: Name of the pipeline run which the step to be executed
                is part of.
            entrypoint_command: Command that executes the step.
            requirements: List of pip requirements that must be installed
                inside the step operator environment.
        """
        # Write custom logic here.
The launch method is what gets called when ZenML executes your step, and it is responsible for any logic that pertains to your custom backend.
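As an illustration, a hypothetical operator that runs each step in a local subprocess might look like the following. This is a minimal sketch rather than a production implementation, and the import path for BaseStepOperator may differ between ZenML versions:

import subprocess
import sys
from typing import List

from zenml.step_operators import BaseStepOperator  # assumed import path


class SubprocessStepOperator(BaseStepOperator):
    """Hypothetical operator that executes steps in a local subprocess."""

    def launch(
        self,
        pipeline_name: str,
        run_name: str,
        requirements: List[str],
        entrypoint_command: List[str],
    ) -> None:
        # Prepare the environment: install the step's pip requirements.
        if requirements:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", *requirements]
            )
        # Launch a synchronous job that executes the entrypoint command.
        subprocess.check_call(entrypoint_command)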