Google Cloud VertexAI Orchestrator
Orchestrating your pipelines to run on Vertex AI.
Vertex AI Pipelines is a serverless ML workflow tool running on the Google Cloud Platform. It offers a quick way to run your code in a production-ready, repeatable cloud orchestrator that requires minimal setup, without provisioning and paying for standby compute.
This component is only meant to be used within the context of a remote ZenML deployment. Usage with a local ZenML deployment may lead to unexpected behavior!
You should use the Vertex orchestrator if:
you're already using GCP.
you're looking for a proven production-grade orchestrator.
you're looking for a UI in which you can track your pipeline runs.
you're looking for a managed solution for running your pipelines.
you're looking for a serverless solution for running your pipelines.
In order to use a Vertex AI orchestrator, you first need to deploy ZenML to the cloud. We recommend deploying ZenML in the same Google Cloud project as the Vertex infrastructure, but this is not required. You must ensure that you are connected to the remote ZenML server before using this stack component.
The only other thing necessary to use the ZenML Vertex orchestrator is enabling Vertex-relevant APIs on the Google Cloud project.
To use the Vertex orchestrator, we need:
The ZenML gcp integration installed. If you haven't done so, run zenml integration install gcp.
The GCP project ID and location in which you want to run your Vertex AI pipelines.
You also have three different options to provide credentials to the orchestrator:
To understand what accounts you need to provision and why, let's look at the different components of the Vertex orchestrator:
As you can see, there can be dedicated service accounts involved in running a Vertex AI pipeline. That's two service accounts if you also use a service account to authenticate to GCP in the ZenML client environment. However, you can keep it simple and use the same service account everywhere.
gcloud CLI with user account
This is the easiest way to configure the Vertex AI Orchestrator, but it has the following drawbacks:
the setup is not portable to other machines or reproducible by other users.
it uses the Compute Engine default service account, which is not recommended, given that it has a lot of permissions by default and is used by many other GCP services.
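We can then register the orchestrator with the vertex flavor, passing the GCP project ID and location, for example: zenml orchestrator register <ORCHESTRATOR_NAME> --flavor=vertex --project=<PROJECT_ID> --location=<GCP_LOCATION>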
This use-case assumes you have already configured a GCP service account with the following permissions:
It also assumes you have already created a service account key for this service account and downloaded it to your local machine (e.g. in a connectors-vertex-ai-workload.json file). This setup is not recommended if security matters to you: the principle of least privilege is not applied here, and the environment in which the pipeline steps run has many permissions that it doesn't need.
The following GCP service accounts are needed:
a "client" service account that has the following permissions:
A key is also needed for the "client" service account. You can create a key for this service account and download it to your local machine (e.g. in a connectors-vertex-ai-workload.json file).
With the orchestrator registered, we can use it in our active stack.
You can now run any ZenML pipeline using the Vertex orchestrator by executing the Python script that runs it.
Vertex comes with its own UI that you can use to find further details about your pipeline runs, such as the logs of your steps.
For any runs executed on Vertex, you can get the URL to the Vertex UI in Python using the following code snippet:
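A minimal sketch of that lookup, assuming the URL is stored in the run metadata under the key orchestrator_url and that client is a zenml.client.Client instance. The helper name vertex_ui_url is hypothetical. Depending on the ZenML version, the metadata entry may be a plain string or an object with a value attribute, so the sketch accepts both.

```python
from typing import Any


def vertex_ui_url(client: Any, run_name: str) -> str:
    """Return the Vertex UI URL recorded for a pipeline run.

    `client` is expected to be a `zenml.client.Client`. It is passed in so
    that the helper itself has no import-time dependency on ZenML.
    """
    pipeline_run = client.get_pipeline_run(run_name)
    # Assumption: the orchestrator stores the URL under "orchestrator_url".
    entry = pipeline_run.run_metadata["orchestrator_url"]
    # Some ZenML versions wrap metadata values in an object with `.value`.
    return str(getattr(entry, "value", entry))


# Usage (requires a connected ZenML client):
#   from zenml.client import Client
#   print(vertex_ui_url(Client(), "<PIPELINE_RUN_NAME>"))
```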
How to schedule a pipeline
The Vertex orchestrator only supports the cron_expression, start_time (optional) and end_time (optional) parameters in the Schedule object, and will ignore all other parameters supplied to define the schedule.
The start_time and end_time timestamp parameters are both optional and are to be specified in local time. They define the time window in which the pipeline runs will be triggered. If they are not specified, the pipeline will run indefinitely.
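The supported parameters can be collected as in the sketch below. The helper vertex_schedule_kwargs and the pipeline name my_pipeline are hypothetical; the commented usage assumes the Schedule class from zenml.config.schedule and a pipeline that accepts a schedule through with_options.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Optional

# The only Schedule fields the Vertex orchestrator is described as honouring.
SUPPORTED_FIELDS = ("cron_expression", "start_time", "end_time")


def vertex_schedule_kwargs(
    cron_expression: str,
    start_time: Optional[datetime] = None,
    end_time: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Collect the supported schedule fields, dropping unset optional ones."""
    candidate = {
        "cron_expression": cron_expression,
        "start_time": start_time,
        "end_time": end_time,
    }
    return {
        name: candidate[name]
        for name in SUPPORTED_FIELDS
        if candidate[name] is not None
    }


# Trigger at the start of every hour, but only during the next 7 days.
now = datetime.now()  # naive timestamp, i.e. local time
kwargs = vertex_schedule_kwargs(
    "0 * * * *", start_time=now, end_time=now + timedelta(days=7)
)

# Usage (requires ZenML):
#   from zenml.config.schedule import Schedule
#   my_pipeline.with_options(schedule=Schedule(**kwargs))()
```

Leaving out start_time and end_time yields a dictionary with only cron_expression, which corresponds to a schedule with no time window.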
How to update/delete a scheduled pipeline
Note that ZenML only gets involved to schedule a run, but maintaining the lifecycle of the schedule is the responsibility of the user.
In order to cancel a scheduled Vertex pipeline, you need to manually delete the schedule in Vertex AI (via the UI or the CLI). Here is an example (WARNING: Will delete all schedules if you run this):
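One way to do this from Python is sketched below, assuming the google-cloud-aiplatform package is installed and that its PipelineJobSchedule class can list and delete schedules in the given project and location. The helper name delete_vertex_schedules and the display-name filter are assumptions for illustration; verify them against your installed SDK version before running anything destructive.

```python
from typing import Optional


def delete_vertex_schedules(
    project: str, location: str, display_name: Optional[str] = None
) -> int:
    """Delete Vertex AI pipeline schedules and return how many were removed.

    WARNING: with `display_name=None`, every pipeline schedule found in the
    project and location is deleted.
    """
    # Imported lazily so that defining the helper does not require the
    # google-cloud-aiplatform package.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    # Assumption: schedules can be filtered by their display name.
    list_filter = f'display_name="{display_name}"' if display_name else None
    schedules = aiplatform.PipelineJobSchedule.list(filter=list_filter)
    for schedule in schedules:
        schedule.delete()
    return len(schedules)


# Usage (destructive, so left commented out):
#   delete_vertex_schedules("<PROJECT_ID>", "<GCP_LOCATION>", "<SCHEDULE_NAME>")
```

This sketch only touches the schedules on the Vertex AI side; any schedule records kept by ZenML are left as they are.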
For additional configuration of the Vertex orchestrator, you can pass VertexOrchestratorSettings, which allows you to configure labels for your Vertex Pipeline jobs or specify which GPU to use.
If your pipeline steps have certain hardware requirements, you can specify them as ResourceSettings:
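As a minimal sketch, the requirements can be written as a plain dictionary and handed to ResourceSettings. The concrete values (8 CPUs, 16GB of memory) and the step name trainer are illustrative assumptions, and the field names cpu_count and memory are assumed to match your ZenML version.

```python
# Hardware requirements written as a plain dictionary. Assumption: the keys
# mirror the `cpu_count` and `memory` fields of `ResourceSettings`.
resource_settings = {"cpu_count": 8, "memory": "16GB"}

# Resource requirements are attached under the "resources" settings key.
step_settings = {"resources": resource_settings}

# Usage (requires ZenML):
#   from zenml import step
#   from zenml.config import ResourceSettings
#
#   @step(settings={"resources": ResourceSettings(**resource_settings)})
#   def trainer() -> None:
#       ...
```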
To run your pipeline (or some steps of it) on a GPU, you will need to set both a node selector and the GPU count as follows:
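The sketch below combines the two pieces as plain settings dictionaries. It assumes that the orchestrator settings accept a pod_settings entry with node_selectors, that the GKE accelerator label is the right selector key, and that settings may be passed as dictionaries under the "resources" and "orchestrator" keys; all three depend on your ZenML version and should be checked against it.

```python
# Assumption: the node selector key is the GKE accelerator label, and the
# value is an accelerator type available in your region.
gpu_settings = {
    # GPU count, requested through the generic resource settings.
    "resources": {"gpu_count": 1},
    # Node selector, passed through the Vertex orchestrator settings.
    "orchestrator": {
        "pod_settings": {
            "node_selectors": {
                "cloud.google.com/gke-accelerator": "NVIDIA_TESLA_A100"
            }
        }
    },
}

# Usage (requires ZenML with the gcp integration):
#   from zenml import step
#
#   @step(settings=gpu_settings)
#   def gpu_trainer() -> None:
#       ...
```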
For more advanced hardware configuration, you can use VertexCustomJobParameters to customize each step's execution environment. This allows you to specify detailed requirements like boot disk size, accelerator type, machine type, and more without needing a separate step operator.
You can also specify these parameters at pipeline level to apply them to all steps:
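A hedged sketch of both variants follows. The parameter values are illustrative, the step and pipeline names are hypothetical, and passing the parameters as a nested dictionary under custom_job_parameters is an assumption about how your ZenML version validates settings; constructing VertexCustomJobParameters explicitly is the equivalent object-based form.

```python
# Per-job hardware configuration; the keys are VertexCustomJobParameters fields.
custom_job_parameters = {
    "boot_disk_size_gb": 200,
    "boot_disk_type": "pd-ssd",
    "machine_type": "n1-standard-4",
    "accelerator_type": "NVIDIA_TESLA_T4",
    "accelerator_count": 1,
}

# The parameters travel inside the orchestrator settings.
vertex_settings = {
    "orchestrator": {"custom_job_parameters": custom_job_parameters}
}

# Usage (requires ZenML with the gcp integration):
#   from zenml import pipeline, step
#
#   @step(settings=vertex_settings)       # applies to this step only
#   def trainer() -> None:
#       ...
#
#   @pipeline(settings=vertex_settings)   # applies to every step
#   def training_pipeline() -> None:
#       trainer()
```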
The VertexCustomJobParameters supports the following common configuration options:
boot_disk_size_gb: Size of the boot disk in GB (default: 100)
boot_disk_type: Type of disk ("pd-standard", "pd-ssd", etc.)
machine_type: Machine type for computation (e.g., "n1-standard-4")
accelerator_type: Type of accelerator (e.g., "NVIDIA_TESLA_T4", "NVIDIA_TESLA_A100")
accelerator_count: Number of accelerators to attach
service_account: Service account to use for the job
persistent_resource_id: ID of persistent resource for faster job startup
For advanced scenarios, you can use additional_training_job_args to pass additional parameters directly to the underlying Google Cloud Pipeline Components library:
If you specify parameters in additional_training_job_args that are also defined as explicit attributes (like machine_type or boot_disk_size_gb), the values in additional_training_job_args will override the explicit values. For example:
The resulting machine type will be "n1-standard-16". When this happens, ZenML will log a warning at runtime to alert you of the parameter override, which helps avoid confusion about which configuration values are actually being used.
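The precedence rule can be illustrated with a small standalone sketch. This is not ZenML's implementation: resolve_job_args is a hypothetical helper that only mimics the described behaviour, namely that values from additional_training_job_args win and that a warning is logged for each overridden parameter.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def resolve_job_args(
    explicit: Dict[str, Any], additional: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge job arguments so that `additional` wins on conflicting keys."""
    for key in sorted(explicit.keys() & additional.keys()):
        if explicit[key] != additional[key]:
            logger.warning(
                "Parameter '%s' overridden: %r -> %r",
                key,
                explicit[key],
                additional[key],
            )
    # Later entries take precedence when unpacking into a new dictionary.
    return {**explicit, **additional}


resolved = resolve_job_args(
    explicit={"machine_type": "n1-standard-4", "boot_disk_size_gb": 100},
    additional={"machine_type": "n1-standard-16"},
)
print(resolved["machine_type"])  # n1-standard-16
```

Parameters that appear only among the explicit attributes, such as boot_disk_size_gb here, pass through unchanged.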
Note that when using custom job parameters with persistent_resource_id, you must always specify a service_account as well.
Note that a service account with permissions to access the persistent resource is mandatory, so make sure to always include it in the configuration:
Navigate to the Stacks section in your ZenML dashboard and either create a new Vertex orchestrator or update an existing one. During the creation/update, set the persistent resource ID and other values in the custom_job_parameters attribute.
If you need to explicitly specify that no persistent resource should be used, set persistent_resource_id to an empty string:
Using a persistent resource is particularly useful when you're developing locally and want to iterate quickly on steps that need cloud resources. The startup time of the job can be extremely quick.
When using persistent resources (persistent_resource_id specified), you must always include a service_account. Conversely, when explicitly setting persistent_resource_id="" to avoid using persistent resources, ZenML will automatically set the service account to an empty string to avoid Vertex API errors, so don't set the service account in this case.
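These two rules can be summarised in a standalone sketch. The function normalize_persistent_resource is hypothetical and is not part of ZenML; in particular, raising a ValueError for a missing service account is this sketch's choice, and the resource ID and service account values are placeholders.

```python
from typing import Any, Dict


def normalize_persistent_resource(params: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the service-account rules for `persistent_resource_id`."""
    result = dict(params)
    resource_id = result.get("persistent_resource_id")
    if resource_id is None:
        # No persistent resource configured: nothing to check.
        return result
    if resource_id == "":
        # Explicit opt-out: the service account is blanked as well.
        result["service_account"] = ""
        return result
    if not result.get("service_account"):
        raise ValueError(
            "persistent_resource_id requires a service_account to be set"
        )
    return result


# A persistent resource together with its mandatory service account.
with_resource = normalize_persistent_resource(
    {
        "persistent_resource_id": "my-persistent-resource",
        "service_account": "workload@my-project.iam.gserviceaccount.com",
    }
)

# Explicitly disabling persistent resources clears the service account.
without_resource = normalize_persistent_resource({"persistent_resource_id": ""})
```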
Remember that persistent resources continue to incur costs as long as they're running, even when idle. Make sure to monitor your usage and configure appropriate idle timeout periods.
Docker installed and running.
A remote artifact store as part of your stack.
A remote container registry as part of your stack.
This part is without doubt the most involved part of using the Vertex orchestrator. In order to run pipelines on Vertex AI, you need to have a GCP user account and/or one or more GCP service accounts set up with proper permissions, depending on whether you wish to practice the principle of least privilege and distribute permissions across multiple service accounts.
use the gcloud CLI to authenticate locally with GCP
configure the orchestrator to use a service account key to authenticate with GCP by setting the service_account_path parameter in the orchestrator configuration.
(recommended) configure a GCP Service Connector with GCP credentials and then link the Vertex AI Orchestrator stack component to the Service Connector.
This section explains the components involved in running a Vertex AI pipeline and what permissions they need, then provides instructions for three different configuration use-cases:
the local gcloud CLI with a user account, including the ability to schedule pipelines
a single service account with all permissions, including the ability to schedule pipelines
multiple service accounts for different permissions, including the ability to schedule pipelines
the ZenML client environment is the environment where you run the ZenML code responsible for building the pipeline Docker image and submitting the pipeline to Vertex AI, among other things. This is usually your local machine or some other environment used to automate running pipelines, like a CI/CD job. This environment needs to be able to authenticate with GCP and needs to have the necessary permissions to create a job in Vertex Pipelines. If you are planning to run pipelines on a schedule, the ZenML client environment also needs additional permissions:
permissions to write the pipeline JSON file to the artifact store directly (NOTE: not needed if the Artifact Store is configured with credentials or is linked to a Service Connector)
the Vertex AI pipeline environment is the GCP environment in which the pipeline steps themselves run. The Vertex AI pipeline runs in the context of a GCP service account, which we'll call the workload service account here. The workload service account can be explicitly configured in the orchestrator configuration via the workload_service_account parameter. If it is omitted, the orchestrator will use the Compute Engine default service account for the GCP project in which the pipeline is running. This service account needs to have the following permissions:
permissions to run a Vertex AI pipeline.
This configuration use-case assumes you have configured the gcloud CLI to authenticate locally with your GCP account (i.e. by running gcloud auth login). It also assumes the following:
your GCP account has permissions to create a job in Vertex Pipelines.
the Compute Engine default service account for the GCP project in which the pipeline is running is updated with the additional permissions required to run a Vertex AI pipeline.
permissions to create a job in Vertex Pipelines.
permissions to run a Vertex AI pipeline.
permissions to write the pipeline JSON file to the artifact store directly.
This setup applies the principle of least privilege by using different service accounts, each with the minimum permissions needed for the component it serves. It also uses a GCP Service Connector to make the setup portable and reproducible. This is the kind of configuration you would normally use in production, but it requires a lot more work to prepare.
This setup involves creating and configuring several GCP service accounts, which is a lot of work and can be error-prone. If you don't really need the added security, you can use one of the other two configuration use-cases instead.
permissions to create a job in Vertex Pipelines.
permissions to create a Google Cloud Function.
permissions to write the pipeline JSON file to the artifact store directly (NOTE: not needed if the Artifact Store is configured with credentials or is linked to a Service Connector).
a "workload" service account that has permissions to run a Vertex AI pipeline.
With all the service accounts and the key ready, we can register the GCP Service Connector and the Vertex AI orchestrator.
ZenML will build a Docker image called <CONTAINER_REGISTRY_URI>/zenml:<PIPELINE_NAME>, which includes your code, and use it to run your pipeline steps in Vertex AI. Check out the ZenML documentation if you want to learn more about how ZenML builds these images and how you can customize them.
The Vertex Pipelines orchestrator supports running pipelines on a schedule using its native scheduling capability.
The cron_expression parameter supports timezones. For example, the expression TZ=Europe/Paris 0 10 * * * will trigger runs at 10:00 in the Europe/Paris timezone.
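To make the structure of such an expression explicit, the sketch below separates the optional timezone prefix from the five cron fields. The helper split_cron_timezone is hypothetical and only illustrates how the string reads; it is not how ZenML or Vertex AI parses the expression.

```python
from typing import Optional, Tuple


def split_cron_timezone(expression: str) -> Tuple[Optional[str], str]:
    """Split an optional leading `TZ=<zone>` token from a cron expression."""
    first, _, rest = expression.strip().partition(" ")
    if first.startswith("TZ="):
        return first[len("TZ="):], rest.strip()
    return None, expression.strip()


tz, cron = split_cron_timezone("TZ=Europe/Paris 0 10 * * *")
minute, hour, day_of_month, month, day_of_week = cron.split()
print(tz, hour, minute)  # Europe/Paris 10 0
```

An expression without the prefix, such as 0 10 * * *, yields no timezone, and the fields are returned unchanged.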
You can find the available accelerator types in the Google Cloud documentation.
These advanced parameters are passed directly to the underlying function of the Google Cloud Pipeline Components library. This approach lets you access new features of the Google API without requiring ZenML updates.
For a complete list of parameters supported by the underlying function, refer to the Google Cloud Pipeline Components documentation.
Note that if you wish to use this orchestrator to run steps on a GPU, you will need to follow the ZenML instructions for GPU-backed steps to ensure that it works. This requires some extra settings customization and is essential for enabling CUDA so that the GPU delivers its full acceleration.
When developing ML pipelines that use Vertex AI, the startup time for each step can be significant since Vertex needs to provision new compute resources for each run. To speed up development iterations, you can use Vertex AI's Persistent Resources feature, which keeps compute resources warm between runs.
To use persistent resources with the Vertex orchestrator, you first need to create a persistent resource using the GCP Cloud UI, or by following the instructions in the GCP documentation. Next, you'll need to configure your orchestrator to run on the persistent resource. This can be done either through the dashboard or the CLI, in which case it applies to all pipelines run with this orchestrator, or dynamically in code for a specific pipeline or even a single step.