Google Cloud VertexAI Orchestrator

Orchestrating your pipelines to run on Vertex AI.

Vertex AI Pipelines is a serverless ML workflow tool running on the Google Cloud Platform. It is an easy way to quickly run your code in a production-ready, repeatable cloud orchestrator that requires minimal setup without provisioning and paying for standby compute.

This component is only meant to be used within the context of a remote ZenML deployment scenario. Usage with a local ZenML deployment may lead to unexpected behavior!

When to use it

You should use the Vertex orchestrator if:

you're already using GCP.
you're looking for a proven production-grade orchestrator.
you're looking for a UI in which you can track your pipeline runs.
you're looking for a managed solution for running your pipelines.
you're looking for a serverless solution for running your pipelines.

How to deploy it

In order to use a Vertex AI orchestrator, you need to first deploy ZenML to the cloud. It would be recommended to deploy ZenML in the same Google Cloud project as where the Vertex infrastructure is deployed, but it is not necessary to do so. You must ensure that you are connected to the remote ZenML server before using this stack component.

The only other thing necessary to use the ZenML Vertex orchestrator is enabling Vertex-relevant APIs on the Google Cloud project.

In order to quickly enable APIs, and create other resources necessary for using this integration, you can also consider using the Vertex AI stack recipe, which helps you set up the infrastructure with one click.

How to use it

To use the Vertex orchestrator, we need:

The ZenML gcp integration installed. If you haven't done so, run
```
zenml integration install gcp
```
Docker installed and running.
A remote artifact store as part of your stack.
A remote container registry as part of your stack.
GCP credentials with proper permissions
The GCP project ID and location in which you want to run your Vertex AI pipelines.

GCP credentials and permissions

This part is without doubt the most involved part of using the Vertex orchestrator. In order to run pipelines on Vertex AI, you need to have a GCP user account and/or one or more GCP service accounts set up with proper permissions, depending on whether you want to schedule pipelines and depending on whether you wish to practice the principle of least privilege and distribute permissions across multiple service accounts.

You also have three different options to provide credentials to the orchestrator:

use the gcloud CLI to authenticate locally with GCP
configure the orchestrator to use a service account key file to authenticate with GCP by setting the service_account_path parameter in the orchestrator configuration.
(recommended) configure a GCP Service Connector with GCP credentials and then link the Vertex AI Orchestrator stack component to the Service Connector.

This section explains the different components and GCP resources involved in running a Vertex AI pipeline and what permissions they need, then provides instructions for three different configuration use-cases:

use the local gcloud CLI configured with your GCP user account, without the ability to schedule pipelines
use a GCP Service Connector and a single service account with all permissions, including the ability to schedule pipelines
use a GCP Service Connector and multiple service accounts for different permissions, including the ability to schedule pipelines

Vertex AI pipeline components

To understand what accounts you need to provision and why, let's look at the different components of the Vertex orchestrator:

the ZenML client environment is the environment where you run the ZenML code responsible for building the pipeline Docker image and submitting the pipeline to Vertex AI, among other things. This is usually your local machine or some other environment used to automate running pipelines, like a CI/CD job. This environment needs to be able to authenticate with GCP and needs to have the necessary permissions to create a job in Vertex Pipelines, (e.g. the Vertex AI User role). If you are planning to run pipelines on a schedule, the ZenML client environment also needs additional permissions:
- permissions to create a Google Cloud Function (e.g. with the Cloud Functions Developer Role).
- permissions to create a Google Cloud Scheduler (e.g. with the Cloud Scheduler Admin Role).
- the Storage Object Creator Role to be able to write the pipeline JSON file to the artifact store directly (NOTE: not needed if the Artifact Store is configured with credentials or is linked to Service Connector)
the Vertex AI pipeline environment is the GCP environment in which the pipeline steps themselves are running in GCP. The Vertex AI pipeline runs in the context of a GCP service account which we'll call here the workload service account. The workload service account can be explicitly configured in the orchestrator configuration via the workload_service_account parameter. If it is omitted, the orchestrator will use the Compute Engine default service account for the GCP project in which the pipeline is running. This service account needs to have the following permissions:
- permissions to run a Vertex AI pipeline, (e.g. the Vertex AI Service Agent role).
the scheduler Google Cloud Function is a GCP resource that is used to trigger the pipeline on a schedule. This component is only needed if you intend on running Vertex AI pipelines on a schedule. The scheduler function runs in the context of a GCP service account which we'll call here the function service account. The function service account can be explicitly configured in the orchestrator configuration via the function_service_account parameter. If it is omitted, the orchestrator will use the Compute Engine default service account for the GCP project in which the pipeline is running. This service account needs to have the following permissions:
- permissions to create a job in Vertex Pipelines, (e.g. the Vertex AI User role).
the Google Cloud Scheduler is a GCP resource that is used to trigger the pipeline on a schedule. This component is only needed if you intend on running Vertex AI pipelines on a schedule. The scheduler needs a GCP service account to authenticate to the scheduler Google Cloud Function. Let's call this service account the scheduler service account. The scheduler service account can be explicitly configured in the orchestrator configuration via the scheduler_service_account parameter. If it is omitted, the orchestrator will use the following, in order of precedence:
- the service account used by the ZenML client environment credentials, if present.
- the service account specified in the function_service_account parameter.
- the service account specified in the workload_service_account parameter.

The scheduler service account must have the following permissions:

permissions to trigger the scheduler function, (e.g. the Cloud Functions Invoker role and the Cloud Run Invoker role).
the Storage Object Viewer Role to be able to read the pipeline JSON file from the artifact store.

As you can see, there can be as many as three different service accounts involved in running a Vertex AI pipeline. Four, if you also use a service account to authenticate to GCP in the ZenML client environment. However, you can keep it simple an use the same service account everywhere.

Configuration use-case: local `gcloud` CLI with user account

This configuration use-case assumes you have configured the gcloud CLI to authenticate locally with your GCP account (i.e. by running gcloud auth login). It also assumes the following:

you are not planning to run pipelines on a schedule.
your GCP account has permissions to create a job in Vertex Pipelines, (e.g. the Vertex AI User role).
the Compute Engine default service account for the GCP project in which the pipeline is running is updated with additional permissions required to run a Vertex AI pipeline, (e.g. the Vertex AI Service Agent role).

This is the easiest way to configure the Vertex AI Orchestrator, but it has the following drawbacks:

you can't run pipelines on a schedule.
the setup is not portable on other machines and reproducible by other users.
it uses the Compute Engine default service account, which is not recommended, given that it has a lot of permissions by default and is used by many other GCP services.

We can then register the orchestrator as follows:

zenml orchestrator register <ORCHESTRATOR_NAME> \
    --flavor=vertex \
    --project=<PROJECT_ID> \
    --location=<GCP_LOCATION> \
    --synchronous=true

Configuration use-case: GCP Service Connector with single service account

This configuration uses a single GCP service account that has all the permissions needed to run and/or schedule a Vertex AI pipeline. This configuration is useful if you want to run pipelines on a schedule, but don't want to use the Compute Engine default service account. Using a Service Connector brings the added benefit of making your pipeline fully portable.

This use-case assumes you have already configured a GCP service account with the following permissions:

permissions to create a job in Vertex Pipelines, (e.g. the Vertex AI User role).
permissions to run a Vertex AI pipeline, (e.g. the Vertex AI Service Agent role).
permissions to create a Google Cloud Function (e.g. with the Cloud Functions Developer Role).
the Storage Object Creator Role to be able to write the pipeline JSON file to the artifact store directly.
permissions to trigger the scheduler function, (e.g. the Cloud Functions Invoker role and the Cloud Run Invoker role).
permissions to create a Google Cloud Scheduler job (e.g. with the Cloud Scheduler Admin Role).

It also assumes you have already created a service account key for this service account and downloaded it to your local machine (e.g. in a connectors-vertex-ai-workload.json file).

This setup is portable and reproducible, but it throws all the permissions in a single service account, which is not recommended if you are conscious about security. The principle of least privilege is not applied here and the environment in which the pipeline steps are running has too many permissions that it doesn't need.

We can then register the GCP Service Connector and Vertex AI orchestrator as follows:

zenml service-connector register <CONNECTOR_NAME> --type gcp --auth-method=service-account --project_id=<PROJECT_ID> --service_account_json=@connectors-vertex-ai-workload.json --resource-type gcp-generic

zenml orchestrator register <ORCHESTRATOR_NAME> \
    --flavor=vertex \
    --location=<GCP_LOCATION> \
    --synchronous=true \
    --workload_service_account=<SERVICE_ACCOUNT_NAME>@<PROJECT_NAME>.iam.gserviceaccount.com \
    --function_service_account=<SERVICE_ACCOUNT_NAME>@<PROJECT_NAME>.iam.gserviceaccount.com \
    --scheduler_service_account=<SERVICE_ACCOUNT_NAME>@<PROJECT_NAME>.iam.gserviceaccount.com

zenml orchestrator connect <ORCHESTRATOR_NAME> --connector <CONNECTOR_NAME>

Configuration use-case: GCP Service Connector with different service accounts

This setup applies the principle of least privilege by using different service accounts with the minimum of permissions needed for the different components involved in running a Vertex AI pipeline. It also uses a GCP Service Connector to make the setup portable and reproducible. This configuration is a best-in-class setup that you would normally use in production, but it requires a lot more work to prepare.

This setup involves creating and configuring several GCP service accounts, which is a lot of work and can be error prone. If you don't really need the added security, you can use the GCP Service Connector with a single service account instead.

The following GCP service accounts are needed:

a "client" service account that has the following permissions:
- permissions to create a job in Vertex Pipelines, (e.g. the Vertex AI User role).
- permissions to create a Google Cloud Function (e.g. with the Cloud Functions Developer Role).
- permissions to create a Google Cloud Scheduler job (e.g. with the Cloud Scheduler Admin Role).
- the Storage Object Creator Role to be able to write the pipeline JSON file to the artifact store directly (NOTE: not needed if the Artifact Store is configured with credentials or is linked to Service Connector).
a "workload" service account that has permissions to run a Vertex AI pipeline, (e.g. the Vertex AI Service Agent role).
a "function" service account that has the following permissions:
- permissions to create a job in Vertex Pipelines, (e.g. the Vertex AI User role).
- the Storage Object Viewer Role to be able to read the pipeline JSON file from the artifact store.
The "client" service account also needs to be granted the iam.serviceaccounts.actAs permission on this service account (i.e. the "client" service account needs the Service Account User role on the "function" service account). Similarly, the "function" service account also needs to be granted the iam.serviceaccounts.actAs permission on the "workload" service account.
a "scheduler" service account that has permissions to trigger the scheduler function, (e.g. the Cloud Functions Invoker role and the Cloud Run Invoker role). The "client" service account also needs to be granted the iam.serviceaccounts.actAs permission on this service account (i.e. the "client" service account needs the Service Account User role on the "scheduler" service account).

A key is also needed for the "client" service account. You can create a key for this service account and download it to your local machine (e.g. in a connectors-vertex-ai-workload.json file).

With all the service accounts and the key ready, we can register the GCP Service Connector and Vertex AI orchestrator as follows:

zenml service-connector register <CONNECTOR_NAME> --type gcp --auth-method=service-account --project_id=<PROJECT_ID> --service_account_json=@connectors-vertex-ai-workload.json --resource-type gcp-generic

zenml orchestrator register <ORCHESTRATOR_NAME> \
    --flavor=vertex \
    --location=<GCP_LOCATION> \
    --synchronous=true \
    --workload_service_account=<WORKLOAD_SERVICE_ACCOUNT_NAME>@<PROJECT_NAME>.iam.gserviceaccount.com \
    --function_service_account=<FUNCTION_SERVICE_ACCOUNT_NAME>@<PROJECT_NAME>.iam.gserviceaccount.com \
    --scheduler_service_account=<SCHEDULER_SERVICE_ACCOUNT_NAME>@<PROJECT_NAME>.iam.gserviceaccount.com

zenml orchestrator connect <ORCHESTRATOR_NAME> --connector <CONNECTOR_NAME>

Configuring the stack

With the orchestrator registered, we can use it in our active stack:

# Register and activate a stack with the new orchestrator
zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set

ZenML will build a Docker image called <CONTAINER_REGISTRY_URI>/zenml:<PIPELINE_NAME> which includes your code and use it to run your pipeline steps in Vertex AI. Check out this page if you want to learn more about how ZenML builds these images and how you can customize them.

You can now run any ZenML pipeline using the Vertex orchestrator:

python file_that_runs_a_zenml_pipeline.py

Vertex UI

Vertex comes with its own UI that you can use to find further details about your pipeline runs, such as the logs of your steps.

For any runs executed on Vertex, you can get the URL to the Vertex UI in Python using the following code snippet:

from zenml.client import Client

pipeline_run = Client().get_pipeline_run("<PIPELINE_RUN_NAME>")
orchestrator_url = pipeline_run.run_metadata["orchestrator_url"].value

Run pipelines on a schedule

The Vertex Pipelines orchestrator supports running pipelines on a schedule, using logic resembling the official approach recommended by GCP.

ZenML utilizes the Cloud Scheduler and Cloud Functions services to enable scheduling on Vertex Pipelines. The following is the sequence of events that happen when running a pipeline on Vertex with a schedule:

A docker image is created and pushed (see above containerization).
The Vertex AI pipeline JSON file is copied to the Artifact Store specified in your Stack
Cloud Function is created that creates the Vertex Pipeline job when triggered.
A Cloud Scheduler job is created that triggers the Cloud Function on the defined schedule.

Therefore, to run on a schedule, the client environment needs additional permissions and a GCP service account at least is required for the Cloud Scheduler job to be able to authenticate with the Cloud Function, as explained in the GCP credentials and permissions section.

How to schedule a pipeline

from zenml.config.schedule import Schedule

# Run a pipeline every 5th minute
pipeline_instance.run(
    schedule=Schedule(
        cron_expression="*/5 * * * *"
    )
)

The Vertex orchestrator only supports the cron_expression parameter in the Schedule object, and will ignore all other parameters supplied to define the schedule.

How to delete a scheduled pipeline

Note that ZenML only gets involved to schedule a run, but maintaining the lifecycle of the schedule is the responsibility of the user.

In order to cancel a scheduled Vertex pipeline, you need to manually delete the generated Google Cloud Function, along with the Cloud Scheduler job that schedules it (via the UI or the CLI).

Additional configuration

For additional configuration of the Vertex orchestrator, you can pass VertexOrchestratorSettings which allows you to configure node selectors, affinity, and tolerations to apply to the Kubernetes Pods running your pipeline. These can be either specified using the Kubernetes model objects or as dictionaries.

from zenml.integrations.gcp.flavors.vertex_orchestrator_flavor import VertexOrchestratorSettings
from kubernetes.client.models import V1Toleration

vertex_settings = VertexOrchestratorSettings(
    pod_settings={
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": "node.kubernetes.io/name",
                                    "operator": "In",
                                    "values": ["my_powerful_node_group"],
                                }
                            ]
                        }
                    ]
                }
            }
        },
        "tolerations": [
            V1Toleration(
                key="node.kubernetes.io/name",
                operator="Equal",
                value="",
                effect="NoSchedule"
            )
        ]
    }
)

If your pipelines steps have certain hardware requirements, you can specify them as ResourceSettings:

resource_settings = ResourceSettings(cpu_count=8, memory="16GB")

These settings can then be specified on either pipeline-level or step-level:

# Either specify on pipeline-level
@pipeline(
    settings={
        "orchestrator.vertex": vertex_settings,
        "resources": resource_settings,
    }
)
def my_pipeline():
    ...

# OR specify settings on step-level
@step(
    settings={
        "orchestrator.vertex": vertex_settings,
        "resources": resource_settings,
    }
)
def my_step():
    ...

Check out the SDK docs for a full list of available attributes and this docs page for more information on how to specify settings.

For more information and a full list of configurable attributes of the Vertex orchestrator, check out the API Docs .

Enabling CUDA for GPU-backed hardware

Note that if you wish to use this orchestrator to run steps on a GPU, you will need to follow the instructions on this page to ensure that it works. It requires adding some extra settings customization and is essential to enable CUDA for the GPU to give its full acceleration.

PreviousKubernetes Orchestrator NextAWS Sagemaker Orchestrator

Last updated 4 months ago