Self-hosted deployment

Guide for installing ZenML Pro self-hosted in a Kubernetes cluster.

This page provides instructions for installing ZenML Pro - the ZenML Pro Control Plane and one or more ZenML Pro Workspace servers - on-premise in a Kubernetes cluster. For more general information on deploying ZenML, visit our documentation where we explain the different options you have.

Overview

ZenML Pro can be installed as a self-hosted deployment. You need to be granted access to the ZenML Pro container images and you'll have to provide your own infrastructure: a Kubernetes cluster, a database server and a few other common prerequisites usually needed to expose Kubernetes services via HTTPS - a load balancer, an Ingress controller, HTTPS certificate(s) and DNS rule(s).

This document will guide you through the process.

Please note that the SSO (Single Sign-On) feature is currently not available in the on-prem version of ZenML Pro. This feature is on our roadmap and will be added in future releases.

Preparation and prerequisites

Software Artifacts

The ZenML Pro on-prem installation relies on a set of container images and Helm charts. The container images are stored in private ZenML container registries that are not available to the public.

If you haven't done so already, please book a demo to get access to the private ZenML Pro container images.

ZenML Pro Control Plane Artifacts

The following artifacts are required to install the ZenML Pro control plane in your own Kubernetes cluster:

  • private container images for the ZenML Pro API server:

    • 715803424590.dkr.ecr.eu-west-1.amazonaws.com/zenml-pro-api in AWS

    • europe-west3-docker.pkg.dev/zenml-cloud/zenml-pro/zenml-pro-api in GCP

  • private container images for the ZenML Pro dashboard:

    • 715803424590.dkr.ecr.eu-west-1.amazonaws.com/zenml-pro-dashboard in AWS

    • europe-west3-docker.pkg.dev/zenml-cloud/zenml-pro/zenml-pro-dashboard in GCP

  • the public ZenML Pro helm chart (as an OCI artifact): oci://public.ecr.aws/zenml/zenml-pro

The container image tags and the Helm chart versions are both synchronized and linked to the ZenML Pro releases. You can find the ZenML Pro Helm chart along with the available released versions in the ZenML Pro ArtifactHub repository.

If you're planning on copying the container images to your own private registry (recommended if your Kubernetes cluster isn't running on AWS and can't authenticate directly to the ZenML Pro container registry), make sure to keep the same image tags.

By default, the ZenML Pro Helm chart uses the same container image tags as the Helm chart version. Configuring custom container image tags when setting up your Helm deployment is also possible, but not recommended because it doesn't yield reproducible results and may even cause problems if used with the wrong Helm chart version.

ZenML Pro Workspace Server Artifacts

The following artifacts are required to install ZenML Pro workspace servers in your own Kubernetes cluster:

  • private container images for the ZenML Pro workspace server:

    • 715803424590.dkr.ecr.eu-central-1.amazonaws.com/zenml-pro-server in AWS

    • europe-west3-docker.pkg.dev/zenml-cloud/zenml-pro/zenml-pro-server in GCP

  • the public open-source ZenML Helm chart (as an OCI artifact): oci://public.ecr.aws/zenml/zenml

The container image tags and the Helm chart versions are both synchronized and linked to the ZenML open-source releases. To find the latest ZenML OSS release, please check the ZenML OSS ArtifactHub repository (Helm chart versions) or the ZenML release page.

If you're planning on copying the container images to your own private registry (recommended if your Kubernetes cluster isn't running on AWS and can't authenticate directly to the ZenML Pro container registry), make sure to keep the same image tags.

By default, the ZenML OSS Helm chart uses the same container image tags as the Helm chart version. Configuring custom container image tags when setting up your Helm deployment is also possible, but not recommended because it doesn't yield reproducible results and may even cause problems if used with the wrong Helm chart version.

ZenML Pro Client Artifacts

If you're planning on running containerized ZenML pipelines, or using other containerization-related ZenML features, you'll also need access to the public ZenML client container image, available on Docker Hub as zenmldocker/zenml. This isn't a problem unless you're deploying ZenML Pro in an air-gapped environment, in which case you'll have to copy the client container image into your own container registry and configure your code to use the correct base container registry via DockerSettings (see the DockerSettings documentation for more information).

Accessing the ZenML Pro Container Images

This section provides instructions for how to access the private ZenML Pro container images.

Currently, ZenML Pro container images are only available in AWS Elastic Container Registry (ECR) and Google Cloud Platform (GCP) Artifact Registry. Support for Azure Container Registry (ACR) is on our roadmap and will be added soon.

The ZenML support team can provide credentials upon request, which can be used to pull these images without the need to set up any cloud provider accounts or resources. Contact support if you'd prefer this option.

AWS

To access the ZenML Pro container images stored in AWS ECR, you need to set up an AWS IAM user or IAM role in your AWS account. The steps below outline how to create an AWS account, configure the necessary IAM entities, and pull images from the private repositories. If you're familiar with AWS or plan on using an AWS EKS cluster to deploy ZenML Pro, you can simply use your existing IAM user or IAM role and skip Steps 1 and 2.


  • Step 1: Create a Free AWS Account

    1. Click Create a Free Account.

    2. Follow the on-screen instructions to provide your email address, create a root user, and set a secure password.

    3. Enter your contact and payment information for verification purposes. While a credit or debit card is required, you won't be charged for free-tier eligible services.

    4. Confirm your email and complete the verification process.

    5. Log in to the AWS Management Console using your root user credentials.

  • Step 2: Create an IAM User or IAM Role

    A. Create an IAM User

    1. Log in to the AWS Management Console.

    2. Navigate to the IAM service.

    3. Click Users in the left-hand menu, then click Add Users.

    4. Provide a user name (e.g., zenml-ecr-access).

    5. Select Access Key - Programmatic access as the AWS credential type.

    6. Click Next: Permissions.

    7. Choose Attach policies directly, then select the following policies:

      • AmazonEC2ContainerRegistryReadOnly

    8. Click Next: Tags and optionally add tags for organization purposes.

    9. Click Next: Review, then Create User.

    10. Note the Access Key ID and Secret Access Key displayed after creation. Save these securely.

    B. Create an IAM Role

    1. Navigate to the IAM service.

    2. Click Roles in the left-hand menu, then click Create Role.

    3. Choose the type of trusted entity:

      • Select AWS Account.

    4. Enter your AWS account ID and click Next.

    5. Select the AmazonEC2ContainerRegistryReadOnly policy.

    6. Click Next: Tags, optionally add tags, then click Next: Review.

    7. Provide a role name (e.g., zenml-ecr-access-role) and click Create Role.

  • Step 3: Provide the IAM User/Role ARN

    1. For an IAM user, the ARN can be found in the Users section under the Summary tab.

    2. For an IAM role, the ARN is displayed in the Roles section under the Summary tab.

    Send the ARN to ZenML Support so it can be granted permission to access the ZenML Pro container images and Helm charts.

  • Step 4: Authenticate your Docker Client

    Run these steps on the machine that you'll use to pull the ZenML Pro images. It is recommended to copy the container images into your own container registry, one that is accessible from the Kubernetes cluster where ZenML Pro will be deployed. Otherwise, you'll have to configure the Kubernetes cluster to authenticate directly to the ZenML Pro container registry, which is problematic if your Kubernetes cluster is not running on AWS.

    A. Install AWS CLI

    1. Follow the instructions to install the AWS CLI: AWS CLI Installation Guide.

    B. Configure AWS CLI Credentials

    1. Open a terminal and run aws configure

    2. Enter the following when prompted:

      • Access Key ID: Provided during IAM user creation.

      • Secret Access Key: Provided during IAM user creation.

      • Default region name: eu-west-1

      • Default output format: Leave blank or enter json.

    3. If you chose to use an IAM role, update the AWS CLI configuration file to specify the role you want to assume. Open the configuration file located at ~/.aws/config and add the following:
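      A sketch of the configuration, where zenml is an example profile name:

      ```ini
      [profile zenml]
      role_arn = <IAM-ROLE-ARN>
      source_profile = default
      ```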

      Replace <IAM-ROLE-ARN> with the ARN of the role you created and ensure source_profile points to a profile with sufficient permissions to assume the role.

    C. Authenticate Docker with ECR

    Run the following command to authenticate your Docker client with the ZenML ECR repository:
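    For example, for the registry hosting the zenml-pro-api and zenml-pro-dashboard images (repeat with --region eu-central-1 and the 715803424590.dkr.ecr.eu-central-1.amazonaws.com registry for the zenml-pro-server image):

    ```bash
    aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 715803424590.dkr.ecr.eu-west-1.amazonaws.com
    ```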

    If you used an IAM role, use the specified profile to execute commands. For example:
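    Here, zenml is the example profile name configured above:

    ```bash
    aws ecr get-login-password --region eu-west-1 --profile zenml | docker login --username AWS --password-stdin 715803424590.dkr.ecr.eu-west-1.amazonaws.com
    ```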

    This will allow you to authenticate to the ZenML Pro container registries and pull the necessary images with Docker, e.g.:
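    In the following, <version> stands for the release tag you chose:

    ```bash
    docker pull 715803424590.dkr.ecr.eu-west-1.amazonaws.com/zenml-pro-api:<version>
    docker pull 715803424590.dkr.ecr.eu-west-1.amazonaws.com/zenml-pro-dashboard:<version>
    docker pull 715803424590.dkr.ecr.eu-central-1.amazonaws.com/zenml-pro-server:<version>
    ```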

To decide which tag to use, check the available released versions in the ZenML Pro ArtifactHub repository (for the zenml-pro-api and zenml-pro-dashboard images) and the ZenML OSS ArtifactHub repository or the ZenML release page (for the zenml-pro-server image); the container image tags match the corresponding Helm chart versions.

Note that the zenml-pro-api and zenml-pro-dashboard images are stored in the eu-west-1 region, while the zenml-pro-server image is stored in the eu-central-1 region.

GCP

To access the ZenML Pro container images stored in Google Cloud Platform (GCP) Artifact Registry, you need to set up a GCP account and configure the necessary permissions. The steps below outline how to create a GCP account, configure authentication, and pull images from the private repositories. If you're familiar with GCP or plan on using a GKE cluster to deploy ZenML Pro, you can use your existing GCP account and skip step 1.


  • Step 1: Create a GCP Account

    1. Click Get Started for Free or sign in with an existing Google account.

    2. Follow the on-screen instructions to set up your account and create a project.

    3. Set up billing information (required for using GCP services).

  • Step 2: Create a Service Account

    1. Navigate to the IAM & Admin > Service Accounts page in the Google Cloud Console.

    2. Click Create Service Account.

    3. Enter a service account name (e.g., zenml-gar-access).

    4. Add a description (optional) and click Create and Continue.

    5. No additional permissions are needed as access will be granted directly to the Artifact Registry.

    6. Click Done.

    7. After creation, click on the service account to view its details.

    8. Go to the Keys tab and click Add Key > Create new key.

    9. Choose JSON as the key type and click Create.

    10. Save the downloaded JSON key file securely - you'll need it later.

  • Step 3: Provide the Service Account Email

    1. In the service account details page, copy the service account email address (it should look like zenml-gar-access@<your-project-id>.iam.gserviceaccount.com).

    2. Send this email address to ZenML Support so it can be granted permission to access the ZenML Pro container images.

  • Step 4: Authenticate your Docker Client

    Run these steps on the machine that you'll use to pull the ZenML Pro images. It is recommended to copy the container images into your own container registry, one that is accessible from the Kubernetes cluster where ZenML Pro will be deployed.

    A. Install Google Cloud CLI

    1. Follow the instructions to install the Google Cloud CLI.

    2. Initialize the CLI by running:
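      ```bash
      gcloud init
      ```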

    B. Configure Authentication

    1. Activate the service account using the JSON key file you downloaded:
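      Assuming the key file was saved as key.json:

      ```bash
      gcloud auth activate-service-account --key-file=key.json
      ```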

    2. Configure Docker authentication for Artifact Registry:
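      ```bash
      gcloud auth configure-docker europe-west3-docker.pkg.dev
      ```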

    C. Pull the Container Images

    You can now pull the ZenML Pro images:
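    In the following, <version> stands for the release tag you chose:

    ```bash
    docker pull europe-west3-docker.pkg.dev/zenml-cloud/zenml-pro/zenml-pro-api:<version>
    docker pull europe-west3-docker.pkg.dev/zenml-cloud/zenml-pro/zenml-pro-dashboard:<version>
    docker pull europe-west3-docker.pkg.dev/zenml-cloud/zenml-pro/zenml-pro-server:<version>
    ```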

To decide which tag to use, check the available released versions in the ZenML Pro ArtifactHub repository (for the zenml-pro-api and zenml-pro-dashboard images) and the ZenML OSS ArtifactHub repository or the ZenML release page (for the zenml-pro-server image).

Air-Gapped Installation

If you need to install ZenML Pro in an air-gapped environment (a network with no direct internet access), you'll need to transfer all required artifacts to your internal infrastructure. Here's a step-by-step process:

1. Prepare a Machine with Internet Access

First, you'll need a machine with both internet access and sufficient storage space to temporarily store all artifacts. On this machine:

  1. Follow the authentication steps described above to gain access to the private repositories

  2. Install the required tools:

    • Docker

    • Helm

2. Download All Required Artifacts

A Bash script like the following can be used to download all necessary components, or you can run the listed commands manually:
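The following is a sketch that uses the AWS registries; the version placeholders must be replaced with the releases you plan to install:

```bash
#!/bin/bash
set -euo pipefail

# Versions to download - adjust to the releases you plan to install
ZENML_PRO_VERSION=<zenml-pro-version>   # ZenML Pro control plane chart + images
ZENML_OSS_VERSION=<zenml-oss-version>   # ZenML OSS chart + workspace server image

REGISTRY=715803424590.dkr.ecr.eu-west-1.amazonaws.com
SERVER_REGISTRY=715803424590.dkr.ecr.eu-central-1.amazonaws.com

mkdir -p zenml-artifacts && cd zenml-artifacts

# Pull and save the container images
docker pull ${REGISTRY}/zenml-pro-api:${ZENML_PRO_VERSION}
docker pull ${REGISTRY}/zenml-pro-dashboard:${ZENML_PRO_VERSION}
docker pull ${SERVER_REGISTRY}/zenml-pro-server:${ZENML_OSS_VERSION}
docker save -o zenml-pro-api.tar ${REGISTRY}/zenml-pro-api:${ZENML_PRO_VERSION}
docker save -o zenml-pro-dashboard.tar ${REGISTRY}/zenml-pro-dashboard:${ZENML_PRO_VERSION}
docker save -o zenml-pro-server.tar ${SERVER_REGISTRY}/zenml-pro-server:${ZENML_OSS_VERSION}

# Download the Helm charts as local archives
helm pull oci://public.ecr.aws/zenml/zenml-pro --version ${ZENML_PRO_VERSION}
helm pull oci://public.ecr.aws/zenml/zenml --version ${ZENML_OSS_VERSION}

# Bundle everything into a single archive for transfer
cd .. && tar -czf zenml-artifacts.tar.gz zenml-artifacts
```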

3. Transfer Artifacts to Air-Gapped Environment

  1. Copy the zenml-artifacts.tar.gz file to your preferred transfer medium (e.g., USB drive, approved file transfer system)

  2. Transfer the archive to a machine in your air-gapped environment that has access to your internal container registry

4. Load Artifacts in Air-Gapped Environment

Create a script to load the artifacts in your air-gapped environment or run the listed commands manually:
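A sketch, assuming the archive produced by the download script above and a hypothetical internal registry address:

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical values - adjust to your environment
INTERNAL_REGISTRY=registry.internal.mydomain.com
ZENML_PRO_VERSION=<zenml-pro-version>
ZENML_OSS_VERSION=<zenml-oss-version>
SRC_REGISTRY=715803424590.dkr.ecr.eu-west-1.amazonaws.com
SRC_SERVER_REGISTRY=715803424590.dkr.ecr.eu-central-1.amazonaws.com

tar -xzf zenml-artifacts.tar.gz && cd zenml-artifacts

# Load the saved images into the local Docker daemon
docker load -i zenml-pro-api.tar
docker load -i zenml-pro-dashboard.tar
docker load -i zenml-pro-server.tar

# Re-tag the images for the internal registry, keeping the original tags, and push
docker tag ${SRC_REGISTRY}/zenml-pro-api:${ZENML_PRO_VERSION} ${INTERNAL_REGISTRY}/zenml-pro-api:${ZENML_PRO_VERSION}
docker tag ${SRC_REGISTRY}/zenml-pro-dashboard:${ZENML_PRO_VERSION} ${INTERNAL_REGISTRY}/zenml-pro-dashboard:${ZENML_PRO_VERSION}
docker tag ${SRC_SERVER_REGISTRY}/zenml-pro-server:${ZENML_OSS_VERSION} ${INTERNAL_REGISTRY}/zenml-pro-server:${ZENML_OSS_VERSION}
docker push ${INTERNAL_REGISTRY}/zenml-pro-api:${ZENML_PRO_VERSION}
docker push ${INTERNAL_REGISTRY}/zenml-pro-dashboard:${ZENML_PRO_VERSION}
docker push ${INTERNAL_REGISTRY}/zenml-pro-server:${ZENML_OSS_VERSION}
```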

5. Update Configuration

When deploying ZenML Pro in your air-gapped environment, make sure to update all references to container images in your Helm values to point to your internal registry. For example:
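A sketch for the ZenML Pro control plane values, assuming the zenml.image.api and zenml.image.dashboard keys take a repository sub-key (check the chart's values.yaml for the exact structure):

```yaml
zenml:
  image:
    api:
      repository: registry.internal.mydomain.com/zenml-pro-api
    dashboard:
      repository: registry.internal.mydomain.com/zenml-pro-dashboard
```

For workspace servers, point zenml.image.repository at your internal copy of the zenml-pro-server image in the same way.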

Remember to maintain the same version tags when copying images to your internal registry to ensure compatibility between components.

6. Using the Helm Charts

After downloading the Helm charts, you can use their local paths instead of a remote OCI registry to deploy ZenML Pro components. Here's an example of how to use them:
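A sketch, assuming the chart archives downloaded above sit in the zenml-artifacts directory and that the namespaces and release names follow the conventions used later in this guide:

```bash
# Control plane, installed from the local chart archive
helm --namespace zenml-pro upgrade --install zenml-pro \
  ./zenml-artifacts/zenml-pro-<zenml-pro-version>.tgz \
  --create-namespace --values my-values.yaml

# Workspace server, installed from the local chart archive
helm --namespace zenml-pro-my-workspace upgrade --install zenml \
  ./zenml-artifacts/zenml-<zenml-oss-version>.tgz \
  --create-namespace --values zenml-my-workspace-values.yaml
```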

Infrastructure Requirements

To deploy the ZenML Pro control plane and one or more ZenML Pro workspace servers, ensure the following prerequisites are met:

  1. Kubernetes Cluster

    A functional Kubernetes cluster is required as the primary runtime environment.

  2. Database Server(s)

    The ZenML Pro Control Plane and ZenML Pro Workspace servers need to connect to an external database server. To minimize the infrastructure resources needed, you can share a single database server between the Control Plane and all workspaces, or you can use separate database servers for server-level database isolation, as long as you keep the following limitations in mind:

    • the ZenML Pro Control Plane can be connected to either MySQL or Postgres as the external database

    • the ZenML Pro Workspace servers can only be connected to a MySQL database (no Postgres support is available)

    • the ZenML Pro Control Plane as well as every ZenML Pro Workspace server needs to use its own individual database (especially important when connected to the same server)

    Ensure you have a valid username and password for the different ZenML Pro services. For improved security, it is recommended to have different users for different services. If the database user does not have permissions to create databases, you must also create a database and give the user full permissions to access and manage it (i.e. create, update and delete tables).

  3. Ingress Controller

    Install an Ingress provider in the cluster (e.g., NGINX, Traefik) to handle HTTP(S) traffic routing. Ensure the Ingress provider is properly configured to expose the cluster's services externally.

  4. Domain Name

    You'll need an FQDN for the ZenML Pro Control Plane as well as for every ZenML Pro workspace. For this reason, it's highly recommended to use a DNS prefix and associated SSL certificate instead of individual FQDNs and SSL certificates, to make this process easier.

    • FQDN or DNS Prefix Setup

      Obtain a Fully Qualified Domain Name (FQDN) or DNS prefix (e.g., *.zenml-pro.mydomain.com) from your DNS provider.

      • Identify the external Load Balancer IP address of the Ingress controller using the command kubectl get svc -n <ingress-namespace>. Look for the EXTERNAL-IP field of the Load Balancer service.

      • Create a DNS A record (or CNAME for subdomains) pointing the FQDN to the Load Balancer IP. Example:

        • Host: zenml-pro.mydomain.com

        • Type: A

        • Value: <Load Balancer IP>

      • Use a DNS propagation checker to confirm that the DNS record is resolving correctly.

  5. SSL Certificate

    The ZenML Pro services do not terminate SSL traffic. It is your responsibility to generate and configure the necessary SSL certificates for the ZenML Pro Control Plane as well as all the ZenML Pro workspaces that you will deploy (see the previous point on how to use a DNS prefix to make the process easier).

    • Obtaining SSL Certificates

      Acquire an SSL certificate for the domain. You can use:

      • A commercial SSL certificate provider (e.g., DigiCert, Sectigo).

      • Free services like Let's Encrypt for domain validation and issuance.

      • Self-signed certificates (not recommended for production environments). IMPORTANT: If you are using self-signed certificates, it is highly recommended to use the same self-signed CA certificate for all the ZenML Pro services (control plane and workspace servers), otherwise it will be difficult to manage the certificates on the client machines. With only one CA certificate, you can install it system-wide on all the client machines only once and then use it to sign all the TLS certificates for the ZenML Pro services.

    • Configuring SSL Termination

      Once the SSL certificate is obtained, configure your load balancer or Ingress controller to terminate HTTPS traffic:

      For NGINX Ingress Controller:

      You can configure SSL termination globally for the NGINX Ingress Controller by setting up a default SSL certificate or configuring it at the ingress controller level, or you can specify SSL certificates when configuring the ingress in the ZenML server Helm values.

      Here's how you can do it globally:

      1. Create a TLS Secret

        Store your SSL certificate and private key as a Kubernetes TLS secret in the namespace where the NGINX Ingress Controller is deployed.
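        For example, assuming the controller runs in the ingress-nginx namespace and default-ssl-certificate is the secret name you choose:

        ```bash
        kubectl -n ingress-nginx create secret tls default-ssl-certificate \
          --cert=path/to/tls.crt --key=path/to/tls.key
        ```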

      2. Update NGINX Ingress Controller Configurations

        Configure the NGINX Ingress Controller to use the default SSL certificate.

        • If using the NGINX Ingress Controller Helm chart, modify the values.yaml file or use --set during installation:
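          A sketch for the ingress-nginx chart values, referencing the TLS secret created above:

          ```yaml
          controller:
            extraArgs:
              default-ssl-certificate: "ingress-nginx/default-ssl-certificate"
          ```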

          Or directly pass the argument during Helm installation or upgrade:
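          For example, assuming the standard ingress-nginx Helm repository alias:

          ```bash
          helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
            --namespace ingress-nginx \
            --set controller.extraArgs.default-ssl-certificate="ingress-nginx/default-ssl-certificate"
          ```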

        • If the NGINX Ingress Controller was installed manually, edit its deployment to include the argument in the args section of the container:
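          A sketch of the relevant excerpt from the controller Deployment:

          ```yaml
          spec:
            containers:
              - name: controller
                args:
                  - /nginx-ingress-controller
                  - --default-ssl-certificate=ingress-nginx/default-ssl-certificate
                  # ... other existing arguments ...
          ```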

      For Traefik:

      • Configure Traefik to use TLS by creating a certificate resolver for Let's Encrypt or specifying the certificates manually in the traefik.yml or values.yaml file. Example for Let's Encrypt:
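        A sketch for the Traefik static configuration (traefik.yml), with placeholder email and storage path:

        ```yaml
        certificatesResolvers:
          letsencrypt:
            acme:
              email: admin@mydomain.com
              storage: /data/acme.json
              httpChallenge:
                entryPoint: web
        ```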

      • Reference the domain in your IngressRoute or Middleware configuration.

The above are the infrastructure requirements for ZenML Pro itself. If you would also like to reuse the same Kubernetes cluster to run machine learning workloads with ZenML, you will need additional infrastructure resources and services to set up a remote ZenML Stack (e.g. an artifact store, a container registry and an orchestrator).

Stage 1/2: Install the ZenML Pro Control Plane

Set up Credentials

If your Kubernetes cluster cannot already authenticate to the container registry where the ZenML Pro container images are hosted, you will need to create a secret to allow the ZenML Pro server to pull the images. The following is an example of how to do this if you've received a private access key for the ZenML GCP Artifact Registry from ZenML, but the same approach works for your own private container registry:
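A sketch, assuming the control plane will be installed in a zenml-pro namespace; _json_key_base64 is the Artifact Registry username used with base64-encoded JSON keys:

```bash
kubectl create namespace zenml-pro
kubectl -n zenml-pro create secret docker-registry image-pull-secret \
  --docker-server=europe-west3-docker.pkg.dev \
  --docker-username=_json_key_base64 \
  --docker-password="$(cat key.base64)"
```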

The key.base64 file should contain the base64 encoded JSON key for the GCP service account as received from the ZenML support team. The image-pull-secret secret will be used in the next step when installing the ZenML Pro helm chart.

Configure the Helm Chart

There are a variety of options that can be configured for the ZenML Pro helm chart before installation.

You can take a look at the Helm chart README and values.yaml file and familiarize yourself with some of the configuration settings that you can customize for your ZenML Pro deployment. Alternatively, you can unpack the README.md and values.yaml files included in the Helm chart:
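```bash
helm pull oci://public.ecr.aws/zenml/zenml-pro --version <version> --untar
# The chart is unpacked into a local zenml-pro/ directory
less zenml-pro/README.md zenml-pro/values.yaml
```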

This is an example Helm values YAML file that covers the most common configuration options:
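A sketch covering the minimum required settings; the exact sub-keys of zenml.database.external are documented in the chart's values.yaml, so treat the ones below as illustrative:

```yaml
imagePullSecrets:
  - name: image-pull-secret

zenml:
  serverURL: https://zenml-pro.mydomain.com
  ingress:
    enabled: true
    host: zenml-pro.mydomain.com
  database:
    external:
      # illustrative sub-keys - check the chart's values.yaml for the exact names
      host: mysql.mydomain.com
      port: 3306
      username: zenml-pro
      password: <password>
      database: zenml-pro
  auth:
    password: <initial-admin-password>
```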

Minimum required settings:

  • the database credentials (zenml.database.external)

  • the URL (zenml.serverURL) and Ingress hostname (zenml.ingress.host) where the ZenML Pro Control Plane API and Dashboard will be reachable

In addition to the above, the following might also be relevant for you:

  • configure container registry credentials (imagePullSecrets)

  • injecting custom CA certificates (zenml.certificates), especially important if the TLS certificates used by the ZenML Pro services are signed by a custom Certificate Authority

  • configure HTTP proxy settings (zenml.proxy)

  • custom container image repository locations (zenml.image.api and zenml.image.dashboard)

  • the username and password used for the default admin account (zenml.auth.password)

  • additional Ingress settings (zenml.ingress)

  • Kubernetes resources allocated to the pods (resources)

  • If you set up a common DNS prefix that you plan on using for all the ZenML Pro services, you may configure the domain of the HTTP cookies used by the ZenML Pro dashboard to match it by setting zenml.auth.authCookieDomain to the DNS prefix (e.g. .my.domain instead of zenml-pro.my.domain)

Install the Helm Chart

Ensure that your Kubernetes cluster has access to all the container images. By default, the tags used for the container images are the same as the Helm chart version and it is recommended to keep them in sync, even though it is possible to override the tag values.

To install the helm chart (assuming the customized configuration values are in a my-values.yaml file), run:
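For example, assuming a zenml-pro namespace and release name:

```bash
helm --namespace zenml-pro upgrade --install zenml-pro \
  oci://public.ecr.aws/zenml/zenml-pro --version <version> \
  --create-namespace --values my-values.yaml
```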

If the installation is successful, you should be able to see the following workloads running in your cluster:
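For example, with the namespace used above, you should see pods for the ZenML Pro API server and dashboard:

```bash
kubectl -n zenml-pro get pods
```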

The Helm chart will output information explaining how to connect and authenticate to the ZenML Pro dashboard.
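You can re-display these notes at any time, assuming the release name and namespace used above:

```bash
helm --namespace zenml-pro status zenml-pro
```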

The credentials are for the default administrator user account provisioned on installation. With these on hand, you can proceed to the next step and onboard additional users.

Install CA Certificates

If the TLS certificates used by the ZenML Pro services are signed by a custom Certificate Authority, you need to install the CA certificates on every machine that needs to access the ZenML server:

  • installing the CA certificates system-wide is usually the easiest solution. For example, on Ubuntu and Debian-based systems, you can install the CA certificates system-wide by copying the CA certificates into the /usr/local/share/ca-certificates directory and running update-ca-certificates.

  • for some browsers (e.g. Chrome), updating the system's CA certificates is not enough. You will also need to import the CA certificates into the browser.

  • for Python, you also need to set the REQUESTS_CA_BUNDLE environment variable to the path to the system's CA certificates bundle file (e.g. export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt)

  • later on, when you're running containerized pipelines with ZenML, you'll also want to install those same CA certificates into the container images built by ZenML by customizing the build process via DockerSettings. For example:

    • customize the ZenML client container image using a Dockerfile,

    • then build and push that image to your private container registry,

    • and finally update your ZenML pipeline code to use the custom ZenML client image by using the DockerSettings class. All three steps are sketched below.
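      A sketch of the three steps, using hypothetical names (my-custom-ca.crt, registry.internal.mydomain.com) and assuming a Debian-based ZenML client image; <version> stands for the ZenML release you use:

      ```dockerfile
      # Dockerfile: extend the public ZenML client image with a custom CA certificate
      FROM zenmldocker/zenml:<version>
      COPY my-custom-ca.crt /usr/local/share/ca-certificates/my-custom-ca.crt
      RUN update-ca-certificates
      ENV REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
      ```

      ```bash
      # Build and push the customized client image to your private registry
      docker build -t registry.internal.mydomain.com/zenml-client:<version> .
      docker push registry.internal.mydomain.com/zenml-client:<version>
      ```

      ```python
      from zenml import pipeline
      from zenml.config import DockerSettings

      # Use the custom client image as the parent image for all images built by ZenML
      docker_settings = DockerSettings(
          parent_image="registry.internal.mydomain.com/zenml-client:<version>",
      )

      @pipeline(settings={"docker": docker_settings})
      def my_pipeline():
          ...
      ```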

Onboard Additional Users

Creating user accounts is not currently supported in the ZenML Pro dashboard: a production ZenML Pro deployment should instead be configured to connect to an external OAuth 2.0 / OIDC identity provider.

However, this feature is currently supported with helper Python scripts, as described below.

  1. The deployed ZenML Pro service will come with a pre-installed default administrator account. This admin account is used to create and recover other user accounts. First, you will need to retrieve the admin password by following the instructions in the previous step.

  2. Create a users.yaml file that contains a list of all the users that you want to create for ZenML. Also set a default password. The users will be asked to change this password on their first login.
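    The exact schema is defined by the create_users.py script; a hypothetical users.yaml layout might look like this:

    ```yaml
    # Hypothetical layout - check create_users.py for the exact schema it expects
    default_password: change-me-on-first-login
    users:
      - email: alice@mydomain.com
      - email: bob@mydomain.com
    ```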

  3. Run the create_users.py script below. This will create all of the users.

    [file: create_users.py]

The script will prompt you for the URL of your deployment, the admin account username and password and finally the location of your users.yaml file.

Create an Organization

Head on over to your deployment in the browser and use one of the users you just created to log in.

After logging in for the first time, you will need to create a new password. (Be aware that, for the time being, only the admin account will be able to reset this password.)

Finally, you can create an Organization. This Organization will host all the workspaces you enroll in the next stage.

Invite Other Users to the Organization

Now you can invite your whole team to the organization. To do this, open the drop-down menu in the top right and head over to the settings page.

In the Members tab, add all the users you created in the previous step. Make sure to assign the appropriate role to each user.

Finally, send the account's username and initial password over to your team members.

Stage 2/2: Enroll and Deploy ZenML Pro workspaces

Installing and updating on-prem ZenML Pro workspace servers is not automated, as it is with the SaaS version. You will be responsible for enrolling workspace servers in the right ZenML Pro organization, installing them and regularly updating them. Some scripts are provided to simplify this task as much as possible.

Enrolling a Workspace

  1. Run the enroll-workspace.py script below

    This will collect all the necessary data, then enroll the workspace in the organization and generate a Helm values.yaml file template that you can use to install the workspace server:

    [file: enroll-workspace.py]

    Running the script does two things:

    • it creates a workspace entry in the ZenML Pro database. The workspace will remain in a "provisioning" state and won't be accessible until you actually install it using Helm.

    • it outputs a YAML file with Helm chart configuration values that you can use to deploy the ZenML Pro workspace server in your Kubernetes cluster.

    This is an example of a generated Helm YAML file:
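    A sketch only - the actual file is produced by the enrollment script and may contain additional workspace-specific values:

    ```yaml
    # Sketch - the enroll-workspace.py script generates the real file.
    # Values marked TODO must be filled in before installation.
    imagePullSecrets:
      - name: image-pull-secret  # TODO: container registry credentials
    zenml:
      image:
        repository: registry.internal.mydomain.com/zenml-pro-server  # TODO
      database:
        url: mysql://<user>:<password>@<host>:3306/zenml-my-workspace  # TODO
      serverURL: https://zenml-my-workspace.mydomain.com  # TODO
      ingress:
        host: zenml-my-workspace.mydomain.com  # TODO
    ```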

  2. Configure the ZenML Pro workspace Helm chart

    IMPORTANT: In configuring the ZenML Pro workspace Helm chart, keep the following in mind:

    • don't use the same database name for multiple workspaces

    • don't reuse the control plane database name for the workspace server database

    The ZenML Pro workspace server is nothing more than a slightly modified open-source ZenML server. The deployment even uses the official open-source helm chart.

    There are a variety of options that can be configured for the ZenML Pro workspace server chart before installation. You can start by taking a look at the Helm chart README and values.yaml file to familiarize yourself with some of the configuration settings that you can customize for your ZenML server deployment. Alternatively, you can unpack the README.md and values.yaml files included in the Helm chart:
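    ```bash
    helm pull oci://public.ecr.aws/zenml/zenml --version <version> --untar
    # The chart is unpacked into a local zenml/ directory
    less zenml/README.md zenml/values.yaml
    ```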

    To configure the Helm chart, use the YAML file generated at the previous step as a template and fill in the necessary values marked by TODO comments. At a minimum, you'll need to configure the following:

    • configure container registry credentials (imagePullSecrets, same as described for the control plane)

    • the MySQL database credentials (zenml.database.url)

    • the container image repository where the ZenML Pro workspace server container images are stored (zenml.image.repository)

    • the hostname where the ZenML Pro workspace server will be reachable (zenml.ingress.host and zenml.serverURL)

    You may also choose to configure additional features documented in the official OSS ZenML Helm deployment documentation pages, if you need them:

    • injecting custom CA certificates (zenml.certificates), especially important if the TLS certificate used for the ZenML Pro control plane is signed by a custom Certificate Authority

    • configure HTTP proxy settings (zenml.proxy)

    • set up secrets stores

    • configure database backup and restore

    • customize Kubernetes resources

    • etc.

  3. Deploy the ZenML Pro workspace server with Helm

    To install the helm chart (assuming the customized configuration values are in the generated zenml-my-workspace-values.yaml file), run e.g.:
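    For example, assuming a dedicated zenml-pro-my-workspace namespace:

    ```bash
    helm --namespace zenml-pro-my-workspace upgrade --install zenml \
      oci://public.ecr.aws/zenml/zenml --version <version> \
      --create-namespace --values zenml-my-workspace-values.yaml
    ```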

    The deployment is ready when the ZenML server pod is running and healthy:
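    For example, with the namespace used above:

    ```bash
    kubectl -n zenml-pro-my-workspace get pods
    ```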

    After deployment, your workspace should show up as running in the ZenML Pro dashboard and can be accessed at the next step.

    If you need to deploy multiple workspaces, simply run the enrollment script again with different values.

Accessing the Workspace

If you use TLS certificates for the ZenML Pro control plane or workspace server signed by a custom Certificate Authority, remember to install them on the client machines.

Accessing the Workspace Dashboard

The newly enrolled workspace should now be accessible in the ZenML Pro dashboard and via the CLI. If you're the organization admin, you may also need to add other users as workspace members if they don't have access to the workspace yet.

Then follow the instructions in the "Get Started" checklist to unlock the full dashboard.

Accessing the Workspace from the ZenML CLI

To log in to the workspace with the ZenML CLI, you need to pass the custom ZenML Pro API URL to the zenml login command:
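A sketch, assuming a workspace named my-workspace and that the ZenML Pro API is served under /api on the control plane domain:

```bash
zenml login my-workspace --pro-api-url https://zenml-pro.mydomain.com/api
```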

Alternatively, you can set the ZENML_PRO_API_URL environment variable:
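Under the same assumptions:

```bash
export ZENML_PRO_API_URL=https://zenml-pro.mydomain.com/api
zenml login my-workspace
```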

Enabling Snapshot Support

The ZenML Pro workspace server can be configured to optionally support running pipeline snapshots straight from the dashboard. This feature is not enabled by default and needs a few additional steps to be set up.

Snapshots come with some optional sub-features that can be turned on or off to customize the behavior of the feature:

  • Building runner container images: Running pipelines from the dashboard relies on Kubernetes jobs (aka "runner" jobs) that are triggered by the ZenML workspace server. These jobs need to use container images that have the correct Python software packages installed on them to be able to launch the pipelines.

    The good news is that snapshots are based on pipeline runs that have already run in the past and already have container images built and associated with them. The same container images can be reused by the ZenML workspace server for the "runner jobs". However, for this to work, the Kubernetes cluster itself has to be able to access the container registries where these images are stored. This can be achieved in several ways:

    • use implicit workload identity access to the container registry - available in most cloud providers by granting the Kubernetes service account access to the container registry

    • configure a service account with implicit access to the container registry - associating some cloud service identity (e.g. a GCP service account, an AWS IAM role, etc.) with the Kubernetes service account used by the "runner" jobs

    • configure an image pull secret for the service account - similar to the previous option, but using a Kubernetes secret instead of a cloud service identity

    When none of the above are available or desirable, an alternative approach is to configure the ZenML workspace server itself to build these "runner" container images and push them to a different container registry. This can be achieved by setting the ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE environment variable to true and the ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY environment variable to the container registry where the "runner" images will be pushed.

    Yet another alternative is to configure the ZenML workspace server to use a single pre-built "runner" image for all the pipeline runs. This can be achieved by keeping ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE environment variable set to false and the ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE environment variable set to the container image registry URI where the "runner" image is stored. Note that this image needs to have all requirements installed to instantiate the stack that will be used for the template run.

  • Store logs externally: By default, the ZenML workspace server will use the logs extracted from the "runner" job pods to populate the run template logs shown in the ZenML dashboard. These pods may disappear after a while, so the logs may not be available anymore.

    To avoid this, you can configure the ZenML workspace server to store the logs in an external location, like an S3 bucket. This can be achieved by setting the ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS environment variable to true.

    This option is only currently available with the AWS implementation of the snapshots feature and also requires the ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET environment variable to be set to point to the S3 bucket where the logs will be stored.

  1. Decide on an implementation.

    There are currently three different implementations of the snapshots feature:

    • Kubernetes: runs pipelines in the same Kubernetes cluster as the ZenML Pro workspace server.

    • AWS: extends the Kubernetes implementation to be able to build and push container images to AWS ECR and to store the run template logs in AWS S3.

    • GCP: currently, this is the same as the Kubernetes implementation, but we plan to extend it to be able to push container images to GCP GCR and to store run template logs in GCP GCS.

    If you're going for a fast, minimalistic setup, you should go for the Kubernetes implementation. If you want a complete cloud provider solution with all features enabled, you should go for the AWS implementation.

  2. Prepare Snapshots configuration.

    You'll need to prepare a list of environment variables that will be added to the Helm chart values used to deploy the ZenML workspace server.

    For all implementations, the following variables are supported:

    • ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE (mandatory): one of the values associated with the implementation you've chosen in step 1:

      • zenml_cloud_plugins.kubernetes_workload_manager.KubernetesWorkloadManager

      • zenml_cloud_plugins.aws_kubernetes_workload_manager.AWSKubernetesWorkloadManager

      • zenml_cloud_plugins.gcp_kubernetes_workload_manager.GCPKubernetesWorkloadManager

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE (mandatory): the Kubernetes namespace where the "runner" jobs will be launched. It must exist before the snapshots are enabled.

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT (mandatory): the Kubernetes service account to use for the "runner" jobs. It must exist before the snapshots are enabled.

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE (optional): whether to build the "runner" container images or not. Defaults to false.

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY (optional): the container registry where the "runner" images will be pushed. Mandatory if ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE is set to true, ignored otherwise.

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE (optional): the "runner" container image to use. Only used if ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE is set to false, ignored otherwise.

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS (optional): whether to store the logs of the "runner" jobs in an external location. Defaults to false. Currently only supported with the AWS implementation and requires the ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET variable to be set as well.

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_RESOURCES (optional): the Kubernetes pod resources specification to use for the "runner" jobs, in JSON format. Example: {"requests": {"cpu": "100m", "memory": "400Mi"}, "limits": {"memory": "700Mi"}}.

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_TTL_SECONDS_AFTER_FINISHED (optional): the time in seconds after which to clean up finished jobs and their pods. Defaults to 2 days.

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_NODE_SELECTOR (optional): the Kubernetes node selector to use for the "runner" jobs, in JSON format. Example: {"node-pool": "zenml-pool"}.

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_TOLERATIONS (optional): the Kubernetes tolerations to use for the "runner" jobs, in JSON format. Example: [{"key": "node-pool", "operator": "Equal", "value": "zenml-pool", "effect": "NoSchedule"}].

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_JOB_BACKOFF_LIMIT (optional): the Kubernetes backoff limit to use for the builder and runner jobs.

    • ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_FAILURE_POLICY (optional): the Kubernetes pod failure policy to use for the builder and runner jobs.

    • ZENML_SERVER_MAX_CONCURRENT_TEMPLATE_RUNS (optional): the maximum number of concurrent snapshot runs that can be started at the same time by each server container or pod. Defaults to 2. If a client exceeds this number, the request will be rejected with a 429 Too Many Requests HTTP error. Note that this only limits the number of parallel snapshots that can be started at the same time, not the number of parallel pipeline runs.

    For the AWS implementation, the following additional variables are supported:

    • ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET (optional): the S3 bucket where the logs will be stored (e.g. s3://my-bucket/run-template-logs). Mandatory if ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS is set to true.

    • ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION (optional): the AWS region where the container images will be pushed (e.g. eu-central-1). Mandatory if ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE is set to true.

  3. Create the Kubernetes resources.

    For the Kubernetes implementation, you'll need to create the following resources:

    • the Kubernetes namespace passed in the ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE variable.

    • the Kubernetes service account passed in the ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT variable. This service account will be used to build images and run the "runner" jobs, so it needs to have the necessary permissions to do so (e.g. access to the container images, permissions to push container images to the configured container registry, permissions to access the configured bucket, etc.).

  4. Finally, update the ZenML workspace server configuration to use the new implementation.

    The environment variables you prepared in step 2 need to be added to the Helm chart values used to deploy the ZenML workspace server and the ZenML server has to be updated as covered in the Day 2 Operations: Upgrades and Updates section.

    Example updated Helm values file (minimal configuration):
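    A sketch, assuming the OSS chart's zenml.environment map is used to inject the variables and that the zenml-workloads namespace and zenml-runner service account from step 3 exist:

    ```yaml
    zenml:
      environment:
        ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.kubernetes_workload_manager.KubernetesWorkloadManager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workloads
        ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-runner
    ```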

    Example updated Helm values file (full AWS configuration):
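    Under the same assumptions, with image building and external logs enabled (the registry, bucket and region values are placeholders):

    ```yaml
    zenml:
      environment:
        ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.aws_kubernetes_workload_manager.AWSKubernetesWorkloadManager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workloads
        ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-runner
        ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "true"
        ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY: <account-id>.dkr.ecr.eu-central-1.amazonaws.com
        ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS: "true"
        ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET: s3://my-bucket/run-template-logs
        ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION: eu-central-1
    ```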

    Example updated Helm values file (full GCP configuration):
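    The GCP implementation currently takes the same variables as the Kubernetes one, just with the GCP implementation source:

    ```yaml
    zenml:
      environment:
        ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.gcp_kubernetes_workload_manager.GCPKubernetesWorkloadManager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workloads
        ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-runner
    ```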

Day 2 Operations: Upgrades and Updates

This section covers how to upgrade or update your ZenML Pro deployment. The process involves updating both the ZenML Pro Control Plane and the ZenML Pro workspace servers.

Upgrade Checklist

  1. Check Available Versions and Release Notes

  2. Fetch and Prepare New Software Artifacts

    • Follow the Software Artifacts section to get access to the new versions of:

      • ZenML Pro Control Plane container images and Helm chart

      • ZenML Pro workspace server container images and Helm chart

    • If using a private registry, copy the new container images to your private registry

    • If you are using an air-gapped installation, follow the Air-Gapped Installation instructions

  3. Upgrade the ZenML Pro Control Plane

    • Option A - In-place upgrade with existing values. Use this if you don't need to change any configuration values as part of the upgrade:
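      For example, assuming the zenml-pro namespace and release name used during installation:

      ```bash
      helm --namespace zenml-pro upgrade zenml-pro \
        oci://public.ecr.aws/zenml/zenml-pro --version <new-version> \
        --reuse-values
      ```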

    • Option B - Retrieve, modify and reapply values, if necessary. Use this if you need to change any configuration values as part of the upgrade or if you are performing a configuration update without upgrading the ZenML Pro Control Plane.
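      For example:

      ```bash
      # Save the currently deployed values
      helm --namespace zenml-pro get values zenml-pro > my-values.yaml
      # Edit my-values.yaml as needed, then apply it with the target chart version
      helm --namespace zenml-pro upgrade zenml-pro \
        oci://public.ecr.aws/zenml/zenml-pro --version <new-version> \
        --values my-values.yaml
      ```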

  4. Upgrade ZenML Pro Workspace Servers

    • For each workspace, perform either:

      • Option A - In-place upgrade with existing values. Use this if you don't need to change any configuration values as part of the upgrade:
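        For example, assuming the per-workspace namespace and release name used during installation:

        ```bash
        helm --namespace zenml-pro-my-workspace upgrade zenml \
          oci://public.ecr.aws/zenml/zenml --version <new-version> \
          --reuse-values
        ```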

      • Option B - Retrieve, modify and reapply values, if necessary. Use this if you need to change any configuration values as part of the upgrade or if you are performing a configuration update without upgrading the ZenML Pro Workspace Server.
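        For example:

        ```bash
        # Save the currently deployed values
        helm --namespace zenml-pro-my-workspace get values zenml > zenml-my-workspace-values.yaml
        # Edit the values as needed, then apply them with the target chart version
        helm --namespace zenml-pro-my-workspace upgrade zenml \
          oci://public.ecr.aws/zenml/zenml --version <new-version> \
          --values zenml-my-workspace-values.yaml
        ```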
