Enable Snapshot Support

Enable snapshot support for self-hosted ZenML Pro workspaces

The Workspace Server includes a Workload Manager feature that allows running pipelines directly from the ZenML Pro UI. This feature requires access to a Kubernetes cluster where ad-hoc pipeline runner pods can be created.

Snapshots are only available from ZenML Pro Workspace Server version 0.90.0 onwards.

Snapshot support is only available for workspace servers that are deployed on Kubernetes. Workspace servers that are deployed on AWS ECS or other platforms are currently not supported.

Prerequisites

Basic requirements:

Kubernetes cluster (1.24+) accessible from the workspace server
Dedicated namespace for runner pods
Service accounts with RBAC permissions to create/manage pods
Image pull secrets for the service accounts

Understanding Workload Manager Sub-features

Running pipelines from the UI relies on running Kubernetes jobs (aka "runner" jobs) that are responsible for launching the pipelines in the same manner as when running them from the CLI or SDK. These jobs need to use container images with the correct Python package dependencies to be able to launch the pipelines. There are several ways to achieve this and you'll need to choose the one that best fits your needs:

Reuse snapshot container images: the same pipeline container images that are built for the snapshot being run can also be used for the "runner" jobs. For this to work, you have to grant the "runner" jobs pull access to all container registries where these images are stored (i.e. the Container Registries used in your ZenML Stacks). This option allows running only snapshots associated with stacks that include the same container registry.
Build "runner" container images on-demand: in this variant, the Workspace Server will launch additional Kubernetes jobs to build the "runner" images when needed and push them to a configured container registry. It requires these "builder" Kubernetes jobs to have push permissions to a private container registry and the "runner" jobs to have pull access to the same container registry. This option is the most flexible and allows running snapshots associated with any stack and any integration.
Use a pre-built "runner" image: you provide a single pre-built "runner" image, stored by you in a container registry, that is used for all runs. This is the simplest and fastest option, but also the most limited: the image must contain the correct Python package dependencies for all possible stacks and integrations used by your pipelines.

Store logs externally: By default, logs shown in the ZenML Pro UI are extracted from the "runner" job pods. Since pods may disappear, you can configure external log storage where these logs will be stored. Currently, this is only supported with the AWS implementation. If you enable this, you need to configure the S3 bucket and region where the logs will be stored and grant the ZenML Pro Workspace Server pods write access to this bucket.

The following Workload Manager implementations are available:

Kubernetes: Runs pipelines in the same Kubernetes cluster as the ZenML Pro workspace server.
AWS: Extends the Kubernetes implementation to build/push images to AWS ECR and store logs in AWS S3.

1. Create Kubernetes Resources for the Workload Manager

Create a dedicated namespace and service account where the runner jobs will be launched.

# Create namespace
kubectl create namespace zenml-workload-manager

# Create service account
kubectl -n zenml-workload-manager create serviceaccount zenml-workload-manager
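
If you manage cluster resources declaratively, the two commands above can also be written as a manifest. The following is a sketch of the equivalent objects, using the same names as the commands; it can be applied with kubectl apply -f <file>.

```yaml
# Namespace where the runner jobs are launched
apiVersion: v1
kind: Namespace
metadata:
    name: zenml-workload-manager
---
# Service account used by the runner jobs
apiVersion: v1
kind: ServiceAccount
metadata:
    name: zenml-workload-manager
    namespace: zenml-workload-manager
```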

2. Choose Implementation

Your choice of implementation will determine the additional environment variables you need to configure in the ZenML Workspace Server Helm deployment:

Option A: Kubernetes Implementation (Basic)

Provides generic Kubernetes functionality to run snapshots.

zenml:
    environment:
        ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.kubernetes_workload_manager.KubernetesWorkloadManager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workload-manager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-workload-manager

Option B: AWS Implementation

Provides AWS-specific features including external S3 logs and ECR integration.

zenml:
    environment:
        ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.aws_kubernetes_workload_manager.AWSKubernetesWorkloadManager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workload-manager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-workload-manager
        ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION: eu-central-1
        # To enable storing logs externally in S3, also set the following environment variables:
        ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS: "true"
        ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET: s3://my-bucket/run-template-logs

3. Configure the Runner Image Source

Choose how runner images are managed. Your choice of runner image source will determine the additional environment variables you need to configure in the ZenML Workspace Server Helm deployment:

Option 1: Reuse Snapshot Container Images

Reuse the container images built for the snapshot being run.

zenml:
    environment:
        ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "false"
        # Keep this empty or skip setting it to reuse the snapshot container images
        # ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE:

Option 2: Have ZenML Build Runner Images

Build the runner images on-demand.

zenml:
    environment:
        ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "true"
        ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY: internal-registry.mycompany.com/zenml

Option 3: Use a Pre-built Runner Image

Use a pre-built runner image for all runs.

zenml:
    environment:
        ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "false"
        ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE: internal-registry.mycompany.com/zenml/zenml:<ZENML_OSS_VERSION>

4. Configure Permissions

The Kubernetes service account running the ZenML Workspace Server needs additional permissions:

permissions to create and manage jobs in the workload manager Kubernetes namespace set up in step 1.
if the AWS implementation is used and external S3 logs are enabled, permissions to write to the configured S3 bucket.
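
As an illustration of the first permission, the manifest below is a minimal sketch of a Role and RoleBinding that let the workspace server's service account manage jobs in the zenml-workload-manager namespace from step 1. The service account name zenml-server and its namespace zenml-workspace are assumptions; substitute the values from your deployment. The set of resources and verbs is also an assumption (jobs, plus read access to pods and their logs, since the UI shows logs extracted from the runner pods) and may need adjusting for your server version.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
    name: zenml-workload-manager-jobs   # hypothetical name
    namespace: zenml-workload-manager
rules:
    # Create and manage builder/runner jobs
    - apiGroups: ["batch"]
      resources: ["jobs"]
      verbs: ["create", "get", "list", "watch", "delete"]
    # Read runner pods and their logs (assumed to be needed for log display)
    - apiGroups: [""]
      resources: ["pods", "pods/log"]
      verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
    name: zenml-workload-manager-jobs   # hypothetical name
    namespace: zenml-workload-manager
subjects:
    - kind: ServiceAccount
      name: zenml-server          # assumption: the workspace server's service account
      namespace: zenml-workspace  # assumption: the workspace server's namespace
roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: zenml-workload-manager-jobs
```

A namespaced Role is used here rather than a ClusterRole so that the server's additional permissions stay confined to the workload manager namespace.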

The workload manager Kubernetes service account set up in step 1 also needs the following container registry permissions:

permissions to pull images from the container registry where the runner images are stored.
if the option to build runner images on-demand is chosen, permissions to push images to the container registry where the runner images will be pushed.

Granting these permissions can be achieved in several ways:

grant the entire cluster access to the container registry
use implicit workload identity access to the container registry: in most cloud providers, this is done by granting the Kubernetes service account access to the container registry
configure a service account with implicit access to the container registry by associating a cloud service identity (e.g. a GCP service account or an AWS IAM role) with the Kubernetes service account
configure an image pull secret for the service account - similar to the previous option, but using a Kubernetes secret instead of a cloud service identity
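
For the image pull secret option, the service account from step 1 could look roughly like the sketch below. The secret name runner-registry-credentials is hypothetical; the secret must be of type kubernetes.io/dockerconfigjson, live in the same namespace, and can be created with kubectl create secret docker-registry. The commented-out annotation shows what the cloud service identity option looks like on AWS EKS (IAM roles for service accounts); the role ARN is a placeholder.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
    name: zenml-workload-manager
    namespace: zenml-workload-manager
    # Alternative (cloud service identity on AWS EKS): annotate the service
    # account with an IAM role that has access to the container registry.
    # annotations:
    #     eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME>
imagePullSecrets:
    # Hypothetical secret name; the secret must exist in the same namespace
    - name: runner-registry-credentials
```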

Environment Variable Reference

All supported environment variables for workload manager configuration:

| Variable | Required | Description |
| --- | --- | --- |
| ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE | Yes | Implementation class (see options above) |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE | Yes | Kubernetes namespace for runner jobs |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT | Yes | Kubernetes service account for runner jobs |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE | No | Whether to build runner images (default: false) |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY | Conditional | Registry for runner images (required if building images) |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE | No | Pre-built runner image (used if not building) |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS | No | Store logs externally (default: false, AWS only) |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_RESOURCES | No | Pod resources in JSON format |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_TTL_SECONDS_AFTER_FINISHED | No | Cleanup time for finished jobs (default: 2 days) |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_NODE_SELECTOR | No | Node selector in JSON format |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_TOLERATIONS | No | Tolerations in JSON format |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_JOB_BACKOFF_LIMIT | No | Backoff limit for builder/runner jobs |
| ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_FAILURE_POLICY | No | Pod failure policy for builder/runner jobs |
| ZENML_SERVER_MAX_CONCURRENT_TEMPLATE_RUNS | No | Max concurrent snapshot runs per pod (default: 2) |

AWS-specific variables:

| Variable | Required | Description |
| --- | --- | --- |
| ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET | Conditional | S3 bucket for logs (required if external logs enabled) |
| ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION | Conditional | AWS region (required if building images) |

5. Complete Configuration Examples

Minimal Kubernetes Configuration:

zenml:
    environment:
        ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.kubernetes_workload_manager.KubernetesWorkloadManager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workload-manager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-workload-manager

Full AWS Configuration:

zenml:
    environment:
        ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.aws_kubernetes_workload_manager.AWSKubernetesWorkloadManager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workload-manager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-workload-manager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "true"
        ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY: 339712793861.dkr.ecr.eu-central-1.amazonaws.com
        ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS: "true"
        ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_RESOURCES: '{"requests": {"cpu": "100m", "memory": "400Mi"}, "limits": {"memory": "700Mi"}}'
        ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET: s3://my-bucket/run-template-logs
        ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION: eu-central-1
        ZENML_KUBERNETES_WORKLOAD_MANAGER_NODE_SELECTOR: '{"node-pool": "zenml-pool"}'
        ZENML_KUBERNETES_WORKLOAD_MANAGER_TOLERATIONS: '[{"key": "node-pool", "operator": "Equal", "value": "zenml-pool", "effect": "NoSchedule"}]'
        ZENML_SERVER_MAX_CONCURRENT_TEMPLATE_RUNS: "10"
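
The JSON strings given for the pod resources, node selector and tolerations variables follow the structure of the corresponding Kubernetes pod spec fields. Assuming the workload manager passes them through to the runner pods unchanged (an assumption, not something this page confirms), the values above would correspond to a pod spec fragment like the following. The container name runner is hypothetical.

```yaml
# Illustrative fragment of a runner pod spec, not a complete manifest
spec:
    nodeSelector:
        node-pool: zenml-pool
    tolerations:
        - key: node-pool
          operator: Equal
          value: zenml-pool
          effect: NoSchedule
    containers:
        - name: runner   # hypothetical container name
          resources:
              requests:
                  cpu: 100m
                  memory: 400Mi
              limits:
                  memory: 700Mi
```

For the toleration to matter, the target nodes would carry a matching taint (node-pool=zenml-pool:NoSchedule); the node selector additionally requires the node-pool: zenml-pool label on those nodes.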

Configuration with a Pre-built Runner Image:

zenml:
    environment:
        ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.kubernetes_workload_manager.KubernetesWorkloadManager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workload-manager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-workload-manager
        ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "false"
        ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE: internal-registry.mycompany.com/zenml/zenml:<ZENML_OSS_VERSION>
        ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_RESOURCES: '{"requests": {"cpu": "100m", "memory": "400Mi"}, "limits": {"memory": "700Mi"}}'
        ZENML_KUBERNETES_WORKLOAD_MANAGER_TTL_SECONDS_AFTER_FINISHED: "86400"
        ZENML_SERVER_MAX_CONCURRENT_TEMPLATE_RUNS: "2"

6. Update Workspace Deployment

Update your workspace server Helm values with workload manager configuration and redeploy:

helm upgrade zenml ./zenml-<version>.tgz \
  --namespace zenml-workspace \
  --values zenml-workspace-values.yaml
