Enable Snapshot Support
Enable snapshot support for self-hosted ZenML Pro workspaces
The Workspace Server includes a Workload Manager feature that allows running pipelines directly from the ZenML Pro UI. This feature requires access to a Kubernetes cluster where ad-hoc pipeline runner pods can be created.
Snapshots are only available from ZenML Pro Workspace Server version 0.90.0 onwards.
Snapshot support is only available for workspace servers that are deployed on Kubernetes. Workspace servers that are deployed on AWS ECS or other platforms are currently not supported.
Prerequisites
Basic requirements:
Kubernetes cluster (1.24+) accessible from the workspace server
Dedicated namespace for runner pods
Service accounts with RBAC permissions to create/manage pods
Image pull secrets for the service accounts
Understanding Workload Manager Sub-features
Running pipelines from the UI relies on running Kubernetes jobs (aka "runner" jobs) that are responsible for launching the pipelines in the same manner as when running them from the CLI or SDK. These jobs need to use container images with the correct Python package dependencies to be able to launch the pipelines. There are several ways to achieve this and you'll need to choose the one that best fits your needs:
Reuse snapshot container images: the same pipeline container images that are built for the snapshot being run can also be used for the "runner" jobs. For this to work, you have to grant the "runner" jobs pull access to all container registries where these images are stored (i.e. the Container Registries used in your ZenML Stacks). This option only allows running snapshots associated with stacks whose container registry the "runner" jobs can pull from.
Build "runner" container images on-demand: in this variant, the Workspace Server will launch additional Kubernetes jobs to build the "runner" images when needed and push them to a configured container registry. It requires these "builder" Kubernetes jobs to have push permissions to a private container registry and the "runner" jobs to have pull access to the same container registry. This option is the most flexible and allows running snapshots associated with any stack and any integration.
Use pre-built "runner" image: you can provide a single pre-built "runner" image (stored by you in a container registry) for all runs. This is the simplest and fastest option, but you have to ensure that the image has all the correct Python package dependencies to be able to launch the pipelines. This option is the most limited and requires you to pre-build a container image that contains all the dependencies for all possible stacks and integrations used by your pipelines.
Store logs externally: By default, logs shown in the ZenML Pro UI are extracted from the "runner" job pods. Since pods may disappear, you can configure external log storage where these logs will be stored. Currently, this is only supported with the AWS implementation. If you enable this, you need to configure the S3 bucket and region where the logs will be stored and grant the ZenML Pro Workspace Server pods write access to this bucket.
There are two available Workload Manager implementations:
Kubernetes: Runs pipelines in the same Kubernetes cluster as the ZenML Pro workspace server.
AWS: Extends the Kubernetes implementation to build/push images to AWS ECR and store logs in AWS S3.
1. Create Kubernetes Resources for the Workload Manager
Create a dedicated namespace and service account where the runner jobs will be launched.
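A minimal sketch of these resources, using illustrative names (`zenml-workloads` for the namespace and `zenml-workload-manager` for the service account; adjust both to your environment):

```yaml
# Namespace where the "runner" (and optionally "builder") jobs will be created.
apiVersion: v1
kind: Namespace
metadata:
  name: zenml-workloads
---
# Service account that the runner jobs will run as.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: zenml-workload-manager
  namespace: zenml-workloads
```

The names chosen here are referenced later through the `ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE` and `ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT` environment variables.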
2. Choose Implementation
Your choice of implementation will determine the additional environment variables you need to configure in the ZenML Workspace Server Helm deployment:
Option A: Kubernetes Implementation (Basic)
Provides generic Kubernetes functionality to run snapshots.
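As a sketch of the corresponding Helm values, assuming the chart exposes a `zenml.environment` map for extra server environment variables (check your chart's values schema) and using the resource names from step 1; the implementation class value is a placeholder to be replaced with the class documented for your server version:

```yaml
zenml:
  environment:
    # Placeholder: use the Kubernetes workload manager class for your server version.
    ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: "<kubernetes-implementation-class>"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: "zenml-workloads"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: "zenml-workload-manager"
```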
Option B: AWS Implementation
Provides AWS-specific features including external S3 logs and ECR integration.
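A sketch of the corresponding Helm values, under the same assumptions as above (a `zenml.environment` map for extra env vars; the implementation class, bucket, and region values are placeholders):

```yaml
zenml:
  environment:
    # Placeholder: use the AWS workload manager class for your server version.
    ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: "<aws-implementation-class>"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: "zenml-workloads"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: "zenml-workload-manager"
    # Optional: store runner logs in S3 instead of reading them from the pods.
    ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS: "true"
    ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET: "my-zenml-logs-bucket"
    ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION: "us-east-1"
```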
3. Configure the Runner Image Source
Choose how runner images are managed. Your choice of image source determines the additional environment variables you need to configure in the ZenML Workspace Server Helm deployment:
Option 1: Reuse snapshot container images
Reuse the container images built for the snapshot being run.
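In Helm values terms (a sketch, assuming the `zenml.environment` map mentioned earlier), this option requires neither on-demand builds nor a pre-built image:

```yaml
zenml:
  environment:
    # Leave ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE unset and keep builds
    # disabled so the snapshot's own pipeline image is reused for the runner job.
    ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "false"
```

Remember that the workload manager service account must be able to pull from every container registry used in your ZenML Stacks (see step 4).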
Option 2: Have ZenML Build Runner Images
Build the runner images on-demand.
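A sketch of the corresponding Helm values (the registry URI is an example; the "builder" jobs need push access and the "runner" jobs need pull access to it):

```yaml
zenml:
  environment:
    ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "true"
    # Example registry: builder jobs push runner images here, runner jobs pull from here.
    ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY: "registry.example.com/zenml-runners"
```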
Option 3: Use a Pre-built Runner Image
Use a pre-built runner image for all runs.
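A sketch of the corresponding Helm values (the image reference is an example; the image must contain all Python dependencies needed to launch your pipelines):

```yaml
zenml:
  environment:
    ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "false"
    # Example image reference; must bundle the dependencies for all stacks and
    # integrations used by your pipelines.
    ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE: "registry.example.com/zenml-runner:latest"
```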
4. Configure Permissions
The Kubernetes service account running the ZenML Workspace Server needs additional permissions:
permissions to create and manage jobs in the workload manager Kubernetes namespace set up in step 1.
if the AWS implementation is used and external S3 logs are enabled, permissions to write to the configured S3 bucket.
The workload manager Kubernetes service account set up in step 1 also needs the following container registry permissions:
permissions to pull images from the container registry where the runner images are stored.
if the option to build runner images on-demand is chosen, permissions to push images to the container registry where the runner images will be pushed.
Granting these permissions can be achieved in several ways:
grant the entire cluster access to the container registry
use implicit workload identity access to the container registry - available in most cloud providers by granting the Kubernetes service account direct access to the container registry
configure a service account with implicit access to the container registry - associating a cloud service identity (e.g. a GCP service account or an AWS IAM role) with the Kubernetes service account so that it inherits that identity's registry permissions
configure an image pull secret for the service account - similar to the previous option, but using a Kubernetes secret holding registry credentials instead of a cloud service identity
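As an illustrative example of the last option, an image pull secret can be created and attached to the workload manager service account from step 1 (all names and credential values below are placeholders):

```shell
# Create a registry credential secret in the workload namespace (example values).
kubectl create secret docker-registry zenml-runner-pull-secret \
  --namespace zenml-workloads \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password>

# Attach the secret to the service account so runner pods can pull images with it.
kubectl patch serviceaccount zenml-workload-manager \
  --namespace zenml-workloads \
  -p '{"imagePullSecrets": [{"name": "zenml-runner-pull-secret"}]}'
```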
5. Environment Variable Reference
All supported environment variables for workload manager configuration:
| Variable | Required | Description |
|---|---|---|
| `ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE` | Yes | Implementation class (see options above) |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE` | Yes | Kubernetes namespace for runner jobs |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT` | Yes | Kubernetes service account for runner jobs |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE` | No | Whether to build runner images (default: `false`) |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY` | Conditional | Registry for runner images (required if building images) |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE` | No | Pre-built runner image (used if not building) |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS` | No | Store logs externally (default: `false`, AWS only) |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_RESOURCES` | No | Pod resources in JSON format |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_TTL_SECONDS_AFTER_FINISHED` | No | Cleanup time for finished jobs (default: 2 days) |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_NODE_SELECTOR` | No | Node selector in JSON format |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_TOLERATIONS` | No | Tolerations in JSON format |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_JOB_BACKOFF_LIMIT` | No | Backoff limit for builder/runner jobs |
| `ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_FAILURE_POLICY` | No | Pod failure policy for builder/runner jobs |
| `ZENML_SERVER_MAX_CONCURRENT_TEMPLATE_RUNS` | No | Max concurrent snapshot runs per pod (default: 2) |
AWS-specific variables:
| Variable | Required | Description |
|---|---|---|
| `ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET` | Conditional | S3 bucket for logs (required if external logs enabled) |
| `ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION` | Conditional | AWS region (required if building images) |
6. Complete Configuration Examples
Minimal Kubernetes Configuration:
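A minimal sketch, assuming the `zenml.environment` map used in the earlier snippets; the implementation class is a placeholder, and the namespace and service account names come from step 1:

```yaml
zenml:
  environment:
    ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: "<kubernetes-implementation-class>"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: "zenml-workloads"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: "zenml-workload-manager"
```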
Full AWS Configuration:
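A sketch combining on-demand image builds with external S3 logs (implementation class, registry, bucket, and region values are placeholders):

```yaml
zenml:
  environment:
    ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: "<aws-implementation-class>"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: "zenml-workloads"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: "zenml-workload-manager"
    # Build runner images on demand and push them to ECR (example URI).
    ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "true"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY: "123456789012.dkr.ecr.us-east-1.amazonaws.com/zenml-runners"
    # Persist runner logs to S3 so they survive pod deletion.
    ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS: "true"
    ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET: "my-zenml-logs-bucket"
    ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION: "us-east-1"
    # Optional: resource requests for runner pods, as JSON.
    ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_RESOURCES: '{"requests": {"cpu": "250m", "memory": "512Mi"}}'
```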
Configuration with a Pre-built Runner Image:
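A sketch using a single pre-built runner image (implementation class and image reference are placeholders):

```yaml
zenml:
  environment:
    ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: "<kubernetes-implementation-class>"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: "zenml-workloads"
    ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: "zenml-workload-manager"
    # Example image reference; must contain the dependencies for all your pipelines.
    ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE: "registry.example.com/zenml-runner:latest"
```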
7. Update Workspace Deployment
Update your workspace server Helm values with workload manager configuration and redeploy:
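For example (release, chart, and namespace names below are placeholders for your own deployment):

```shell
# Re-deploy the workspace server with the updated values file.
helm upgrade zenml-pro-workspace <workspace-server-chart> \
  --namespace zenml-pro \
  --values values.yaml
```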