Kubernetes with Helm
Deploy ZenML Pro Self-hosted on Kubernetes with Helm - complete self-hosted setup with no external dependencies.
This guide provides step-by-step instructions for deploying ZenML Pro in a fully air-gapped setup on Kubernetes using Helm charts. In an air-gapped deployment, all components run within your infrastructure with zero external dependencies.
Architecture Overview
All components run entirely within your Kubernetes cluster and infrastructure:

Prerequisites
Before starting, you need:
Infrastructure:
Kubernetes cluster (1.24+) within your air-gapped network
MySQL database (8.0+) for metadata storage (PostgreSQL also supported for control plane only)
Internal Docker registry (Harbor, Quay, Artifactory, etc.)
Load balancer or Ingress controller for HTTPS
NFS or object storage for artifacts (optional)
Network:
Internal DNS resolution
TLS certificates signed by your internal CA
Network connectivity between cluster components
Tools (on a machine with internet access for initial setup):
Docker
Helm (3.0+)
Access to pull ZenML Pro images from private registries (credentials from ZenML)
Step 1: Prepare Offline Artifacts
This step is performed on a machine with internet access, then transferred to your air-gapped environment.
1.1 Pull Container Images
On a machine with internet access and access to the ZenML Pro container registries:
Authenticate to the ZenML Pro container registries (AWS ECR or GCP Artifact Registry)
Use credentials provided by ZenML Support
Follow registry-specific authentication procedures
Pull all required images:
Pro Control Plane images (AWS ECR):
715803424590.dkr.ecr.eu-west-1.amazonaws.com/zenml-pro-api:<version>
715803424590.dkr.ecr.eu-west-1.amazonaws.com/zenml-pro-dashboard:<version>
Pro Control Plane images (GCP Artifact Registry):
europe-west3-docker.pkg.dev/zenml-cloud/zenml-pro/zenml-pro-api:<version>
europe-west3-docker.pkg.dev/zenml-cloud/zenml-pro/zenml-pro-dashboard:<version>
Workspace Server image (AWS ECR):
715803424590.dkr.ecr.eu-central-1.amazonaws.com/zenml-pro-server:<version>
Workspace Server image (GCP Artifact Registry):
europe-west3-docker.pkg.dev/zenml-cloud/zenml-pro/zenml-pro-server:<version>
Client image (for pipelines):
zenmldocker/zenml:<version>
Example pull commands (AWS ECR):
Example pull commands (GCP Artifact Registry):
Tag images with your internal registry:
Save images to tar files for transfer:
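As a rough sketch of the pull, tag, and save sequence described above, the helper below only computes the `docker` commands to run; it does not execute them. The internal registry host `internal-registry.mycompany.com`, the `zenml` project prefix, and the tar file naming scheme are assumptions made for illustration, not values taken from the ZenML charts or documentation.

```python
"""Plan the pull/tag/save steps for an air-gapped transfer (illustrative sketch)."""

# Hypothetical internal registry host and project; replace with your own.
INTERNAL_REGISTRY = "internal-registry.mycompany.com"
INTERNAL_PROJECT = "zenml"


def internal_ref(source_ref, registry=INTERNAL_REGISTRY, project=INTERNAL_PROJECT):
    """Map a source image reference to the internal registry.

    Only the final repository name and tag are kept, e.g.
    '<source-host>/zenml-pro-api:<version>' becomes
    '<registry>/<project>/zenml-pro-api:<version>'.
    """
    name_and_tag = source_ref.rsplit("/", 1)[-1]
    return f"{registry}/{project}/{name_and_tag}"


def tar_name(source_ref):
    """Derive a tar file name such as 'zenml-pro-api_<version>.tar'."""
    name_and_tag = source_ref.rsplit("/", 1)[-1]
    return name_and_tag.replace(":", "_") + ".tar"


def plan(source_refs):
    """Return the docker commands to run, each as an argument list."""
    commands = []
    for ref in source_refs:
        target = internal_ref(ref)
        commands.append(["docker", "pull", ref])
        commands.append(["docker", "tag", ref, target])
        commands.append(["docker", "save", "-o", tar_name(ref), target])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on the internet-connected machine after authenticating to the source registries. Saving the already retagged image means the tag survives `docker load` on the air-gapped side.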
1.2 Download Helm Charts
On the same machine with internet access:
Pull the Helm charts:
ZenML Pro Control Plane:
oci://public.ecr.aws/zenml/zenml-pro
ZenML Workspace Server:
oci://public.ecr.aws/zenml/zenml
Save charts as .tgz files for transfer
1.3 Create Offline Bundle
Create a bundle containing all artifacts:
The manifest should document:
All image names and versions
Helm chart versions
Date of bundle creation
Required internal registry URLs
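One possible way to produce such a manifest is sketched below. The file name `manifest.json`, the JSON field names, and the per-file SHA-256 checksums are assumptions for illustration; the checksums go beyond the list above and are included only so that the bundle can be verified after transfer.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir, images, charts, registry_urls):
    """Collect the documented fields plus a checksum for every bundled file."""
    bundle = Path(bundle_dir)
    checksums = {
        path.relative_to(bundle).as_posix(): sha256_of(path)
        for path in sorted(bundle.rglob("*"))
        if path.is_file() and path.name != "manifest.json"
    }
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        "images": list(images),                # image names and versions
        "helm_charts": dict(charts),           # chart name -> chart version
        "internal_registry_urls": list(registry_urls),
        "sha256": checksums,                   # extra: integrity check after transfer
    }


def write_manifest(bundle_dir, manifest):
    """Write the manifest next to the bundled artifacts and return its path."""
    target = Path(bundle_dir) / "manifest.json"
    target.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return target
```

After the transfer in Step 2, recomputing `sha256_of` for each listed file and comparing it with the manifest is a cheap way to detect a truncated or corrupted copy.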
Step 2: Transfer to Air-gapped Environment
Transfer the bundle to your air-gapped environment using approved methods:
Physical media (USB drive, external drive)
Approved secure file transfer system
Air-gap transfer appliances
Any method compliant with your security policies
Step 3: Load Images into Internal Registry
In your air-gapped environment, load the images:
Extract all tar files:
Tag images for your internal registry:
Push images to your internal registry:
Step 4: Create Kubernetes Secrets
Step 5: Set Up Databases
Create database instances (within your air-gapped network):
Important Database Support:
Control Plane: Supports both PostgreSQL and MySQL
Workspace Servers: Only support MySQL (PostgreSQL is not supported)
Configuration:
Accessibility: Reachable from your Kubernetes cluster
Databases: At least 2 (one for control plane, one for workspace)
Users: Create dedicated database users with permissions
Backups: Configure automated backups to local storage
Monitoring: Enable local log aggregation
Connection strings needed for later:
Control Plane DB (PostgreSQL or MySQL):
postgresql://user:password@db-host:5432/zenml_pro or mysql://user:password@db-host:3306/zenml_pro
Workspace DB (MySQL only):
mysql://user:password@db-host:3306/zenml_workspace
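Because the workspace server accepts MySQL only while the control plane accepts both engines, it can help to check the connection strings before putting them into Helm values. The following is a minimal sketch using only the standard library; the component names `control_plane` and `workspace` and the returned dictionary layout are illustrative choices, not part of ZenML.

```python
from urllib.parse import urlsplit

# Engines each component supports, as stated in the database support notes above.
ALLOWED_SCHEMES = {
    "control_plane": {"postgresql", "mysql"},
    "workspace": {"mysql"},
}

DEFAULT_PORTS = {"postgresql": 5432, "mysql": 3306}


def check_db_url(url, component):
    """Validate a connection string and return its parts (without the password)."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES[component]:
        raise ValueError(
            f"{component} does not support the '{parts.scheme}' scheme"
        )
    database = parts.path.lstrip("/")
    if not (parts.hostname and parts.username and database):
        raise ValueError("URL must include a user, a host and a database name")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS[parts.scheme],
        "database": database,
        "user": parts.username,
    }
```

For example, `check_db_url("postgresql://user:password@db-host:5432/zenml_workspace", "workspace")` raises a `ValueError`, which catches the most likely misconfiguration early.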
Step 6: Configure Helm Values for Control Plane
Create a file zenml-pro-values.yaml:
Step 7: Deploy ZenML Pro Control Plane
Using the local Helm chart:
Verify deployment:
Wait for all pods to be running and healthy.
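"Running and healthy" can be checked mechanically from the JSON that `kubectl get pods -n <namespace> -o json` prints. The sketch below assumes that output is passed in as a string; treating `Succeeded` pods as healthy is a choice made here so that completed Job pods are not reported, and may not match every setup.

```python
import json


def unhealthy_pods(pod_list_json):
    """Return the names of pods that are not Running with all containers ready."""
    problems = []
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed Job pods are not a problem
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            problems.append(pod["metadata"]["name"])
    return problems
```

An empty result means every pod in the namespace is running with all of its containers reporting ready.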
Step 8: Enroll Workspace in Control Plane
Before deploying the workspace server, you must enroll it in the control plane to obtain the necessary enrollment credentials.
Access the Control Plane Dashboard
Navigate to https://zenml-pro.internal.mycompany.com
Log in with your admin credentials
Create an Organization (if not already created)
Go to Organization settings
Create a new organization or use an existing one
Note the Organization ID and Name
Enroll the Workspace
Either use the enrollment script from the Self-hosted Deployment Guide, or
create a workspace through the dashboard. In both cases, obtain:
Enrollment Key
Organization ID
Organization Name
Workspace ID
Workspace Name
Save these values - you'll need them in the next step
Step 9: Configure Helm Values for Workspace Server
Create a file zenml-workspace-values.yaml:
Step 10: Deploy ZenML Workspace Server
Verify deployment:
Step 11: Configure Internal DNS
Update your internal DNS to resolve:
zenml-pro.internal.mycompany.com → Your ALB/Ingress IP
zenml-workspace.internal.mycompany.com → Your ALB/Ingress IP
Always use a fully qualified domain name (FQDN) (e.g. https://zenml.ml.cluster). Do not use a simple DNS prefix for the servers (e.g. https://zenml.cluster is not recommended). This is especially relevant for the TLS certificates that you prepare for these endpoints. The TLS certificates will not be accepted by some browsers (e.g. Chrome) otherwise.
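The two example URLs above differ only in the number of DNS labels, so a quick pre-flight check can flag hostnames that are likely too short. The threshold of three labels is inferred from those examples and is an assumption, not a rule published by ZenML or by browser vendors.

```python
from urllib.parse import urlsplit


def looks_like_fqdn(url, min_labels=3):
    """Return True if the URL's hostname has at least `min_labels` DNS labels."""
    host = urlsplit(url).hostname or ""
    labels = [label for label in host.rstrip(".").split(".") if label]
    return len(labels) >= min_labels
```

With this check, `https://zenml.ml.cluster` passes while `https://zenml.cluster` is flagged.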
Step 12: Install Internal CA Certificate
If the TLS certificates used by the ZenML Pro services are signed by a custom Certificate Authority, you need to install the CA certificates on every machine that needs to access the ZenML server.
System-wide Installation
On all client machines that will access ZenML:
Obtain your internal CA certificate
Install it in the system certificate store:
Linux: Copy to /usr/local/share/ca-certificates/ and run update-ca-certificates
macOS: Use sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain <cert.pem>
Windows: Use certutil -addstore "Root" cert.pem
For some browsers (e.g., Chrome), updating the system's CA certificates is not enough. You will also need to import the CA certificates into the browser.
For Python/ZenML client:
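Python HTTP clients commonly locate CA bundles through the `SSL_CERT_FILE` environment variable (read by OpenSSL and therefore by Python's `ssl` module) and the `REQUESTS_CA_BUNDLE` variable (read by the `requests` library). The sketch below sets both; that the ZenML client picks them up, and the bundle path shown in the usage note, are assumptions to verify in your environment.

```python
import os


def trust_ca_bundle(ca_bundle_path, env=os.environ):
    """Point Python's TLS stack at a CA bundle that includes the internal CA.

    Must be called before the first HTTPS connection is made, because
    the variables are read when TLS contexts and sessions are created.
    """
    env["SSL_CERT_FILE"] = ca_bundle_path       # OpenSSL / Python ssl module
    env["REQUESTS_CA_BUNDLE"] = ca_bundle_path  # requests library
```

On Debian-based systems, `update-ca-certificates` regenerates `/etc/ssl/certs/ca-certificates.crt`, so `trust_ca_bundle("/etc/ssl/certs/ca-certificates.crt")` is a typical call there. Exporting the same two variables in the shell has the same effect.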
For Containerized Pipelines
When running containerized pipelines with ZenML, you'll need to install the CA certificates into the container images built by ZenML. Customize the build process via DockerSettings:
Create a custom Dockerfile:
Build and push the image to your internal registry:
Update your ZenML pipeline code to use the custom image:
Step 13: Verify the Deployment
Check Control Plane Health
Check Workspace Health
Access the Dashboard
Navigate to https://zenml-pro.internal.mycompany.com in your browser
Log in with admin credentials
Check Logs
Step 14: (Optional) Enable Snapshot Support / Workload Manager
Pipeline snapshots (running pipelines from the UI) require additional configuration.
Snapshots are only available from ZenML workspace server version 0.90.0 onwards.
Understanding Snapshot Sub-features
Snapshots come with optional sub-features that can be turned on or off:
Building runner container images: Running pipelines from the UI relies on Kubernetes jobs ("runner" jobs) that need container images with the correct Python packages. You can:
Reuse existing pipeline container images (requires Kubernetes cluster access to those registries)
Have ZenML build "runner" images and push to a configured registry
Use a single pre-built "runner" image for all runs
Store logs externally: By default, logs are extracted from runner job pods. Since pods may disappear, you can configure external log storage (currently only supported with the AWS implementation).
1. Create Kubernetes Resources for Workload Manager
Create a dedicated namespace and service account for runner jobs:
The service account needs permissions to build images and run jobs, including access to container images and any configured bucket for logs.
2. Choose Implementation
There are three available implementations:
Kubernetes: Runs pipelines in the same Kubernetes cluster as the ZenML Pro workspace server.
AWS: Extends the Kubernetes implementation to build and push images to AWS ECR and store logs in AWS S3.
GCP: Currently the same as the Kubernetes implementation, with plans to extend it with GCP GCR and GCS support.
Option A: Kubernetes Implementation (Simplest)
Use the built-in Kubernetes implementation for running snapshots:
Option B: AWS Implementation (Full Featured)
For AWS-specific features including external logs and ECR integration:
Option C: GCP Implementation
For GCP environments:
3. Configure Runner Image
Choose how runner images are managed:
Option A: Use Pre-built Runner Image (Simpler for Air-gap)
Pre-build your runner image and push to your internal registry. Note that this image needs to have all requirements installed to instantiate the stack that will be used for the template run.
Option B: Have ZenML Build Runner Images
Requires access to internal Docker registry with push permissions:
4. Environment Variable Reference
All supported environment variables for workload manager configuration:
ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE (required): Implementation class (see options above)
ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE (required): Kubernetes namespace for runner jobs
ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT (required): Kubernetes service account for runner jobs
ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE (optional): Whether to build runner images (default: false)
ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY (conditional): Registry for runner images (required if building images)
ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE (optional): Pre-built runner image (used if not building)
ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS (optional): Store logs externally (default: false, AWS only)
ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_RESOURCES (optional): Pod resources in JSON format
ZENML_KUBERNETES_WORKLOAD_MANAGER_TTL_SECONDS_AFTER_FINISHED (optional): Cleanup time for finished jobs (default: 2 days)
ZENML_KUBERNETES_WORKLOAD_MANAGER_NODE_SELECTOR (optional): Node selector in JSON format
ZENML_KUBERNETES_WORKLOAD_MANAGER_TOLERATIONS (optional): Tolerations in JSON format
ZENML_KUBERNETES_WORKLOAD_MANAGER_JOB_BACKOFF_LIMIT (optional): Backoff limit for builder/runner jobs
ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_FAILURE_POLICY (optional): Pod failure policy for builder/runner jobs
ZENML_SERVER_MAX_CONCURRENT_TEMPLATE_RUNS (optional): Max concurrent snapshot runs per pod (default: 2)
AWS-specific variables:
ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET (conditional): S3 bucket for logs (required if external logs enabled)
ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION (conditional): AWS region (required if building images)
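The conditional requirements in this reference can be encoded in a small helper that assembles the environment variables and fails early when a dependent value is missing. This is a sketch only: the function name and parameters are hypothetical, the implementation class string must be supplied by you, and rendering booleans as `"true"`/`"false"` is an assumption about the expected format.

```python
def workload_manager_env(
    implementation_source,
    namespace,
    service_account,
    build_runner_image=False,
    docker_registry=None,
    runner_image=None,
    enable_external_logs=False,
    aws_bucket=None,
    aws_region=None,
):
    """Assemble workload manager environment variables as a dict of strings."""
    k8s = "ZENML_KUBERNETES_WORKLOAD_MANAGER_"
    aws = "ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_"
    env = {
        "ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE": implementation_source,
        k8s + "NAMESPACE": namespace,
        k8s + "SERVICE_ACCOUNT": service_account,
        k8s + "BUILD_RUNNER_IMAGE": "true" if build_runner_image else "false",
    }
    if build_runner_image:
        # The registry is required whenever ZenML builds the runner images.
        if not docker_registry:
            raise ValueError("docker_registry is required when building runner images")
        env[k8s + "DOCKER_REGISTRY"] = docker_registry
    elif runner_image:
        # A pre-built image is only used when images are not built.
        env[k8s + "RUNNER_IMAGE"] = runner_image
    if enable_external_logs:
        # External logs need a bucket (AWS implementation only).
        if not aws_bucket:
            raise ValueError("aws_bucket is required when external logs are enabled")
        env[k8s + "ENABLE_EXTERNAL_LOGS"] = "true"
        env[aws + "BUCKET"] = aws_bucket
    if aws_region:
        env[aws + "REGION"] = aws_region
    return env
```

The resulting dictionary can be rendered into the environment section of the workspace server Helm values; the exact key for that section depends on the chart and is not shown here.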
5. Complete Configuration Examples
Minimal Kubernetes Configuration:
Full AWS Configuration:
Full GCP Configuration:
Air-gapped Configuration with Pre-built Runner:
6. Update Workspace Deployment
Update your workspace server Helm values with workload manager configuration and redeploy:
Step 15: Create Users and Organizations
In the ZenML Pro dashboard:
Create an organization
Create users for your team
Assign roles and permissions
Configure teams
Step 16: Access the Workspace from ZenML CLI
To login to the workspace with the ZenML CLI, you need to pass the custom ZenML Pro API URL:
Alternatively, you can set the ZENML_PRO_API_URL environment variable:
Network Requirements Summary
Web Access: Client Machines → Ingress Controller, port 443, inbound
API Access: ZenML Client → Workspace Server, port 443, inbound
Database: Kubernetes Pods → MySQL, port 3306, outbound
Registry: Kubernetes → Internal Registry, port 443, outbound
Inter-service: Kubernetes Internal → Kubernetes Services, port 443, internal
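A simple TCP probe can confirm that the paths listed above are open before any deployment is attempted. The sketch below uses only the standard library; it checks reachability at the TCP level and says nothing about TLS or application health.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

Run from a pod or node inside the cluster, `port_open("db-host", 3306)` exercises the database path, and `port_open("zenml-pro.internal.mycompany.com", 443)` from a client machine exercises the web access path. Both host names are the placeholders used earlier in this guide.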
Scaling & High Availability
Multiple Control Plane Replicas
Multiple Workspace Replicas
Database Replication
For HA, configure MySQL replication:
Set up a standby database
Configure binary log replication
Test failover procedures
Backup & Recovery
Automated Backups
Configure automated MySQL backups:
Frequency: Daily or more frequent
Retention: 30+ days
Location: Internal storage (not external)
Testing: Test restore procedures regularly
Backup Checklist
Database backups (automated)
Configuration backups (values.yaml files, versioned)
TLS certificates (secure storage)
Custom CA certificate (backup copy)
Helm chart versions (archived)
Recovery Procedure
Your documented recovery procedure should cover:
Database restoration steps
Helm redeployment steps
Data validation after restore
User communication plan
Monitoring & Logging
Internal Monitoring
Set up internal monitoring for:
CPU and memory usage
Pod restart count
Database connection count
Ingress error rates
Certificate expiration dates
Log Aggregation
Forward logs to your internal log aggregation system:
Application logs from ZenML pods
Ingress logs
Database logs
Kubernetes events
Alerting
Create alerts for:
Pod failures
High resource usage
Database connection errors
Certificate near expiration
Disk space warnings
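For the certificate expiration alert, the remaining lifetime can be computed from the `notAfter` value that `openssl x509 -enddate -noout` prints or that Python's `ssl.SSLSocket.getpeercert()` returns. This is a minimal sketch; the alert threshold mentioned in the usage note is an arbitrary example.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the number of days until a certificate's notAfter timestamp.

    `not_after` uses the OpenSSL text format, e.g. 'Jan  1 00:00:00 2030 GMT'.
    `now` is seconds since the epoch and defaults to the current time.
    """
    expiry = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expiry - current) / 86400
```

An alert rule could then fire when `days_until_expiry(not_after) < 30`; a negative result means the certificate has already expired.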
Maintenance
Regular Tasks
Monitor disk space (databases, artifact storage)
Review and manage user access
Update internal CA certificate before expiration
Test backup and recovery procedures
Monitor pod logs for warnings
Periodic Updates
When updating to a new ZenML version:
Pull new images on an internet-connected machine
Push to internal registry
Create new offline bundle with updated Helm charts
Transfer bundle to air-gapped environment
Update Helm charts in air-gapped environment
Update image tags in values.yaml
Perform helm upgrade on control plane
Perform helm upgrade on workspace servers
Verify health after upgrade
Update client images in your custom ZenML container
Troubleshooting
Pods Won't Start
Check pod logs and events:
Common issues:
Image pull failures (check registry access)
Database connectivity (verify connection string)
Certificate issues (verify CA is trusted)
Database Connection Failed
Can't Access via HTTPS
Verify certificate validity
Verify DNS resolution
Check Ingress status
Verify CA certificate is installed on client
Image Pull Errors
Verify images are in internal registry
Check registry credentials in secret
Verify imagePullSecrets configured correctly
Day 2 Operations
For information on upgrading ZenML Pro components, see the Upgrades & Updates guide.
Related Resources
Self-hosted Deployment Guide - Comprehensive deployment reference
Support
For air-gapped deployments, contact ZenML Support:
Email: [email protected]
Provide: Your offline bundle, deployment status, and any error logs
Request from ZenML Support:
Pre-deployment architecture consultation
Offline support packages
Update bundles and release notes
Security documentation (SBOM, vulnerability reports)