AWS ECS
Deploy ZenML Pro Hybrid on AWS ECS with a managed control plane.
This guide provides high-level instructions for deploying ZenML Pro in a Hybrid setup on AWS ECS (Elastic Container Service).
Architecture Overview
In this setup:
ZenML workspace runs in ECS tasks within your VPC
Load balancer handles HTTPS traffic and routes to ECS tasks
Database stores workspace metadata in AWS RDS
Secrets manager stores Pro credentials securely
NAT gateway enables outbound access to ZenML Cloud control plane
Prerequisites
Before starting, complete the setup described in Hybrid Deployment Overview:
Step 1: Set up ZenML Pro organization
Step 2: Configure your infrastructure (database, networking, TLS)
Step 3: Obtain Pro credentials from ZenML Support
You'll also need:
AWS Account with appropriate IAM permissions
Basic familiarity with AWS ECS, VPC, and RDS
Step 1: Set Up AWS Infrastructure
VPC and Subnets
Create a VPC with:
Public subnets (at least 2 across different availability zones) - for the Application Load Balancer
Private subnets (at least 2 across different availability zones) - for ECS tasks and RDS
Security Groups
Create three security groups:
ALB Security Group
Inbound: HTTPS (443) and HTTP (80) from
0.0.0.0/0Outbound: HTTP (8000) to the ECS security group
ECS Security Group
Inbound: HTTP (8000) from the ALB security group
Outbound: HTTPS (443) to
0.0.0.0/0(for ZenML Cloud access)Outbound: TCP (3306 for MySQL) to the RDS security group
RDS Security Group
Inbound: TCP (3306 for MySQL) from the ECS security group
Outbound: Not restricted
NAT Gateway
To enable ECS tasks to reach ZenML Cloud:
Create an Elastic IP in your AWS region
Create a NAT Gateway in one of your public subnets
Wait for the NAT Gateway to be available
Route Tables
For your private subnets (where ECS tasks run):
Create a route table
Add a default route (
0.0.0.0/0) pointing to the NAT GatewayAssociate this route table with your private subnets
Step 2: Set Up RDS Database
Create an RDS database instance. Important: Workspace servers only support MySQL, not PostgreSQL.
Configuration:
DB Engine: MySQL 8.0+ (PostgreSQL is not supported for workspace servers)
Instance Class:
db.t3.microor larger depending on expected loadStorage: 100 GB initial (with automatic scaling enabled)
Multi-AZ: Enable for production deployments
VPC: Your ZenML VPC
Subnet Group: Create a DB subnet group with your private subnets
Security Group: RDS security group created above
Backups: 30 days retention minimum
Logs: Enable error, general, and slowquery logs to CloudWatch
After creation:
Note the database endpoint (hostname)
Create the initial database:
zenml_hybridCreate a database user with full permissions on the database
Step 3: Store Secrets in AWS Secrets Manager
Store your Pro credentials securely:
OAuth2 Client Secret
Secret name:
zenml/pro/oauth2-client-secretValue: Your
ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRETfrom ZenML
(Optional) Database Password
Secret name:
zenml/rds/passwordValue: Your RDS database password
Note the ARN of your OAuth2 secret - you'll reference it in the task definition.
Step 4: Create ECS IAM Roles
Create two IAM roles:
Task Execution Role
This role allows ECS to pull images and manage logs:
Attach:
AmazonECSTaskExecutionRolePolicyAdd inline policy for Secrets Manager access:
Action:
secretsmanager:GetSecretValueResource: Your OAuth2 secret ARN
Action:
logs:CreateLogGroup,logs:CreateLogStream,logs:PutLogEventsResource: Your CloudWatch log group
Task Role
This role is for application-level permissions (optional for basic setup):
Leave empty for now, or add policies if your tasks need to access other AWS services
Step 5: Create ECS Task Definition
In the AWS Console or using AWS CLI/Terraform, create a task definition with:
Task Configuration:
Compatibility: FARGATE
CPU: 512 (0.5 vCPU)
Memory: 1024 MB
Network Mode: awsvpc
Execution Role: Task execution role created above
Task Role: Task role created above
Container Configuration:
Image:
715803424590.dkr.ecr.eu-central-1.amazonaws.com/zenml-pro-server:<ZENML_OSS_VERSION>Port Mapping: Container port 8000 to port 8000
Essential: Yes
Environment Variables:
Set these in the task definition:
ZENML_SERVER_DEPLOYMENT_TYPE
cloud
ZENML_SERVER_PRO_API_URL
https://cloudapi.zenml.io
ZENML_SERVER_PRO_DASHBOARD_URL
https://cloud.zenml.io
ZENML_SERVER_PRO_ORGANIZATION_ID
Your organization ID from Step 1
ZENML_SERVER_PRO_ORGANIZATION_NAME
Your organization name from Step 1
ZENML_SERVER_PRO_WORKSPACE_ID
From ZenML Support
ZENML_SERVER_PRO_WORKSPACE_NAME
Your workspace name
ZENML_SERVER_PRO_OAUTH2_AUDIENCE
https://cloudapi.zenml.io
ZENML_SERVER_SERVER_URL
https://zenml.mycompany.com
ZENML_DATABASE_URL
mysql://user:password@hostname:3306/zenml_hybrid (MySQL only - PostgreSQL not supported)
ZENML_SERVER_HOSTNAME
0.0.0.0
ZENML_SERVER_PORT
8000
ZENML_LOGGING_LEVEL
INFO
Secrets:
Reference your secret from Secrets Manager:
ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRET
arn:aws:secretsmanager:region:account:secret:zenml/pro/oauth2-client-secret
Logging:
Configure CloudWatch logs:
Log Group:
/ecs/zenml-hybridLog Stream Prefix:
ecsRegion: Your AWS region
Step 6: Create ECS Cluster and Service
Create an ECS cluster named zenml-hybrid.
Then create an ECS service within this cluster:
Service Configuration:
Cluster: zenml-hybrid
Task Definition: zenml-hybrid (latest version)
Launch Type: FARGATE
Desired Count: 1 (or more for high availability)
Platform Version: LATEST
Network Configuration:
VPC: Your ZenML VPC
Subnets: Your private subnets
Security Group: ECS security group
Public IP: Disabled (tasks don't need public IPs)
Load Balancing:
Load Balancer Type: Application Load Balancer
Container: zenml-server
Container Port: 8000
(Leave the target group selection for the next step)
Step 7: Set Up Application Load Balancer
Create an Application Load Balancer (ALB):
Configuration:
Subnets: Your public subnets
Security Group: ALB security group
Target Group
Create a target group for your ECS service:
Health Check Configuration:
Protocol: HTTP
Path:
/healthPort: 8000
Interval: 30 seconds
Timeout: 5 seconds
Healthy Threshold: 2
Unhealthy Threshold: 3
Listeners
Create two listeners on your ALB:
HTTPS Listener (Port 443)
Certificate: Your TLS certificate from ACM or imported
Default Action: Forward to your target group
HTTP Listener (Port 80)
Default Action: Redirect to HTTPS (port 443)
Step 8: Configure DNS
In your DNS provider (Route 53 or external):
Create an A record (or CNAME) pointing to your ALB's DNS name
Name:
zenml.mycompany.comTarget: Your ALB's DNS name or IP
Type: A record (use Alias if in Route 53)
Allow time for DNS propagation (typically 5-15 minutes)
Step 9: Verify the Deployment
Check ECS Service Status
Go to ECS console → Clusters → zenml-hybrid → Services
Verify the service shows "Active"
Check that desired and running task counts match
Check Task Logs
Go to CloudWatch → Log Groups →
/ecs/zenml-hybridView log stream to look for startup messages
Verify no critical errors appear
Test HTTPS Access
Visit
https://zenml.mycompany.comin your browserYou should see ZenML Pro login redirecting to cloud.zenml.io
Verify Control Plane Connection
In CloudWatch logs, look for messages indicating successful connection to ZenML Cloud
Check for any authentication or SSL errors
Network & Firewall Requirements
Outbound Access to ZenML Cloud
Your ECS tasks need HTTPS (port 443) outbound access to:
cloudapi.zenml.io- For control plane authentication
This is enabled by the NAT Gateway and ECS security group configuration.
Inbound Access from Clients
Clients need HTTPS (port 443) inbound access to:
zenml.mycompany.com- Your ALB endpoint
This is enabled by the ALB and ALB security group configuration.
Database Access
ECS tasks need TCP access to:
Your RDS instance on port 3306 (MySQL)
This is enabled by the ECS security group egress rule and RDS security group ingress rule.
Scaling & High Availability
Multiple Tasks
For high availability:
Update the ECS service's desired count to 2 or more
ECS will distribute tasks across availability zones
The ALB automatically distributes traffic to all healthy tasks
Auto Scaling (Optional)
To automatically scale based on CPU or memory usage:
Register a scalable target (your ECS service)
Create a target tracking scaling policy
Set target CPU utilization (e.g., 70%)
Monitoring & Logging
CloudWatch Logs
Monitor your deployment:
Go to CloudWatch → Log Groups →
/ecs/zenml-hybridSet up log filters to find errors: filter for
ERRORorCRITICALCreate metric filters if needed
CloudWatch Alarms
Create alarms for:
High CPU Utilization: Alert when average CPU > 80%
Failed Tasks: Alert when tasks exit unexpectedly
Unhealthy Targets: Alert when ALB marks tasks as unhealthy
Application Logs
For production deployments:
Forward CloudWatch logs to your centralized logging system (ELK, Datadog, etc.)
Set up alerts for authentication failures to ZenML Cloud
Monitor database connection errors
Database Maintenance
Backups
Automated backups are configured, but:
Verify backup retention is set to at least 30 days
Test backup restoration periodically
Store backups in a different region for disaster recovery
Monitoring
Monitor database health:
Check RDS Performance Insights for slow queries
Review CloudWatch metrics for connection count and CPU
Monitor free storage space and create alerts
(Optional) Enable Snapshot Support / Workload Manager
Pipeline snapshots (running pipelines from the UI) require a workload manager. For ECS deployments, you'll typically use the AWS Kubernetes implementation if you also have a Kubernetes cluster available, or configure settings as appropriate for your infrastructure.
Prerequisites for Workload Manager
To enable snapshots on ECS-deployed ZenML workspaces:
Kubernetes Cluster Access - You'll need a Kubernetes cluster where the workload manager can run jobs. This could be:
The same EKS cluster as your other infrastructure
A separate EKS cluster dedicated to workloads
Another Kubernetes distribution in your environment
Container Registry Access - The workload manager needs access to your container registry to:
Pull base ZenML images
Push/pull runner images (if building them)
Storage Access - For AWS implementation:
S3 bucket for logs storage
IAM permissions to read/write to the bucket
Configuration Options
Option A: AWS Kubernetes Workload Manager (Recommended for ECS)
If you have an EKS cluster or other Kubernetes cluster available:
Create a dedicated namespace:
Add these environment variables to your ECS task definition:
VariableValueZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCEzenml_cloud_plugins.aws_kubernetes_workload_manager.AWSKubernetesWorkloadManagerZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACEzenml-workload-managerZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNTzenml-runnerZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGEtrueZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRYYour ECR registry URI
ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGStrueZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKETYour S3 bucket for logs
ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGIONYour AWS region
ZENML_SERVER_MAX_CONCURRENT_TEMPLATE_RUNS2(or higher)ZENML_KUBERNETES_WORKLOAD_MANAGER_POD_RESOURCES{"requests": {"cpu": "500m", "memory": "512Mi"}, "limits": {"cpu": "2000m", "memory": "2Gi"}}Ensure the ECS task has permissions to access:
The Kubernetes cluster (kubeconfig/IAM role)
Your ECR registry
Your S3 bucket for logs
Option B: Kubernetes-based (Simpler Alternative)
If you prefer a basic setup without AWS-specific features:
Add these environment variables to your ECS task definition:
ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE
zenml_cloud_plugins.kubernetes_workload_manager.KubernetesWorkloadManager
ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE
zenml-workload-manager
ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT
zenml-runner
ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE
Your prebuilt ZenML image URI
Updating Task Definition
After configuring the workload manager environment variables:
Create a new task definition revision with the updated environment variables
Update your ECS service to use the new task definition
ECS will gradually replace running tasks with the new version
Monitor CloudWatch logs to verify the workload manager is operational
Troubleshooting
Task Won't Start
Check ECS task logs in CloudWatch:
Go to
/ecs/zenml-hybridlog groupLook for error messages about image pull failures or environment variable issues
Verify IAM execution role has correct permissions
Database Connection Failed
Verify database is running and accessible
Check ECS security group allows outbound to RDS security group
Verify
ZENML_DATABASE_URLhas correct hostname, port, and credentialsTest connectivity from an ECS task using a MySQL client
Can't Reach Server via HTTPS
Verify ALB is in "Active" state
Check ALB target group - tasks should show "Healthy"
Verify TLS certificate is valid for your domain
Check DNS resolution:
nslookup zenml.mycompany.com
Control Plane Connection Issues
Check CloudWatch logs for:
OAuth2 authentication errors - verify
ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRETis correctNetwork connectivity errors - verify NAT Gateway is operational
Certificate validation errors - verify outbound HTTPS to cloudapi.zenml.io works
Updating the Deployment
Update Configuration
Modify environment variables in the task definition
Create a new task definition revision
Update the ECS service to use the new task definition
ECS will gradually replace old tasks with new ones
Upgrade ZenML Version
Update the container image in the task definition
Create a new task definition revision
Update the ECS service
Monitor CloudWatch logs during the update
Cleanup
To remove the deployment:
Delete ECS Service
Go to ECS → Clusters → zenml-hybrid → Services
Delete the zenml-server service
Set desired count to 0 first
Delete ECS Cluster
Delete the cluster once service is removed
Delete ALB
Go to EC2 → Load Balancers
Delete the ALB and associated target groups
Delete RDS Instance
Go to RDS → Databases
Delete the zenml-hybrid-db instance
Skip final snapshot if you don't need a backup
Delete VPC and Related Resources
Delete NAT Gateway (releases Elastic IP)
Delete subnets, route tables, security groups
Delete VPC
Clean Up Secrets
Go to Secrets Manager
Delete zenml/pro/oauth2-client-secret
Next Steps
Related Documentation
Last updated
Was this helpful?