AWS ECS
Deploy ZenML Pro Hybrid on AWS ECS with a managed control plane.
This guide provides high-level instructions for deploying ZenML Pro in a Hybrid setup on AWS ECS (Elastic Container Service).
Architecture Overview
In this setup:
ZenML workspace runs in ECS tasks within your VPC
Load balancer handles HTTPS traffic and routes to ECS tasks
Database stores workspace metadata in AWS RDS
Secrets manager stores Pro credentials securely
NAT gateway enables outbound access to ZenML Cloud control plane
Prerequisites
Before starting, make sure you go through the general prerequisites for hybrid deployments and have collected the necessary artifacts and information. Particular requirements for AWS ECS deployments are listed below.
AWS Account with appropriate IAM permissions
Basic familiarity with AWS ECS, VPC, and RDS
Install the ZenML Pro Workspace Server
Step 1: Enroll the Workspace in the ZenML Pro Control Plane
Enroll the workspace in the ZenML Pro control plane by following the Enroll a Workspace in the ZenML Pro Control Plane guide, and collect the enrollment credentials it produces.
Step 2: Set Up AWS Infrastructure
VPC and Subnets
Create a VPC with:
Public subnets (at least 2 across different availability zones) - for the Application Load Balancer
Private subnets (at least 2 across different availability zones) - for ECS tasks and RDS
Security Groups
Create three security groups:
ALB Security Group
Inbound: HTTPS (443) and HTTP (80) from 0.0.0.0/0
Outbound: HTTP (8000) to the ECS security group
ECS Security Group
Inbound: HTTP (8000) from the ALB security group
Outbound: HTTPS (443) to 0.0.0.0/0 (for ZenML Cloud access)
Outbound: TCP (3306 for MySQL) to the RDS security group
RDS Security Group
Inbound: TCP (3306 for MySQL) from the ECS security group
Outbound: Not restricted
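The three groups form a chain (ALB to ECS to RDS), and every group-to-group outbound rule needs a matching inbound rule on the destination. As a rough illustration, the rules above can be written down as plain data and checked for that property. The group keys (`alb`, `ecs`, `rds`), the `Rule` type, and the `unmatched_egress` helper are hypothetical names used only in this sketch; they are not AWS identifiers.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    port: int
    peer: str  # a CIDR block, or the key of another group in this sketch


# Rules transcribed from the lists above; group keys are illustrative only.
SECURITY_GROUPS = {
    "alb": {
        "ingress": [Rule(443, "0.0.0.0/0"), Rule(80, "0.0.0.0/0")],
        "egress": [Rule(8000, "ecs")],
    },
    "ecs": {
        "ingress": [Rule(8000, "alb")],
        "egress": [Rule(443, "0.0.0.0/0"), Rule(3306, "rds")],
    },
    "rds": {
        "ingress": [Rule(3306, "ecs")],
        "egress": [],  # left unrestricted in this guide, so nothing to check
    },
}


def unmatched_egress(groups):
    """Return (source, rule) pairs whose destination group lacks a matching ingress rule."""
    missing = []
    for source, rules in groups.items():
        for rule in rules["egress"]:
            if rule.peer not in groups:
                continue  # destination is a CIDR block, not another group
            if Rule(rule.port, source) not in groups[rule.peer]["ingress"]:
                missing.append((source, rule))
    return missing


print(unmatched_egress(SECURITY_GROUPS))  # prints [] when every rule is paired
```

An empty list means each hop in the chain is allowed on both sides; a non-empty list points at the group that is missing an inbound rule.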
NAT Gateway
To enable ECS tasks to reach ZenML Cloud:
Create an Elastic IP in your AWS region
Create a NAT Gateway in one of your public subnets
Wait for the NAT Gateway to be available
Route Tables
For your private subnets (where ECS tasks run):
Create a route table
Add a default route (0.0.0.0/0) pointing to the NAT Gateway
Associate this route table with your private subnets
Step 3: Set Up RDS Database
Create an RDS database instance. Important: Workspace servers only support MySQL, not PostgreSQL.
Configuration:
DB Engine: MySQL 8.0+ (PostgreSQL is not supported for workspace servers)
Instance Class: db.t3.micro or larger depending on expected load
Storage: 100 GB initial (with automatic scaling enabled)
Multi-AZ: Enable for production deployments
VPC: Your ZenML VPC
Subnet Group: Create a DB subnet group with your private subnets
Security Group: RDS security group created above
Backups: 30 days retention minimum
Logs: Publish the error, general, and slow query logs to CloudWatch
After creation:
Note the database endpoint (hostname)
Create the initial database: zenml_hybrid
Create a database user with full permissions on the database
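The endpoint, database name, and user collected here are later combined into the ZENML_DATABASE_URL value used in the task definition. A password containing characters such as `@` or `/` would break a hand-written URL, so the sketch below percent-encodes the credentials. It assumes the server parses the URL the way common SQLAlchemy-style URL parsers do; the `build_database_url` helper, the hostname, and the credentials are placeholders for illustration.

```python
from urllib.parse import quote


def build_database_url(user, password, host, database="zenml_hybrid", port=3306):
    """Assemble a mysql:// URL, percent-encoding the user name and password."""
    return (
        f"mysql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values; substitute your own RDS endpoint and credentials.
url = build_database_url("zenml", "p@ss/word", "mydb.example.internal")
print(url)  # mysql://zenml:p%40ss%2Fword@mydb.example.internal:3306/zenml_hybrid
```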
Step 4: Store Secrets in AWS Secrets Manager
Store your Pro credentials securely:
OAuth2 Client Secret
Secret name: zenml/pro/oauth2-client-secret
Value: Your workspace enrollment key
(Optional) Database Password
Secret name: zenml/rds/password
Value: Your RDS database password
Note the ARN of your OAuth2 secret - you'll reference it in the task definition.
Step 5: Create ECS IAM Roles
Create two IAM roles:
Task Execution Role
This role allows ECS to pull images and manage logs:
Attach: AmazonECSTaskExecutionRolePolicy
Add inline policy for Secrets Manager access:
Action: secretsmanager:GetSecretValue
Resource: Your OAuth2 secret ARN
Action: logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
Resource: Your CloudWatch log group
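The inline policy described above can be written as an IAM policy document. The following is a minimal sketch that builds the JSON with the two statements listed; the `execution_role_inline_policy` helper, the account ID, the region, and the ARNs are placeholder assumptions, and scoping the log statement to both the log group and its streams (`:*`) is one reasonable choice rather than a requirement stated in this guide.

```python
import json


def execution_role_inline_policy(secret_arn, log_group_arn):
    """Build an IAM policy document with the two statements listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": secret_arn,
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                # The log group itself plus the log streams inside it.
                "Resource": [log_group_arn, f"{log_group_arn}:*"],
            },
        ],
    }


# Placeholder ARNs; replace the region, account ID, and names with your own.
policy = execution_role_inline_policy(
    "arn:aws:secretsmanager:eu-central-1:123456789012:secret:zenml/pro/oauth2-client-secret",
    "arn:aws:logs:eu-central-1:123456789012:log-group:/ecs/zenml-hybrid",
)
print(json.dumps(policy, indent=2))
```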
Task Role
This role is for application-level permissions (optional for basic setup):
Leave empty for now, or add policies if your tasks need to access other AWS services
Step 6: Create ECS Task Definition
In the AWS Console or using AWS CLI/Terraform, create a task definition with:
Task Configuration:
Compatibility: FARGATE
CPU: 512 (0.5 vCPU)
Memory: 1024 MB
Network Mode: awsvpc
Execution Role: Task execution role created above
Task Role: Task role created above
Container Configuration:
Image: 715803424590.dkr.ecr.eu-central-1.amazonaws.com/zenml-pro-server:<ZENML_OSS_VERSION>
Port Mapping: Container port 8000 to port 8000
Essential: Yes
Environment Variables:
Set these in the task definition:
ZENML_SERVER_DEPLOYMENT_TYPE: cloud
ZENML_SERVER_PRO_API_URL: https://cloudapi.zenml.io
ZENML_SERVER_PRO_DASHBOARD_URL: https://cloud.zenml.io
ZENML_SERVER_PRO_ORGANIZATION_ID: Your organization ID from enrollment
ZENML_SERVER_PRO_ORGANIZATION_NAME: Your organization name from enrollment
ZENML_SERVER_PRO_WORKSPACE_ID: Your workspace ID from enrollment
ZENML_SERVER_PRO_WORKSPACE_NAME: Your workspace name from enrollment
ZENML_SERVER_PRO_OAUTH2_AUDIENCE: https://cloudapi.zenml.io
ZENML_SERVER_SERVER_URL: https://zenml.mycompany.com
ZENML_DATABASE_URL: mysql://user:password@hostname:3306/zenml_hybrid (MySQL only - PostgreSQL not supported)
ZENML_SERVER_HOSTNAME: 0.0.0.0
ZENML_SERVER_PORT: 8000
ZENML_LOGGING_LEVEL: INFO
Secrets:
Reference your secret from Secrets Manager:
ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRET: arn:aws:secretsmanager:region:account:secret:zenml/pro/oauth2-client-secret
Logging:
Configure CloudWatch logs:
Log Group: /ecs/zenml-hybrid
Log Stream Prefix: ecs
Region: Your AWS region
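Taken together, the task, container, environment, secret, and logging settings in this step correspond to a single task definition document. The sketch below assembles one as a Python dict in the shape the ECS RegisterTaskDefinition request expects. The role ARNs, account ID, and region are placeholders, only a few of the environment variables are filled in for brevity, and the `build_task_definition` helper is a hypothetical name for this example.

```python
import json

IMAGE_REPO = "715803424590.dkr.ecr.eu-central-1.amazonaws.com/zenml-pro-server"


def build_task_definition(image_tag, region, execution_role_arn, task_role_arn,
                          oauth2_secret_arn, environment):
    """Return a dict shaped like an ECS RegisterTaskDefinition request."""
    container = {
        "name": "zenml-server",
        "image": f"{IMAGE_REPO}:{image_tag}",
        "essential": True,
        "portMappings": [
            {"containerPort": 8000, "hostPort": 8000, "protocol": "tcp"}
        ],
        # ECS expects environment values as strings.
        "environment": [
            {"name": name, "value": str(value)}
            for name, value in environment.items()
        ],
        # The secret is injected at start-up and never stored in plain text.
        "secrets": [
            {
                "name": "ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRET",
                "valueFrom": oauth2_secret_arn,
            }
        ],
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "/ecs/zenml-hybrid",
                "awslogs-region": region,
                "awslogs-stream-prefix": "ecs",
            },
        },
    }
    return {
        "family": "zenml-hybrid",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": "512",
        "memory": "1024",
        "executionRoleArn": execution_role_arn,
        "taskRoleArn": task_role_arn,
        "containerDefinitions": [container],
    }


# Placeholder values; add the remaining variables from the list above.
task_definition = build_task_definition(
    image_tag="<ZENML_OSS_VERSION>",
    region="eu-central-1",
    execution_role_arn="arn:aws:iam::123456789012:role/zenml-task-execution",
    task_role_arn="arn:aws:iam::123456789012:role/zenml-task",
    oauth2_secret_arn=(
        "arn:aws:secretsmanager:eu-central-1:123456789012:"
        "secret:zenml/pro/oauth2-client-secret"
    ),
    environment={
        "ZENML_SERVER_DEPLOYMENT_TYPE": "cloud",
        "ZENML_SERVER_PORT": 8000,
        "ZENML_LOGGING_LEVEL": "INFO",
    },
)
print(json.dumps(task_definition, indent=2))
```

If you use the AWS CLI, the printed JSON can be saved to a file and passed to aws ecs register-task-definition with the --cli-input-json option.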
Step 7: Create ECS Cluster and Service
Create an ECS cluster named zenml-hybrid.
Then create an ECS service within this cluster:
Service Configuration:
Cluster: zenml-hybrid
Task Definition: zenml-hybrid (latest revision)
Launch Type: FARGATE
Desired Count: 1 (or more for high availability)
Platform Version: LATEST
Network Configuration:
VPC: Your ZenML VPC
Subnets: Your private subnets
Security Group: ECS security group
Public IP: Disabled (tasks don't need public IPs)
Load Balancing:
Load Balancer Type: Application Load Balancer
Container: zenml-server
Container Port: 8000
(Leave the target group selection for the next step)
Step 8: Set Up Application Load Balancer
Create an Application Load Balancer (ALB):
Configuration:
Subnets: Your public subnets
Security Group: ALB security group
Target Group
Create a target group for your ECS service:
Health Check Configuration:
Protocol: HTTP
Path: /health
Port: 8000
Interval: 30 seconds
Timeout: 5 seconds
Healthy Threshold: 2
Unhealthy Threshold: 3
Listeners
Create two listeners on your ALB:
HTTPS Listener (Port 443)
Certificate: Your TLS certificate from ACM or imported
Default Action: Forward to your target group
HTTP Listener (Port 80)
Default Action: Redirect to HTTPS (port 443)
Step 9: Configure DNS
In your DNS provider (Route 53 or external):
Create an A record (or CNAME) pointing to your ALB's DNS name
Name: zenml.mycompany.com
Target: Your ALB's DNS name (ALB IP addresses can change, so do not point the record at an IP)
Type: Alias A record in Route 53, or CNAME with an external DNS provider
Allow time for DNS propagation (typically 5-15 minutes)
Step 10: Verify the Deployment
Check ECS Service Status
Go to ECS console → Clusters → zenml-hybrid → Services
Verify the service shows "Active"
Check that desired and running task counts match
Check Task Logs
Go to CloudWatch → Log Groups → /ecs/zenml-hybrid
View the log stream to look for startup messages
Verify no critical errors appear
Test HTTPS Access
Visit https://zenml.mycompany.com in your browser
You should see the ZenML Pro login, which redirects to cloud.zenml.io
Verify Control Plane Connection
In CloudWatch logs, look for messages indicating successful connection to ZenML Cloud
Check for any authentication or SSL errors
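The HTTPS check can also be scripted. The sketch below requests the /health path (the same path the ALB health check uses) through the public URL and reports whether it answers with HTTP 200. It assumes the ALB forwards /health unchanged; the `check_health` helper and its `opener` parameter are hypothetical names, the latter added only so the function can be exercised without network access.

```python
import urllib.request


def check_health(base_url, opener=urllib.request.urlopen, timeout=10):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses, so DNS failures,
        # TLS errors, timeouts, and non-2xx responses all end up here.
        return False
```

For example, `check_health("https://zenml.mycompany.com")` should return True once DNS has propagated and the target group reports healthy tasks.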
Network & Firewall Requirements
Outbound Access to ZenML Cloud
Your ECS tasks need HTTPS (port 443) outbound access to:
cloudapi.zenml.io - For control plane authentication
This is enabled by the NAT Gateway and ECS security group configuration.
Inbound Access from Clients
Clients need HTTPS (port 443) inbound access to:
zenml.mycompany.com - Your ALB endpoint
This is enabled by the ALB and ALB security group configuration.
Database Access
ECS tasks need TCP access to:
Your RDS instance on port 3306 (MySQL)
This is enabled by the ECS security group egress rule and RDS security group ingress rule.
Scaling & High Availability
Multiple Tasks
For high availability:
Update the ECS service's desired count to 2 or more
ECS will distribute tasks across availability zones
The ALB automatically distributes traffic to all healthy tasks
Auto Scaling (Optional)
To automatically scale based on CPU or memory usage:
Register a scalable target (your ECS service)
Create a target tracking scaling policy
Set target CPU utilization (e.g., 70%)
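The three steps above map onto two Application Auto Scaling requests: one to register the scalable target and one to attach a target tracking policy. The sketch below builds both request payloads as plain dicts. The service name, the policy name, the minimum and maximum task counts, and the `cpu_target_tracking` helper are assumptions for illustration.

```python
def cpu_target_tracking(cluster, service, target_cpu=70.0, min_tasks=1, max_tasks=4):
    """Build RegisterScalableTarget and PutScalingPolicy payloads for an ECS service."""
    common = {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
    }
    scalable_target = {**common, "MinCapacity": min_tasks, "MaxCapacity": max_tasks}
    scaling_policy = {
        **common,
        "PolicyName": "zenml-cpu-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
        },
    }
    return scalable_target, scaling_policy


# Cluster name from this guide; the service name is an assumption.
target, policy = cpu_target_tracking("zenml-hybrid", "zenml-server")
print(target["ResourceId"])  # service/zenml-hybrid/zenml-server
```

The dict keys follow the parameter names of the Application Auto Scaling API, so they can be passed as keyword arguments to an SDK client or translated into the equivalent CLI or Terraform settings.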
Monitoring & Logging
CloudWatch Logs
Monitor your deployment:
Go to CloudWatch → Log Groups → /ecs/zenml-hybrid
Set up log filters to find errors: filter for ERROR or CRITICAL
Create metric filters if needed
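In the CloudWatch console, a filter pattern such as ?ERROR ?CRITICAL matches events containing either term. When working with exported or downloaded log lines, the same filtering can be done locally. The sketch below is a minimal example; the `error_lines` helper is a hypothetical name and the sample lines are invented for illustration, not actual ZenML server output.

```python
import re

# Match ERROR or CRITICAL as whole words anywhere in the line.
_LEVEL = re.compile(r"\b(ERROR|CRITICAL)\b")


def error_lines(lines):
    """Return only the lines that mention ERROR or CRITICAL."""
    return [line for line in lines if _LEVEL.search(line)]


# Invented sample lines, used only to demonstrate the filter.
sample = [
    "2024-01-01 12:00:00 INFO server started",
    "2024-01-01 12:00:05 ERROR database connection refused",
    "2024-01-01 12:00:09 CRITICAL oauth2 token request failed",
]
for line in error_lines(sample):
    print(line)
```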
CloudWatch Alarms
Create alarms for:
High CPU Utilization: Alert when average CPU > 80%
Failed Tasks: Alert when tasks exit unexpectedly
Unhealthy Targets: Alert when ALB marks tasks as unhealthy
Application Logs
For production deployments:
Forward CloudWatch logs to your centralized logging system (ELK, Datadog, etc.)
Set up alerts for authentication failures to ZenML Cloud
Monitor database connection errors
Database Maintenance
Backups
Automated backups are configured in Step 3, but you should still:
Verify backup retention is set to at least 30 days
Test backup restoration periodically
Store backups in a different region for disaster recovery
Monitoring
Monitor database health:
Check RDS Performance Insights for slow queries
Review CloudWatch metrics for connection count and CPU
Monitor free storage space and create alerts
Troubleshooting
Task Won't Start
Check ECS task logs in CloudWatch:
Go to the /ecs/zenml-hybrid log group
Look for error messages about image pull failures or environment variable issues
Verify IAM execution role has correct permissions
Database Connection Failed
Verify database is running and accessible
Check ECS security group allows outbound to RDS security group
Verify ZENML_DATABASE_URL has correct hostname, port, and credentials
Test connectivity from an ECS task using a MySQL client
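If no MySQL client is available in the environment you are testing from, a plain TCP check can at least confirm that the security groups and routing allow a connection to port 3306. The sketch below does only that; it does not authenticate or speak the MySQL protocol, so it is a weaker test than a real client login. The `can_connect` helper is a hypothetical name.

```python
import socket


def can_connect(host, port=3306, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Run from inside the VPC, for example `can_connect("<your-rds-endpoint>")`, a False result points at security groups, routing, or DNS rather than at credentials.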
Can't Reach Server via HTTPS
Verify ALB is in "Active" state
Check ALB target group - tasks should show "Healthy"
Verify TLS certificate is valid for your domain
Check DNS resolution: nslookup zenml.mycompany.com
Control Plane Connection Issues
Check CloudWatch logs for:
OAuth2 authentication errors - verify ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRET is correct
Network connectivity errors - verify NAT Gateway is operational
Certificate validation errors - verify outbound HTTPS to cloudapi.zenml.io works
Updating the Deployment
Update Configuration
Modify environment variables in the task definition
Create a new task definition revision
Update the ECS service to use the new task definition
ECS will gradually replace old tasks with new ones
Upgrade ZenML Version
Update the container image in the task definition
Create a new task definition revision
Update the ECS service
Monitor CloudWatch logs during the update
Cleanup
To remove the deployment:
Delete ECS Service
Go to ECS → Clusters → zenml-hybrid → Services
Set the desired count to 0 first
Delete the zenml-server service
Delete ECS Cluster
Delete the cluster once service is removed
Delete ALB
Go to EC2 → Load Balancers
Delete the ALB and associated target groups
Delete RDS Instance
Go to RDS → Databases
Delete the zenml-hybrid-db instance
Skip final snapshot if you don't need a backup
Delete VPC and Related Resources
Delete the NAT Gateway, then release its Elastic IP (deleting the gateway disassociates the address but does not release it)
Delete subnets, route tables, security groups
Delete VPC
Clean Up Secrets
Go to Secrets Manager
Delete zenml/pro/oauth2-client-secret