# AWS ECS

This guide provides high-level instructions for deploying ZenML Pro in a Hybrid setup on AWS ECS (Elastic Container Service).

## Architecture Overview

In this setup:

* **ZenML workspace** runs in ECS tasks within your VPC
* **Load balancer** handles HTTPS traffic and routes to ECS tasks
* **Database** stores workspace metadata in AWS RDS
* **Secrets manager** stores Pro credentials securely
* **NAT gateway** enables outbound access to the ZenML Cloud control plane

## Prerequisites

Before starting, make sure you go through the [general prerequisites for hybrid deployments](https://docs.zenml.io/pro/deployments/deploy-details/deploy-prerequisites) and have collected the necessary artifacts and information. Requirements specific to AWS ECS deployments are listed below.

* AWS Account with appropriate IAM permissions
* Basic familiarity with AWS ECS, VPC, and RDS

## Install the ZenML Pro Workspace Server

### Step 1: Enroll the Workspace in the ZenML Pro Control Plane

Enroll the workspace in the ZenML Pro control plane by following the [Enroll a Workspace in the ZenML Pro Control Plane](https://docs.zenml.io/pro/deployments/deploy-details/workspace-server/enroll-workspace) guide, and collect the enrollment credentials it produces.

### Step 2: Set Up AWS Infrastructure

#### VPC and Subnets

Create a VPC with:

* **Public subnets** (at least 2 across different availability zones) - for the Application Load Balancer
* **Private subnets** (at least 2 across different availability zones) - for ECS tasks and RDS
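
As a planning aid, the subnet layout above can be sketched with Python's standard `ipaddress` module. The VPC CIDR block, the `/24` subnet size, and the availability zone names are assumptions chosen for illustration; substitute your own values.

```python
import ipaddress

# Hypothetical VPC range; substitute the CIDR block you actually use.
VPC_CIDR = "10.0.0.0/16"
# Hypothetical availability zones; use two that exist in your region.
AVAILABILITY_ZONES = ["eu-central-1a", "eu-central-1b"]


def plan_subnets(vpc_cidr, zones, new_prefix=24):
    """Assign one public and one private subnet CIDR to each zone."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = []
    for tier in ("public", "private"):
        for zone in zones:
            plan.append({"tier": tier, "zone": zone, "cidr": str(next(blocks))})
    return plan


subnet_plan = plan_subnets(VPC_CIDR, AVAILABILITY_ZONES)
for subnet in subnet_plan:
    print(subnet["tier"], subnet["zone"], subnet["cidr"])
```

This only computes non-overlapping ranges; creating the subnets themselves still happens in the console, the AWS CLI, or your infrastructure-as-code tool.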

#### Security Groups

Create three security groups:

1. **ALB Security Group**
   * Inbound: HTTPS (443) and HTTP (80) from `0.0.0.0/0`
   * Outbound: HTTP (8000) to the ECS security group
2. **ECS Security Group**
   * Inbound: HTTP (8000) from the ALB security group
   * Outbound: HTTPS (443) to `0.0.0.0/0` (for ZenML Cloud access)
   * Outbound: TCP (3306 for MySQL) to the RDS security group
3. **RDS Security Group**
   * Inbound: TCP (3306 for MySQL) from the ECS security group
   * Outbound: Not restricted
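
The rules above can also be written down as data. The sketch below builds them in the `IpPermissions` shape used by the EC2 security group APIs (for example, as the `IpPermissions` argument when authorizing ingress or egress rules with boto3). The security group IDs are placeholders, and the helper name `tcp_rule` is hypothetical.

```python
ANYWHERE = "0.0.0.0/0"

# Placeholder IDs; replace with the IDs of the groups you created.
ALB_SG = "sg-alb-placeholder"
ECS_SG = "sg-ecs-placeholder"
RDS_SG = "sg-rds-placeholder"


def tcp_rule(port, cidr=None, peer_group=None):
    """Build one IpPermissions entry for a single TCP port."""
    rule = {"IpProtocol": "tcp", "FromPort": port, "ToPort": port}
    if cidr is not None:
        rule["IpRanges"] = [{"CidrIp": cidr}]
    if peer_group is not None:
        rule["UserIdGroupPairs"] = [{"GroupId": peer_group}]
    return rule


security_group_rules = {
    ALB_SG: {
        "ingress": [tcp_rule(443, cidr=ANYWHERE), tcp_rule(80, cidr=ANYWHERE)],
        "egress": [tcp_rule(8000, peer_group=ECS_SG)],
    },
    ECS_SG: {
        "ingress": [tcp_rule(8000, peer_group=ALB_SG)],
        "egress": [tcp_rule(443, cidr=ANYWHERE), tcp_rule(3306, peer_group=RDS_SG)],
    },
    # RDS outbound traffic is left unrestricted, so only ingress is listed.
    RDS_SG: {"ingress": [tcp_rule(3306, peer_group=ECS_SG)]},
}
```

Referencing peer security groups instead of CIDR ranges keeps the rules valid when task or database IP addresses change.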

#### NAT Gateway

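To enable ECS tasks in the private subnets to reach ZenML Cloud:

1. Allocate an Elastic IP in your AWS region
2. Create a NAT Gateway in one of your public subnets and associate the Elastic IP with it (the public subnet itself needs a route to an Internet Gateway)
3. Wait for the NAT Gateway to become available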

#### Route Tables

For your private subnets (where ECS tasks run):

1. Create a route table
2. Add a default route (`0.0.0.0/0`) pointing to the NAT Gateway
3. Associate this route table with your private subnets

### Step 3: Set Up RDS Database

Create an RDS database instance. **Important**: Workspace servers only support MySQL, not PostgreSQL.

**Configuration:**

* **DB Engine**: MySQL 8.0+ (PostgreSQL is not supported for workspace servers)
* **Instance Class**: `db.t3.micro` or larger depending on expected load
* **Storage**: 100 GB initial (with automatic scaling enabled)
* **Multi-AZ**: Enable for production deployments
* **VPC**: Your ZenML VPC
* **Subnet Group**: Create a DB subnet group with your private subnets
* **Security Group**: RDS security group created above
* **Backups**: 30 days retention minimum
* **Logs**: Enable error, general, and slow query logs to CloudWatch

**After creation:**

1. Note the database endpoint (hostname)
2. Create the initial database: `zenml_hybrid`
3. Create a database user with full permissions on the database
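
The endpoint, database name, and user collected here later form the `ZENML_DATABASE_URL` value. A minimal sketch of assembling it follows, assuming the URL is parsed with standard percent-decoding rules so that special characters in the credentials must be escaped. The helper name, user, password, and hostname are hypothetical.

```python
from urllib.parse import quote


def build_database_url(user, password, hostname, database="zenml_hybrid", port=3306):
    """Assemble a MySQL connection URL with percent-encoded credentials."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"mysql://{safe_user}:{safe_password}@{hostname}:{port}/{database}"


# Hypothetical credentials and hostname, for illustration only.
example_url = build_database_url("zenml", "p@ss/w:rd", "zenml-db.example.com")
print(example_url)
```

Without the encoding step, characters such as `@`, `/`, or `:` in a password would be read as URL delimiters.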

### Step 4: Store Secrets in AWS Secrets Manager

Store your Pro credentials securely:

1. **OAuth2 Client Secret**
   * Secret name: `zenml/pro/oauth2-client-secret`
   * Value: Your workspace enrollment key
2. (Optional) **Database Password**
   * Secret name: `zenml/rds/password`
   * Value: Your RDS database password

Note the ARN of your OAuth2 secret - you'll reference it in the task definition.

### Step 5: Create ECS IAM Roles

Create two IAM roles:

#### Task Execution Role

This role allows ECS to pull images and manage logs:

* Attach: `AmazonECSTaskExecutionRolePolicy`
* Add an inline policy with two statements:
  * Secrets Manager access
    * Action: `secretsmanager:GetSecretValue`
    * Resource: Your OAuth2 secret ARN
  * CloudWatch Logs access
    * Action: `logs:CreateLogGroup`, `logs:CreateLogStream`, `logs:PutLogEvents`
    * Resource: Your CloudWatch log group
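
The execution role's inline policy can be sketched as an IAM policy document, built here as a Python dictionary and printed as JSON. The account ID, region, and ARNs are placeholders, and the `Sid` values and function name are hypothetical. The trust policy reflects the fact that ECS tasks assume roles through the `ecs-tasks.amazonaws.com` service principal.

```python
import json

# Trust policy that lets ECS tasks assume the role.
ECS_TASKS_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def execution_role_inline_policy(secret_arn, log_group_arn):
    """Build the inline policy with one statement per resource type."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadOAuth2Secret",
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": [secret_arn],
            },
            {
                "Sid": "WriteTaskLogs",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                # The ":*" suffix covers the log streams inside the group.
                "Resource": [log_group_arn, f"{log_group_arn}:*"],
            },
        ],
    }


# Placeholder ARNs; replace region, account ID, and secret name suffix.
inline_policy = execution_role_inline_policy(
    "arn:aws:secretsmanager:eu-central-1:123456789012:secret:zenml/pro/oauth2-client-secret",
    "arn:aws:logs:eu-central-1:123456789012:log-group:/ecs/zenml-hybrid",
)
print(json.dumps(inline_policy, indent=2))
```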

#### Task Role

This role is for application-level permissions (optional for basic setup):

* Leave empty for now, or add policies if your tasks need to access other AWS services

### Step 6: Create ECS Task Definition

In the AWS Console or using AWS CLI/Terraform, create a task definition with:

**Task Configuration:**

* **Compatibility**: FARGATE
* **CPU**: 512 (0.5 vCPU)
* **Memory**: 1024 MB
* **Network Mode**: awsvpc
* **Execution Role**: Task execution role created above
* **Task Role**: Task role created above

**Container Configuration:**

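* **Name**: `zenml-server` (the container name referenced by the load balancer configuration in Step 7)
* **Image**: `715803424590.dkr.ecr.eu-central-1.amazonaws.com/zenml-pro-server:<ZENML_OSS_VERSION>`
* **Port Mapping**: Container port 8000 (with `awsvpc` networking, the host port is always the same as the container port)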
* **Essential**: Yes

**Environment Variables:**

Set these in the task definition:

| Variable                             | Value                                                                                      |
| ------------------------------------ | ------------------------------------------------------------------------------------------ |
| `ZENML_SERVER_DEPLOYMENT_TYPE`       | `cloud`                                                                                    |
| `ZENML_SERVER_PRO_API_URL`           | `https://cloudapi.zenml.io`                                                                |
| `ZENML_SERVER_PRO_DASHBOARD_URL`     | `https://cloud.zenml.io`                                                                   |
| `ZENML_SERVER_PRO_ORGANIZATION_ID`   | Your organization ID from enrollment                                                       |
| `ZENML_SERVER_PRO_ORGANIZATION_NAME` | Your organization name from enrollment                                                     |
| `ZENML_SERVER_PRO_WORKSPACE_ID`      | Your workspace ID from enrollment                                                          |
| `ZENML_SERVER_PRO_WORKSPACE_NAME`    | Your workspace name from enrollment                                                        |
| `ZENML_SERVER_PRO_OAUTH2_AUDIENCE`   | `https://cloudapi.zenml.io`                                                                |
| `ZENML_SERVER_SERVER_URL`            | `https://zenml.mycompany.com`                                                              |
| `ZENML_DATABASE_URL`                 | `mysql://user:password@hostname:3306/zenml_hybrid` (MySQL only - PostgreSQL not supported) |
| `ZENML_SERVER_HOSTNAME`              | `0.0.0.0`                                                                                  |
| `ZENML_SERVER_PORT`                  | `8000`                                                                                     |
| `ZENML_LOGGING_LEVEL`                | `INFO`                                                                                     |

**Secrets:**

Reference your secret from Secrets Manager:

| Variable                                | Secret                                                                        |
| --------------------------------------- | ----------------------------------------------------------------------------- |
| `ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRET` | `arn:aws:secretsmanager:region:account:secret:zenml/pro/oauth2-client-secret` |

**Logging:**

Configure CloudWatch logs:

* **Log Group**: `/ecs/zenml-hybrid`
* **Log Stream Prefix**: `ecs`
* **Region**: Your AWS region
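
Put together, the settings in this step correspond to a task definition document like the one sketched below. The field names follow the ECS task definition JSON format; the role ARNs, region, account ID, and all values in angle brackets are placeholders, and the function name is hypothetical. The printed JSON could be saved to a file and registered with the AWS CLI or an SDK.

```python
import json

IMAGE_REPOSITORY = "715803424590.dkr.ecr.eu-central-1.amazonaws.com/zenml-pro-server"


def build_task_definition(image_tag, region, execution_role_arn, task_role_arn,
                          environment, oauth2_secret_arn):
    """Return a Fargate task definition matching the settings in this step."""
    container = {
        "name": "zenml-server",
        "image": f"{IMAGE_REPOSITORY}:{image_tag}",
        "essential": True,
        "portMappings": [{"containerPort": 8000, "protocol": "tcp"}],
        "environment": [{"name": k, "value": v} for k, v in environment.items()],
        # Secrets are injected by ECS at start-up, not stored in plain text.
        "secrets": [
            {"name": "ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRET", "valueFrom": oauth2_secret_arn}
        ],
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "/ecs/zenml-hybrid",
                "awslogs-region": region,
                "awslogs-stream-prefix": "ecs",
            },
        },
    }
    return {
        "family": "zenml-hybrid",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": "512",      # 0.5 vCPU
        "memory": "1024",  # MB
        "executionRoleArn": execution_role_arn,
        "taskRoleArn": task_role_arn,
        "containerDefinitions": [container],
    }


# Placeholder values; fill in the data collected during enrollment.
environment = {
    "ZENML_SERVER_DEPLOYMENT_TYPE": "cloud",
    "ZENML_SERVER_PRO_API_URL": "https://cloudapi.zenml.io",
    "ZENML_SERVER_PRO_DASHBOARD_URL": "https://cloud.zenml.io",
    "ZENML_SERVER_PRO_ORGANIZATION_ID": "<ORGANIZATION_ID>",
    "ZENML_SERVER_PRO_ORGANIZATION_NAME": "<ORGANIZATION_NAME>",
    "ZENML_SERVER_PRO_WORKSPACE_ID": "<WORKSPACE_ID>",
    "ZENML_SERVER_PRO_WORKSPACE_NAME": "<WORKSPACE_NAME>",
    "ZENML_SERVER_PRO_OAUTH2_AUDIENCE": "https://cloudapi.zenml.io",
    "ZENML_SERVER_SERVER_URL": "https://zenml.mycompany.com",
    "ZENML_DATABASE_URL": "mysql://user:password@hostname:3306/zenml_hybrid",
    "ZENML_SERVER_HOSTNAME": "0.0.0.0",
    "ZENML_SERVER_PORT": "8000",
    "ZENML_LOGGING_LEVEL": "INFO",
}

task_definition = build_task_definition(
    image_tag="<ZENML_OSS_VERSION>",
    region="eu-central-1",
    execution_role_arn="arn:aws:iam::123456789012:role/zenml-task-execution-role",
    task_role_arn="arn:aws:iam::123456789012:role/zenml-task-role",
    environment=environment,
    oauth2_secret_arn="arn:aws:secretsmanager:region:account:secret:zenml/pro/oauth2-client-secret",
)
print(json.dumps(task_definition, indent=2))
```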

### Step 7: Create ECS Cluster and Service

Create an ECS cluster named `zenml-hybrid`.

Then create an ECS service within this cluster:

**Service Configuration:**

* **Cluster**: zenml-hybrid
* **Task Definition**: zenml-hybrid (latest version)
* **Launch Type**: FARGATE
* **Desired Count**: 1 (or more for high availability)
* **Platform Version**: LATEST

**Network Configuration:**

* **VPC**: Your ZenML VPC
* **Subnets**: Your private subnets
* **Security Group**: ECS security group
* **Public IP**: Disabled (tasks don't need public IPs)

**Load Balancing:**

* **Load Balancer Type**: Application Load Balancer
* **Container**: zenml-server
* **Container Port**: 8000
* (Leave the target group selection for the next step)
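
For reference, the service settings above map onto a request like the following sketch, which uses the parameter names of the ECS `CreateService` API. The subnet IDs, security group ID, and target group ARN are placeholders, the function name is hypothetical, and the service name `zenml-server` is an assumption taken from the name used in the Cleanup section.

```python
def build_service_request(private_subnet_ids, ecs_security_group_id,
                          target_group_arn, desired_count=1):
    """Build CreateService parameters for the ZenML workspace server."""
    return {
        "cluster": "zenml-hybrid",
        "serviceName": "zenml-server",     # assumed service name
        "taskDefinition": "zenml-hybrid",  # family only: latest active revision
        "desiredCount": desired_count,
        "launchType": "FARGATE",
        "platformVersion": "LATEST",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(private_subnet_ids),
                "securityGroups": [ecs_security_group_id],
                "assignPublicIp": "DISABLED",
            }
        },
        "loadBalancers": [
            {
                "targetGroupArn": target_group_arn,
                "containerName": "zenml-server",
                "containerPort": 8000,
            }
        ],
    }


# Placeholder identifiers, for illustration only.
service_request = build_service_request(
    ["subnet-private-a", "subnet-private-b"],
    "sg-ecs-placeholder",
    "arn:aws:elasticloadbalancing:region:account:targetgroup/zenml-hybrid-tg/placeholder",
)
```

With boto3, a dictionary of this shape could be passed as keyword arguments to the ECS client's `create_service` call.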

### Step 8: Set Up Application Load Balancer

Create an Application Load Balancer (ALB):

**Configuration:**

* **Subnets**: Your public subnets
* **Security Group**: ALB security group

#### Target Group

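Create a target group for your ECS service. Use target type **IP**, which Fargate tasks in `awsvpc` network mode require, with protocol HTTP on port 8000.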

**Health Check Configuration:**

* **Protocol**: HTTP
* **Path**: `/health`
* **Port**: 8000
* **Interval**: 30 seconds
* **Timeout**: 5 seconds
* **Healthy Threshold**: 2
* **Unhealthy Threshold**: 3
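
The health check settings can be expressed with the parameter names of the Elastic Load Balancing `CreateTargetGroup` API, as in the sketch below. The VPC ID and target group name are placeholders and the function names are hypothetical. The second helper works out roughly how long a failing task keeps receiving traffic under these settings.

```python
def build_target_group_request(vpc_id, name="zenml-hybrid-tg"):
    """Build CreateTargetGroup parameters for Fargate tasks on port 8000."""
    return {
        "Name": name,
        "Protocol": "HTTP",
        "Port": 8000,
        "VpcId": vpc_id,
        "TargetType": "ip",  # required for tasks using awsvpc networking
        "HealthCheckProtocol": "HTTP",
        "HealthCheckPath": "/health",
        "HealthCheckPort": "8000",
        "HealthCheckIntervalSeconds": 30,
        "HealthCheckTimeoutSeconds": 5,
        "HealthyThresholdCount": 2,
        "UnhealthyThresholdCount": 3,
    }


def approx_seconds_to_unhealthy(request):
    """Consecutive failed checks multiplied by the check interval."""
    return request["UnhealthyThresholdCount"] * request["HealthCheckIntervalSeconds"]


target_group_request = build_target_group_request("vpc-placeholder")
print(approx_seconds_to_unhealthy(target_group_request))
```

With three failed checks at a 30-second interval, a task is marked unhealthy after roughly 90 seconds; lower the interval or threshold if faster failover matters more than tolerance for brief slowdowns.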

#### Listeners

Create two listeners on your ALB:

1. **HTTPS Listener (Port 443)**
   * **Certificate**: Your TLS certificate from ACM or imported
   * **Default Action**: Forward to your target group
2. **HTTP Listener (Port 80)**
   * **Default Action**: Redirect to HTTPS (port 443)

### Step 9: Configure DNS

In your DNS provider (Route 53 or external):

1. Create a DNS record pointing to your ALB's DNS name
   * **Name**: `zenml.mycompany.com`
   * **Target**: Your ALB's DNS name (not an IP address, because ALB IP addresses can change)
   * **Type**: Alias A record in Route 53, or a CNAME record with an external DNS provider
2. Allow time for DNS propagation (typically 5-15 minutes)

### Step 10: Verify the Deployment

1. **Check ECS Service Status**
   * Go to ECS console → Clusters → zenml-hybrid → Services
   * Verify the service shows "Active"
   * Check that desired and running task counts match
2. **Check Task Logs**
   * Go to CloudWatch → Log Groups → `/ecs/zenml-hybrid`
   * View log stream to look for startup messages
   * Verify no critical errors appear
3. **Test HTTPS Access**
   * Visit `https://zenml.mycompany.com` in your browser
   * You should be redirected to the ZenML Pro login page at `cloud.zenml.io`
4. **Verify Control Plane Connection**
   * In CloudWatch logs, look for messages indicating successful connection to ZenML Cloud
   * Check for any authentication or SSL errors
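
The HTTPS check can also be scripted. The sketch below probes the `/health` path used by the target group health check, assuming the ALB forwards that path unchanged to the server; the function names are hypothetical. It uses only the Python standard library and makes no request until `probe` is called.

```python
import urllib.request


def health_url(server_url):
    """Return the health endpoint for a server base URL."""
    return server_url.rstrip("/") + "/health"


def probe(server_url, timeout=10.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(server_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers DNS failures, refused connections, TLS errors,
        # timeouts, and HTTP error responses.
        return False


print(health_url("https://zenml.mycompany.com"))
```

Calling `probe("https://zenml.mycompany.com")` from a machine that should have access exercises DNS, the TLS certificate, the ALB listener, and the target group in one step.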

## Network & Firewall Requirements

### Outbound Access to ZenML Cloud

Your ECS tasks need HTTPS (port 443) outbound access to:

* `cloudapi.zenml.io` - For control plane authentication

This is enabled by the NAT Gateway and ECS security group configuration.

### Inbound Access from Clients

Clients need HTTPS (port 443) access to:

* `zenml.mycompany.com` - Your ALB endpoint

This is enabled by the ALB and ALB security group configuration.

### Database Access

ECS tasks need TCP access to:

* Your RDS instance on port 3306 (MySQL)

This is enabled by the ECS security group egress rule and RDS security group ingress rule.

## Scaling & High Availability

### Multiple Tasks

For high availability:

1. Update the ECS service's desired count to 2 or more
2. ECS will distribute tasks across availability zones
3. The ALB automatically distributes traffic to all healthy tasks

### Auto Scaling (Optional)

To automatically scale based on CPU or memory usage:

1. Register a scalable target (your ECS service)
2. Create a target tracking scaling policy
3. Set target CPU utilization (e.g., 70%)
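
These three steps correspond to two Application Auto Scaling requests, sketched below with the parameter names of the `RegisterScalableTarget` and `PutScalingPolicy` APIs. The service name, policy name, and capacity limits are assumptions for illustration, and the function name is hypothetical.

```python
def build_autoscaling_requests(cluster, service, min_tasks=1, max_tasks=4,
                               target_cpu=70.0):
    """Build the scalable target and target tracking policy requests."""
    common = {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
    }
    scalable_target = {**common, "MinCapacity": min_tasks, "MaxCapacity": max_tasks}
    scaling_policy = {
        **common,
        "PolicyName": "zenml-cpu-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
        },
    }
    return scalable_target, scaling_policy


# "zenml-server" is an assumed service name.
scalable_target, scaling_policy = build_autoscaling_requests("zenml-hybrid", "zenml-server")
```

Target tracking adds or removes tasks to keep average CPU near the target value, so no separate scale-in and scale-out alarms are needed.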

## Monitoring & Logging

### CloudWatch Logs

Monitor your deployment:

1. Go to CloudWatch → Log Groups → `/ecs/zenml-hybrid`
2. Set up log filters to find errors: filter for `ERROR` or `CRITICAL`
3. Create metric filters if needed
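
The same `ERROR` or `CRITICAL` filter can be applied to exported log lines locally. The following is a minimal sketch; the sample log lines are invented for illustration and do not reflect the server's actual log format.

```python
import re

# Match the severity names as whole words only.
SEVERITY_PATTERN = re.compile(r"\b(ERROR|CRITICAL)\b")


def find_problem_lines(lines):
    """Return the log lines that mention ERROR or CRITICAL."""
    return [line for line in lines if SEVERITY_PATTERN.search(line)]


# Invented sample lines, not real ZenML output.
sample_lines = [
    "2024-01-01 12:00:00 INFO server started",
    "2024-01-01 12:00:05 ERROR database connection failed",
    "2024-01-01 12:00:09 CRITICAL shutting down",
    "2024-01-01 12:00:10 INFO retrying",
]
problem_lines = find_problem_lines(sample_lines)
print(len(problem_lines))
```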

### CloudWatch Alarms

Create alarms for:

* **High CPU Utilization**: Alert when average CPU > 80%
* **Failed Tasks**: Alert when tasks exit unexpectedly
* **Unhealthy Targets**: Alert when ALB marks tasks as unhealthy
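
Two of these alarms map directly onto standard CloudWatch metrics and are sketched below with the parameter names of the `PutMetricAlarm` API. The alarm names, periods, evaluation counts, service name, and dimension values are assumptions for illustration. The failed-tasks alert is not covered here because it has no equally direct metric.

```python
def cpu_alarm(cluster, service, threshold=80.0):
    """Alarm when average service CPU stays above the threshold."""
    return {
        "AlarmName": f"{service}-high-cpu",
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 2,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def unhealthy_targets_alarm(target_group_dimension, load_balancer_dimension):
    """Alarm when the ALB reports at least one unhealthy target."""
    return {
        "AlarmName": "zenml-unhealthy-targets",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "TargetGroup", "Value": target_group_dimension},
            {"Name": "LoadBalancer", "Value": load_balancer_dimension},
        ],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Placeholder names and dimension values.
alarms = [
    cpu_alarm("zenml-hybrid", "zenml-server"),
    unhealthy_targets_alarm("targetgroup/zenml-hybrid-tg/placeholder",
                            "app/zenml-alb/placeholder"),
]
```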

### Application Logs

For production deployments:

1. Forward CloudWatch logs to your centralized logging system (ELK, Datadog, etc.)
2. Set up alerts for authentication failures to ZenML Cloud
3. Monitor database connection errors

## Database Maintenance

### Backups

Automated backups are enabled as part of the RDS setup, but you should also:

1. Verify backup retention is set to at least 30 days
2. Test backup restoration periodically
3. Copy backups to a different region for disaster recovery

### Monitoring

Monitor database health:

1. Check RDS Performance Insights for slow queries
2. Review CloudWatch metrics for connection count and CPU
3. Monitor free storage space and create alerts

## Troubleshooting

### Task Won't Start

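Check the stopped task details and the ECS task logs:

1. In the ECS console, open the stopped task and read its "Stopped reason". Image pull and secret retrieval failures are reported there, because the container never starts and writes no logs
2. Go to the `/ecs/zenml-hybrid` log group in CloudWatch and look for error messages about environment variables or configuration
3. Verify that the IAM execution role has the correct permissions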

### Database Connection Failed

1. Verify database is running and accessible
2. Check ECS security group allows outbound to RDS security group
3. Verify `ZENML_DATABASE_URL` has correct hostname, port, and credentials
4. Test connectivity from an ECS task using a MySQL client

### Can't Reach Server via HTTPS

1. Verify ALB is in "Active" state
2. Check ALB target group - tasks should show "Healthy"
3. Verify TLS certificate is valid for your domain
4. Check DNS resolution: `nslookup zenml.mycompany.com`

### Control Plane Connection Issues

Check CloudWatch logs for:

1. OAuth2 authentication errors - verify `ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRET` is correct
2. Network connectivity errors - verify NAT Gateway is operational
3. Certificate validation errors - verify outbound HTTPS to cloudapi.zenml.io works

## Updating the Deployment

### Update Configuration

1. Create a new task definition revision with the modified environment variables (existing revisions cannot be edited)
2. Update the ECS service to use the new revision
3. ECS gradually replaces old tasks with new ones
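
Because a configuration change means registering a new revision, it can help to derive the new task definition from the current one instead of editing by hand. The sketch below does this on a plain dictionary; the function name and the trimmed-down example definition are hypothetical.

```python
import copy


def with_updated_environment(task_definition, updates, container_name="zenml-server"):
    """Return a copy of the task definition with environment values changed."""
    new_definition = copy.deepcopy(task_definition)
    for container in new_definition["containerDefinitions"]:
        if container["name"] != container_name:
            continue
        merged = {item["name"]: item["value"] for item in container.get("environment", [])}
        merged.update(updates)
        container["environment"] = [{"name": k, "value": v} for k, v in merged.items()]
    return new_definition


# Trimmed-down example definition, for illustration only.
current = {
    "family": "zenml-hybrid",
    "containerDefinitions": [
        {
            "name": "zenml-server",
            "environment": [{"name": "ZENML_LOGGING_LEVEL", "value": "INFO"}],
        }
    ],
}
updated = with_updated_environment(current, {"ZENML_LOGGING_LEVEL": "DEBUG"})
```

The original dictionary is left untouched, which makes it easy to compare the two revisions before registering the new one.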

### Upgrade ZenML Version

1. Create a new task definition revision that points to the new container image tag
2. Update the ECS service to use the new revision
3. Monitor CloudWatch logs during the update

## Cleanup

To remove the deployment:

1. **Delete ECS Service**
   * Go to ECS → Clusters → zenml-hybrid → Services
   * Set the desired count of the zenml-server service to 0
   * Delete the service
2. **Delete ECS Cluster**
   * Delete the cluster once service is removed
3. **Delete ALB**
   * Go to EC2 → Load Balancers
   * Delete the ALB and associated target groups
4. **Delete RDS Instance**
   * Go to RDS → Databases
   * Delete the zenml-hybrid-db instance
   * Skip final snapshot if you don't need a backup
5. **Delete VPC and Related Resources**
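   * Delete the NAT Gateway, then release its Elastic IP (deleting the gateway does not release the address, and an unattached Elastic IP continues to incur charges)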
   * Delete subnets, route tables, security groups
   * Delete VPC
6. **Clean Up Secrets**
   * Go to Secrets Manager
   * Delete `zenml/pro/oauth2-client-secret` (and `zenml/rds/password`, if you created it)

## Next Steps

* [Configure your organization in ZenML Cloud](https://cloud.zenml.io)
* [Set up users and teams](https://docs.zenml.io/pro/core-concepts/organization)
* [Configure stacks and service connectors](https://docs.zenml.io/concepts/stack_components)
* [Run your first pipeline](https://github.com/zenml-io/zenml/tree/main/examples/quickstart)

## Related Documentation

* [Hybrid Deployment Overview](https://docs.zenml.io/pro/deployments/scenarios/hybrid-deployment)
* [Self-hosted Deployment Overview](https://docs.zenml.io/pro/deployments/scenarios/self-hosted-deployment)
* [AWS ECS Documentation](https://docs.aws.amazon.com/ecs/)
* [AWS RDS Documentation](https://docs.aws.amazon.com/rds/)

