# AWS ECS

This guide provides high-level instructions for deploying ZenML Pro in a Hybrid setup on AWS ECS (Elastic Container Service).

## Architecture Overview

In this setup:

* **ZenML workspace** runs in ECS tasks within your VPC
* **Load balancer** handles HTTPS traffic and routes to ECS tasks
* **Database** stores workspace metadata in AWS RDS
* **Secrets manager** stores Pro credentials securely
* **NAT gateway** enables outbound access to ZenML Cloud control plane

## Prerequisites

Before starting, make sure you go through the [general prerequisites for hybrid deployments](/pro/deployments/deploy-details/deploy-prerequisites.md) and have collected the necessary artifacts and information. Particular requirements for AWS ECS deployments are listed below.

* AWS Account with appropriate IAM permissions
* Basic familiarity with AWS ECS, VPC, and RDS

## Install the ZenML Pro Workspace Server

### Step 1: Enroll the Workspace in the ZenML Pro Control Plane

Make sure to enroll the workspace in the ZenML Pro control plane by following the [Enroll a Workspace in the ZenML Pro Control Plane](/pro/deployments/deploy-details/workspace-server/enroll-workspace.md) guide and collect the necessary enrollment credentials.

### Step 2: Set Up AWS Infrastructure

#### VPC and Subnets

Create a VPC with:

* **Public subnets** (at least 2 across different availability zones) - for the Application Load Balancer
* **Private subnets** (at least 2 across different availability zones) - for ECS tasks and RDS

#### Security Groups

Create three security groups:

1. **ALB Security Group**
   * Inbound: HTTPS (443) and HTTP (80) from `0.0.0.0/0`
   * Outbound: HTTP (8000) to the ECS security group
2. **ECS Security Group**
   * Inbound: HTTP (8000) from the ALB security group
   * Outbound: HTTPS (443) to `0.0.0.0/0` (for ZenML Cloud access)
   * Outbound: TCP (3306 for MySQL) to the RDS security group
3. **RDS Security Group**
   * Inbound: TCP (3306 for MySQL) from the ECS security group
   * Outbound: Not restricted

#### NAT Gateway

To enable ECS tasks to reach ZenML Cloud:

1. Create an Elastic IP in your AWS region
2. Create a NAT Gateway in one of your public subnets
3. Wait for the NAT Gateway to be available

#### Route Tables

For your private subnets (where ECS tasks run):

1. Create a route table
2. Add a default route (`0.0.0.0/0`) pointing to the NAT Gateway
3. Associate this route table with your private subnets

### Step 3: Set Up RDS Database

Create an RDS database instance. **Important**: Workspace servers only support MySQL, not PostgreSQL.

**Configuration:**

* **DB Engine**: MySQL 8.0+ (PostgreSQL is not supported for workspace servers)
* **Instance Class**: `db.t3.micro` or larger depending on expected load
* **Storage**: 100 GB initial (with automatic scaling enabled)
* **Multi-AZ**: Enable for production deployments
* **VPC**: Your ZenML VPC
* **Subnet Group**: Create a DB subnet group with your private subnets
* **Security Group**: RDS security group created above
* **Backups**: 30 days retention minimum
* **Logs**: Enable error, general, and slowquery logs to CloudWatch

**After creation:**

1. Note the database endpoint (hostname)
2. Create the initial database: `zenml_hybrid`
3. Create a database user with full permissions on the database

### Step 4: Store Secrets in AWS Secrets Manager

Store your Pro credentials securely:

1. **OAuth2 Client Secret**
   * Secret name: `zenml/pro/oauth2-client-secret`
   * Value: Your workspace enrollment key
2. (Optional) **Database Password**
   * Secret name: `zenml/rds/password`
   * Value: Your RDS database password

Note the ARN of your OAuth2 secret - you'll reference it in the task definition.

### Step 5: Create ECS IAM Roles

Create two IAM roles:

#### Task Execution Role

This role allows ECS to pull images and manage logs:

* Attach: `AmazonECSTaskExecutionRolePolicy`
* Add inline policy for Secrets Manager access:
  * Action: `secretsmanager:GetSecretValue`
  * Resource: Your OAuth2 secret ARN
  * Action: `logs:CreateLogGroup`, `logs:CreateLogStream`, `logs:PutLogEvents`
  * Resource: Your CloudWatch log group

#### Task Role

This role is for application-level permissions (optional for basic setup):

* Leave empty for now, or add policies if your tasks need to access other AWS services

### Step 6: Create ECS Task Definition

In the AWS Console or using AWS CLI/Terraform, create a task definition with:

**Task Configuration:**

* **Compatibility**: FARGATE
* **CPU**: 512 (0.5 vCPU)
* **Memory**: 1024 MB
* **Network Mode**: awsvpc
* **Execution Role**: Task execution role created above
* **Task Role**: Task role created above

**Container Configuration:**

* **Image**: `715803424590.dkr.ecr.eu-central-1.amazonaws.com/zenml-pro-server:<ZENML_OSS_VERSION>`
* **Port Mapping**: Container port 8000 to port 8000
* **Essential**: Yes

**Environment Variables:**

Set these in the task definition:

| Variable                             | Value                                                                                      |
| ------------------------------------ | ------------------------------------------------------------------------------------------ |
| `ZENML_SERVER_DEPLOYMENT_TYPE`       | `cloud`                                                                                    |
| `ZENML_SERVER_PRO_API_URL`           | `https://cloudapi.zenml.io`                                                                |
| `ZENML_SERVER_PRO_DASHBOARD_URL`     | `https://cloud.zenml.io`                                                                   |
| `ZENML_SERVER_PRO_ORGANIZATION_ID`   | Your organization ID from enrollment                                                       |
| `ZENML_SERVER_PRO_ORGANIZATION_NAME` | Your organization name from enrollment                                                     |
| `ZENML_SERVER_PRO_WORKSPACE_ID`      | Your workspace ID from enrollment                                                          |
| `ZENML_SERVER_PRO_WORKSPACE_NAME`    | Your workspace name from enrollment                                                        |
| `ZENML_SERVER_PRO_OAUTH2_AUDIENCE`   | `https://cloudapi.zenml.io`                                                                |
| `ZENML_SERVER_SERVER_URL`            | `https://zenml.mycompany.com`                                                              |
| `ZENML_DATABASE_URL`                 | `mysql://user:password@hostname:3306/zenml_hybrid` (MySQL only - PostgreSQL not supported) |
| `ZENML_SERVER_HOSTNAME`              | `0.0.0.0`                                                                                  |
| `ZENML_SERVER_PORT`                  | `8000`                                                                                     |
| `ZENML_LOGGING_LEVEL`                | `INFO`                                                                                     |

**Secrets:**

Reference your secret from Secrets Manager:

| Variable                                | Secret                                                                        |
| --------------------------------------- | ----------------------------------------------------------------------------- |
| `ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRET` | `arn:aws:secretsmanager:region:account:secret:zenml/pro/oauth2-client-secret` |

**Logging:**

Configure CloudWatch logs:

* **Log Group**: `/ecs/zenml-hybrid`
* **Log Stream Prefix**: `ecs`
* **Region**: Your AWS region

### Step 7: Create ECS Cluster and Service

Create an ECS cluster named `zenml-hybrid`.

Then create an ECS service within this cluster:

**Service Configuration:**

* **Cluster**: zenml-hybrid
* **Task Definition**: zenml-hybrid (latest version)
* **Launch Type**: FARGATE
* **Desired Count**: 1 (or more for high availability)
* **Platform Version**: LATEST

**Network Configuration:**

* **VPC**: Your ZenML VPC
* **Subnets**: Your private subnets
* **Security Group**: ECS security group
* **Public IP**: Disabled (tasks don't need public IPs)

**Load Balancing:**

* **Load Balancer Type**: Application Load Balancer
* **Container**: zenml-server
* **Container Port**: 8000
* (Leave the target group selection for the next step)

### Step 8: Set Up Application Load Balancer

Create an Application Load Balancer (ALB):

**Configuration:**

* **Subnets**: Your public subnets
* **Security Group**: ALB security group

#### Target Group

Create a target group for your ECS service:

**Health Check Configuration:**

* **Protocol**: HTTP
* **Path**: `/health`
* **Port**: 8000
* **Interval**: 30 seconds
* **Timeout**: 5 seconds
* **Healthy Threshold**: 2
* **Unhealthy Threshold**: 3

#### Listeners

Create two listeners on your ALB:

1. **HTTPS Listener (Port 443)**
   * **Certificate**: Your TLS certificate from ACM or imported
   * **Default Action**: Forward to your target group
2. **HTTP Listener (Port 80)**
   * **Default Action**: Redirect to HTTPS (port 443)

### Step 9: Configure DNS

In your DNS provider (Route 53 or external):

1. Create an A record (or CNAME) pointing to your ALB's DNS name
   * **Name**: `zenml.mycompany.com`
   * **Target**: Your ALB's DNS name or IP
   * **Type**: A record (use Alias if in Route 53)
2. Allow time for DNS propagation (typically 5-15 minutes)

### Step 10: Verify the Deployment

1. **Check ECS Service Status**
   * Go to ECS console → Clusters → zenml-hybrid → Services
   * Verify the service shows "Active"
   * Check that desired and running task counts match
2. **Check Task Logs**
   * Go to CloudWatch → Log Groups → `/ecs/zenml-hybrid`
   * View log stream to look for startup messages
   * Verify no critical errors appear
3. **Test HTTPS Access**
   * Visit `https://zenml.mycompany.com` in your browser
   * You should see ZenML Pro login redirecting to cloud.zenml.io
4. **Verify Control Plane Connection**
   * In CloudWatch logs, look for messages indicating successful connection to ZenML Cloud
   * Check for any authentication or SSL errors

## Network & Firewall Requirements

### Outbound Access to ZenML Cloud

Your ECS tasks need HTTPS (port 443) outbound access to:

* `cloudapi.zenml.io` - For control plane authentication

This is enabled by the NAT Gateway and ECS security group configuration.

### Inbound Access from Clients

Clients need HTTPS (port 443) inbound access to:

* `zenml.mycompany.com` - Your ALB endpoint

This is enabled by the ALB and ALB security group configuration.

### Database Access

ECS tasks need TCP access to:

* Your RDS instance on port 3306 (MySQL)

This is enabled by the ECS security group egress rule and RDS security group ingress rule.

## Scaling & High Availability

### Multiple Tasks

For high availability:

1. Update the ECS service's desired count to 2 or more
2. ECS will distribute tasks across availability zones
3. The ALB automatically distributes traffic to all healthy tasks

### Auto Scaling (Optional)

To automatically scale based on CPU or memory usage:

1. Register a scalable target (your ECS service)
2. Create a target tracking scaling policy
3. Set target CPU utilization (e.g., 70%)

## Monitoring & Logging

### CloudWatch Logs

Monitor your deployment:

1. Go to CloudWatch → Log Groups → `/ecs/zenml-hybrid`
2. Set up log filters to find errors: filter for `ERROR` or `CRITICAL`
3. Create metric filters if needed

### CloudWatch Alarms

Create alarms for:

* **High CPU Utilization**: Alert when average CPU > 80%
* **Failed Tasks**: Alert when tasks exit unexpectedly
* **Unhealthy Targets**: Alert when ALB marks tasks as unhealthy

### Application Logs

For production deployments:

1. Forward CloudWatch logs to your centralized logging system (ELK, Datadog, etc.)
2. Set up alerts for authentication failures to ZenML Cloud
3. Monitor database connection errors

## Database Maintenance

### Backups

Automated backups are configured, but:

1. Verify backup retention is set to at least 30 days
2. Test backup restoration periodically
3. Store backups in a different region for disaster recovery

### Monitoring

Monitor database health:

1. Check RDS Performance Insights for slow queries
2. Review CloudWatch metrics for connection count and CPU
3. Monitor free storage space and create alerts

## Troubleshooting

### Task Won't Start

Check ECS task logs in CloudWatch:

1. Go to `/ecs/zenml-hybrid` log group
2. Look for error messages about image pull failures or environment variable issues
3. Verify IAM execution role has correct permissions

### Database Connection Failed

1. Verify database is running and accessible
2. Check ECS security group allows outbound to RDS security group
3. Verify `ZENML_DATABASE_URL` has correct hostname, port, and credentials
4. Test connectivity from an ECS task using a MySQL client

### Can't Reach Server via HTTPS

1. Verify ALB is in "Active" state
2. Check ALB target group - tasks should show "Healthy"
3. Verify TLS certificate is valid for your domain
4. Check DNS resolution: `nslookup zenml.mycompany.com`

### Control Plane Connection Issues

Check CloudWatch logs for:

1. OAuth2 authentication errors - verify `ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRET` is correct
2. Network connectivity errors - verify NAT Gateway is operational
3. Certificate validation errors - verify outbound HTTPS to cloudapi.zenml.io works

## Updating the Deployment

### Update Configuration

1. Modify environment variables in the task definition
2. Create a new task definition revision
3. Update the ECS service to use the new task definition
4. ECS will gradually replace old tasks with new ones

### Upgrade ZenML Version

1. Update the container image in the task definition
2. Create a new task definition revision
3. Update the ECS service
4. Monitor CloudWatch logs during the update

## Cleanup

To remove the deployment:

1. **Delete ECS Service**
   * Go to ECS → Clusters → zenml-hybrid → Services
   * Delete the zenml-server service
   * Set desired count to 0 first
2. **Delete ECS Cluster**
   * Delete the cluster once service is removed
3. **Delete ALB**
   * Go to EC2 → Load Balancers
   * Delete the ALB and associated target groups
4. **Delete RDS Instance**
   * Go to RDS → Databases
   * Delete the zenml-hybrid-db instance
   * Skip final snapshot if you don't need a backup
5. **Delete VPC and Related Resources**
   * Delete NAT Gateway (releases Elastic IP)
   * Delete subnets, route tables, security groups
   * Delete VPC
6. **Clean Up Secrets**
   * Go to Secrets Manager
   * Delete zenml/pro/oauth2-client-secret

## Next Steps

* [Configure your organization in ZenML Cloud](https://cloud.zenml.io)
* [Set up users and teams](/pro/core-concepts/organization.md)
* [Configure stacks and service connectors](https://docs.zenml.io/concepts/stack_components)
* [Run your first pipeline](https://github.com/zenml-io/zenml/tree/main/examples/quickstart)

## Related Documentation

* [Hybrid Deployment Overview](/pro/deployments/scenarios/hybrid-deployment.md)
* [Self-hosted Deployment Overview](/pro/deployments/scenarios/self-hosted-deployment.md)
* [AWS ECS Documentation](https://docs.aws.amazon.com/ecs/)
* [AWS RDS Documentation](https://docs.aws.amazon.com/rds/)

<figure><img src="https://static.scarf.sh/a.png?x-pxid=f0b4f458-0a54-4fcd-aa95-d5ee424815bc" alt="ZenML Scarf"><figcaption></figcaption></figure>


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.zenml.io/pro/deployments/deploy-details/workspace-server/deploy-workspace-ecs.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
