You're a system architect tasked with setting up a scalable ML infrastructure that needs to:
Support multiple ML teams with different requirements
Work across multiple environments (dev, staging, prod)
Maintain security and compliance standards
Allow teams to iterate quickly without infrastructure bottlenecks
The ZenML Approach
ZenML introduces stack components as abstractions over infrastructure resources. Let's explore how to architect this effectively with Terraform using the official ZenML provider.
Part 1: Foundation - Stack Component Architecture
The Problem
Different teams need different ML infrastructure configurations, but you want to maintain consistency and reusability.
The Solution: Component-Based Architecture
Start by breaking down your infrastructure into reusable modules that map to ZenML stack components:
    # modules/zenml_stack_base/main.tf
    terraform {
      required_providers {
        zenml = {
          source = "zenml-io/zenml"
        }
        google = {
          source = "hashicorp/google"
        }
      }
    }

    resource "random_id" "suffix" {
      # 6 random bytes: 12 characters as hex (used below), 8 as base64
      byte_length = 6
    }

    # Create base infrastructure resources, including a shared object storage,
    # and container registry. This module should also create resources used to
    # authenticate with the cloud provider and authorize access to the resources
    # (e.g. user accounts, service accounts, workload identities, roles,
    # permissions etc.)
    module "base_infrastructure" {
      source      = "./modules/base_infra"
      environment = var.environment
      project_id  = var.project_id
      region      = var.region

      # Generate consistent random naming across resources
      resource_prefix = "zenml-${var.environment}-${random_id.suffix.hex}"
    }

    # Create a flexible service connector for authentication
    resource "zenml_service_connector" "base_connector" {
      name        = "${var.environment}-base-connector"
      type        = "gcp"
      auth_method = "service-account"

      configuration = {
        project_id           = var.project_id
        region               = var.region
        service_account_json = module.base_infrastructure.service_account_key
      }

      labels = {
        environment = var.environment
      }
    }

    # Create base stack components
    resource "zenml_stack_component" "artifact_store" {
      name   = "${var.environment}-artifact-store"
      type   = "artifact_store"
      flavor = "gcp"

      configuration = {
        path = "gs://${module.base_infrastructure.artifact_store_bucket}/artifacts"
      }

      connector_id = zenml_service_connector.base_connector.id
    }

    resource "zenml_stack_component" "container_registry" {
      name   = "${var.environment}-container-registry"
      type   = "container_registry"
      flavor = "gcp"

      configuration = {
        uri = module.base_infrastructure.container_registry_uri
      }

      connector_id = zenml_service_connector.base_connector.id
    }

    resource "zenml_stack_component" "orchestrator" {
      name   = "${var.environment}-orchestrator"
      type   = "orchestrator"
      flavor = "vertex"

      configuration = {
        location                 = var.region
        workload_service_account = module.base_infrastructure.service_account_email
      }

      connector_id = zenml_service_connector.base_connector.id
    }

    # Create the base stack
    resource "zenml_stack" "base_stack" {
      name = "${var.environment}-base-stack"

      components = {
        artifact_store     = zenml_stack_component.artifact_store.id
        container_registry = zenml_stack_component.container_registry.id
        orchestrator       = zenml_stack_component.orchestrator.id
      }

      labels = {
        environment = var.environment
        type        = "base"
      }
    }
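The module above reads `var.environment`, `var.project_id`, and `var.region` without declaring them. A minimal sketch of those declarations and of one module call per environment might look like the following. The file names, the validation rule, the project ID, and the region value are illustrative assumptions rather than part of the original setup.

```hcl
# modules/zenml_stack_base/variables.tf (assumed file name)
variable "environment" {
  description = "Deployment environment this stack belongs to"
  type        = string

  # Assumption: only the three environments named in this article are valid
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}

variable "project_id" {
  description = "GCP project that hosts the stack resources"
  type        = string
}

variable "region" {
  description = "GCP region for the bucket, registry, and Vertex AI"
  type        = string
}

# main.tf in the root configuration (assumed layout)
module "dev_stack" {
  source      = "./modules/zenml_stack_base"
  environment = "dev"
  project_id  = "my-ml-project-dev" # hypothetical project ID
  region      = "europe-west1"      # hypothetical region
}
```

Rejecting unknown environment names at plan time keeps the generated resource names and labels predictable across teams.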
Part 2: Environment Management and Authentication
The Problem
Different environments (dev, staging, prod) require:
Different authentication methods and security levels
Environment-specific resource configurations
Isolation between environments to prevent cross-environment impacts
Consistent management patterns while maintaining flexibility
The Solution: Environment Configuration Pattern with Smart Authentication
Create a flexible service connector setup that adapts to your environment. In development, for example, a service account key may be the most convenient option, while production should authenticate through workload identity. Combine environment-specific configurations with the appropriate authentication method for each environment.
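One way to express this pattern is a per-environment lookup table in `locals` that drives the connector's authentication method. The sketch below is a hedged illustration, not the original configuration: the `env_config` map and the `connector_configuration` variable are hypothetical names, and the auth method strings are the ones used elsewhere in this article. The method names a connector actually accepts depend on its type, so check them against the connector documentation.

```hcl
variable "environment" {
  type = string
}

# Hypothetical: credential settings for the selected environment,
# supplied from a secret store or CI variable rather than committed to Git
variable "connector_configuration" {
  type      = map(string)
  sensitive = true
}

locals {
  # Assumed per-environment settings; extend with machine types, quotas, etc.
  env_config = {
    dev     = { auth_method = "service-account" }
    staging = { auth_method = "service-account" }
    prod    = { auth_method = "workload-identity" }
  }

  # Fails at plan time if var.environment is not a key of env_config
  selected = local.env_config[var.environment]
}

resource "zenml_service_connector" "env_connector" {
  name          = "${var.environment}-connector"
  type          = "gcp"
  auth_method   = local.selected.auth_method
  configuration = var.connector_configuration

  labels = {
    environment = var.environment
  }
}
```

Keeping the differences in one map means the resource block stays identical across environments, and adding a new environment is a one-line change.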
Part 3: Resource Sharing and Isolation
The Problem
Different ML projects often require strict data and security isolation to prevent unauthorized access and meet compliance requirements. Giving each project its own isolated resources, such as artifact stores or orchestrators, prevents data leakage and protects the integrity of each project's environment.
The Solution: Resource Scoping Pattern
Implement resource sharing with project isolation:
    locals {
      project_paths = {
        fraud_detection = "projects/fraud_detection/${var.environment}"
        recommendation  = "projects/recommendation/${var.environment}"
      }
    }

    # Create shared artifact store components with project isolation
    resource "zenml_stack_component" "project_artifact_stores" {
      for_each = local.project_paths

      name   = "${each.key}-artifact-store"
      type   = "artifact_store"
      flavor = "gcp"

      configuration = {
        path = "gs://${var.shared_bucket}/${each.value}"
      }

      connector_id = zenml_service_connector.env_connector.id

      labels = {
        project     = each.key
        environment = var.environment
      }
    }

    # The orchestrator is shared across all stacks
    resource "zenml_stack_component" "project_orchestrator" {
      name   = "shared-orchestrator"
      type   = "orchestrator"
      flavor = "vertex"

      configuration = {
        location = var.region
        project  = var.project_id
      }

      connector_id = zenml_service_connector.env_connector.id

      labels = {
        environment = var.environment
      }
    }

    # Create project-specific stacks separated by artifact stores
    resource "zenml_stack" "project_stacks" {
      for_each = local.project_paths

      name = "${each.key}-stack"

      components = {
        artifact_store = zenml_stack_component.project_artifact_stores[each.key].id
        orchestrator   = zenml_stack_component.project_orchestrator.id
      }

      labels = {
        project     = each.key
        environment = var.environment
      }
    }
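To let each team find its own stack without reading the Terraform state, the per-project resources can be exposed as outputs. This is a small sketch that assumes the resource and local names from the snippet above; the output names themselves are hypothetical.

```hcl
# Map of project name to the ID of its ZenML stack
output "project_stack_ids" {
  description = "ZenML stack ID for each project, keyed by project name"
  value       = { for project, stack in zenml_stack.project_stacks : project => stack.id }
}

# Map of project name to the bucket path its artifact store writes to
output "project_artifact_paths" {
  description = "Artifact store path for each project"
  value       = { for project, path in local.project_paths : project => "gs://${var.shared_bucket}/${path}" }
}
```

Because the projects share one bucket and one connector here, the separation is by path prefix. If a project needs stricter guarantees, a dedicated bucket or connector per project is likely the safer choice.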
Part 4: Advanced Stack Management Practices
Service Connector Management
    # Create environment-specific connectors with clear purposes
    resource "zenml_service_connector" "env_connector" {
      name = "${var.environment}-${var.purpose}-connector"
      type = var.connector_type

      # Use workload identity for production
      auth_method = var.environment == "prod" ? "workload-identity" : "service-account"

      # Use a specific resource type and resource ID
      resource_type = var.resource_type
      resource_id   = var.resource_id

      labels = merge(local.common_labels, {
        purpose = var.purpose
      })
    }
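The connector snippet relies on several variables and on `local.common_labels`, none of which are defined in this article. A minimal sketch of those definitions could look like this. The descriptions, the example values mentioned in them, and the `managed_by` label are assumptions for illustration.

```hcl
variable "environment" {
  description = "Deployment environment, e.g. dev, staging, or prod"
  type        = string
}

variable "purpose" {
  description = "What the connector is used for, e.g. artifact-storage"
  type        = string
}

variable "connector_type" {
  description = "ZenML service connector type, e.g. gcp"
  type        = string
}

variable "resource_type" {
  description = "Resource type the connector is scoped to, e.g. gcs-bucket"
  type        = string
}

variable "resource_id" {
  description = "Identifier of the single resource the connector may access"
  type        = string
}

locals {
  # Labels applied to every ZenML resource in this configuration
  common_labels = {
    environment = var.environment
    managed_by  = "terraform" # hypothetical label
  }
}
```

Scoping each connector to one resource type and ID keeps its permissions narrow, which fits the "clear purposes" goal of the snippet above.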
Stack Organization and Dependencies
    # Group related components with clear dependency chains
    module "ml_stack" {
      source = "./modules/ml_stack"

      depends_on = [
        module.base_infrastructure,
        module.security
      ]

      components = {
        # Core components
        artifact_store     = module.storage.artifact_store_id
        container_registry = module.container.registry_id

        # Optional components based on team needs
        orchestrator       = var.needs_orchestrator ? module.compute.orchestrator_id : null
        experiment_tracker = var.needs_tracking ? module.mlflow.tracker_id : null
      }

      labels = merge(local.common_labels, {
        stack_type = "ml-platform"
      })
    }
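The module call above passes `null` for optional components that a team does not need. The contents of `./modules/ml_stack` are not shown in this article, so the following is only a sketch of how such a module might drop the `null` entries before registering the stack. The variable names match the module call; `stack_name` and its default are hypothetical.

```hcl
# modules/ml_stack/main.tf (assumed contents)
variable "components" {
  description = "Component type mapped to component ID; null means not used"
  type        = map(string)
}

variable "labels" {
  description = "Labels to attach to the stack"
  type        = map(string)
  default     = {}
}

variable "stack_name" {
  description = "Name of the registered stack"
  type        = string
  default     = "ml-platform-stack" # hypothetical default
}

resource "zenml_stack" "this" {
  name = var.stack_name

  # Keep only the components that were actually provided
  components = {
    for component_type, component_id in var.components :
    component_type => component_id
    if component_id != null
  }

  labels = var.labels
}
```

Filtering inside the module keeps the conditional logic in one place, so callers can toggle optional components with plain boolean variables.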
State Management
    terraform {
      # Separate state files for infrastructure and ZenML: each configuration
      # uses its own prefix. The bucket can be supplied at init time with
      # terraform init -backend-config="bucket=..."
      backend "gcs" {
        prefix = "terraform/state"
      }
    }

    # Use data sources to reference infrastructure state
    data "terraform_remote_state" "infrastructure" {
      backend = "gcs"

      config = {
        bucket = var.state_bucket
        prefix = "terraform/infrastructure"
      }
    }
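With the states split this way, the ZenML configuration can read values that the infrastructure configuration exports. The sketch below assumes the infrastructure state publishes an output named `artifact_store_bucket` and that a connector named `env_connector` exists in the ZenML configuration. Both names, and the `google_storage_bucket.artifacts` resource, are hypothetical.

```hcl
# In the infrastructure configuration: export what the ZenML side needs
output "artifact_store_bucket" {
  value = google_storage_bucket.artifacts.name # hypothetical bucket resource
}

# In the ZenML configuration: consume the exported value via remote state
resource "zenml_stack_component" "artifact_store" {
  name   = "${var.environment}-artifact-store"
  type   = "artifact_store"
  flavor = "gcp"

  configuration = {
    path = "gs://${data.terraform_remote_state.infrastructure.outputs.artifact_store_bucket}/artifacts"
  }

  connector_id = zenml_service_connector.env_connector.id
}
```

Only root-level outputs of the infrastructure configuration are visible through `terraform_remote_state`, so anything the ZenML side needs has to be exported explicitly.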
These practices help keep your infrastructure codebase clean, scalable, and maintainable. Remember to:
Keep configurations DRY using locals and variables
Use consistent naming conventions across resources
Document all required configuration fields
Consider component dependencies when organizing stacks
Separate infrastructure and ZenML registration state
Ensure that the ML operations team manages the registration state to maintain control over the ZenML stack components and their configurations. This helps in keeping the infrastructure and ML operations aligned and allows for better tracking and auditing of changes.
Conclusion
Building ML infrastructure with ZenML and Terraform enables you to create a flexible, maintainable, and secure environment for ML teams. The official ZenML provider simplifies the process while maintaining clean infrastructure patterns.