Containerization
Customize Docker builds to run your pipelines in isolated, well-defined environments.
ZenML executes pipeline steps sequentially in the active Python environment when running locally. However, with remote orchestrators or step operators, ZenML builds Docker images to run your pipeline in an isolated, well-defined environment.
This page explains how ZenML's Docker build process works and how you can customize it to meet your specific requirements.
Understanding Docker Builds in ZenML
When a pipeline is run with a remote orchestrator, a Dockerfile is dynamically generated at runtime. It is then used to build the Docker image using the image builder component of your stack. The Dockerfile consists of the following steps:
Starts from a parent image that has ZenML installed. By default, this will use the official ZenML image for the Python and ZenML version that you're using in the active Python environment.
Installs additional pip dependencies. ZenML automatically detects which integrations are used in your stack and installs the required dependencies.
Optionally copies your source files. Your source files need to be available inside the Docker container so ZenML can execute your step code.
Sets user-defined environment variables.
The process described above is automated by ZenML and covers most basic use cases. This page covers various ways to customize the Docker build process to fit your specific needs.
Docker Build Process
ZenML uses the following process to decide how to build Docker images:
No dockerfile specified: If any of the options regarding requirements, environment variables, or copying files require us to build an image, ZenML will build this image. Otherwise, the parent_image will be used to run the pipeline.
dockerfile specified: ZenML will first build an image based on the specified Dockerfile. If any additional options regarding requirements, environment variables, or copying files require an image built on top of that, ZenML will build a second image. If not, the image built from the specified Dockerfile will be used to run the pipeline.
Requirements Installation Order
Depending on the configuration of your Docker settings, requirements will be installed in the following order (each step is optional):
The packages installed in your local Python environment (if enabled)
The packages required by the stack (unless disabled by setting install_stack_requirements=False)
The packages specified via the required_integrations
The packages specified via the requirements attribute
For a full list of configuration options, check out the DockerSettings object in the SDK docs.
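To make the ordering concrete, here is a minimal sketch combining the attributes involved; the package and integration names are illustrative only:

```python
from zenml.config import DockerSettings

docker_settings = DockerSettings(
    install_stack_requirements=True,    # stack packages are installed second
    required_integrations=["sklearn"],  # integration packages come next
    requirements=["pandas==2.0.0"],     # explicit requirements are installed last
)
```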
Configuring Docker Settings
You can customize Docker builds for your pipelines and steps using the DockerSettings class:
from zenml.config import DockerSettings
There are multiple ways to supply these settings:
Pipeline-Level Settings
Configuring settings on a pipeline applies them to all steps of that pipeline:
from zenml import pipeline, step
from zenml.config import DockerSettings
docker_settings = DockerSettings()
@step
def my_step() -> None:
"""Example step."""
pass
# Either add it to the decorator
@pipeline(settings={"docker": docker_settings})
def my_pipeline() -> None:
my_step()
# Or configure the pipeline's options
my_pipeline = my_pipeline.with_options(
settings={"docker": docker_settings}
)
Step-Level Settings
For more fine-grained control, configure settings on individual steps. This is particularly useful when different steps have conflicting requirements or when some steps need specialized environments:
from zenml import step
from zenml.config import DockerSettings
docker_settings = DockerSettings()
# Either add it to the decorator
@step(settings={"docker": docker_settings})
def my_step() -> None:
pass
# Or configure the step options
my_step = my_step.with_options(
settings={"docker": docker_settings}
)
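For example, here is a sketch of two steps with conflicting framework pins, each getting its own image; the package versions are illustrative only:

```python
from zenml import step
from zenml.config import DockerSettings

# Each step builds its own image containing only what it needs
@step(settings={"docker": DockerSettings(requirements=["tensorflow==2.8.0"])})
def train_tf_model() -> None:
    ...

@step(settings={"docker": DockerSettings(requirements=["torch==1.12.0"])})
def train_torch_model() -> None:
    ...
```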
Using YAML Configuration
Define settings in a YAML configuration file for better separation of code and configuration:
settings:
docker:
parent_image: python:3.9-slim
apt_packages:
- git
- curl
requirements:
- tensorflow==2.8.0
- pandas
steps:
training_step:
settings:
docker:
parent_image: pytorch/pytorch:2.2.0-cuda11.8-cudnn8-runtime
required_integrations:
- wandb
- mlflow
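For example, assuming the file above is saved as config.yaml, you could apply it when configuring the pipeline:

```python
my_pipeline = my_pipeline.with_options(config_path="config.yaml")
my_pipeline()
```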
Check out this page for more information on the hierarchy and precedence of the various ways in which you can supply the settings.
Specifying Docker Build Options
You can customize the build process by specifying build options that get passed to the build method of the image builder:
from zenml import pipeline
from zenml.config import DockerSettings
docker_settings = DockerSettings(
build_config={"build_options": {"buildargs": {"MY_ARG": "value"}}}
)
@pipeline(settings={"docker": docker_settings})
def my_pipeline(...):
...
For the default local image builder, these options are passed to the docker build command.
Using Custom Parent Images
Pre-built Parent Images
To use a static parent image (e.g., with internal dependencies pre-installed):
from zenml import pipeline
from zenml.config import DockerSettings
docker_settings = DockerSettings(parent_image="my_registry.io/image_name:tag")
@pipeline(settings={"docker": docker_settings})
def my_pipeline(...):
...
ZenML will use this image as the base and still perform the following steps:
Install additional pip dependencies
Copy source files (if configured)
Set environment variables
Skip Build Process
To use the image directly to run your steps without including any code or installing any requirements on top of it, skip the Docker builds by setting skip_build=True:
docker_settings = DockerSettings(
parent_image="my_registry.io/image_name:tag",
skip_build=True
)
@pipeline(settings={"docker": docker_settings})
def my_pipeline(...):
...
When skip_build is enabled, the parent_image will be used directly to run the steps of your pipeline without any additional Docker builds on top of it. This means that none of the following will happen:
Installation of packages from your local Python environment
Installation of stack requirements
Installation of required integrations
Installation of specified requirements
Installation of apt packages
Inclusion of source files in the container
Setting of environment variables
This is an advanced feature and may cause unintended behavior when running your pipelines. If you use this, ensure your image contains everything necessary to run your pipeline:
Your stack requirements
Integration requirements
Project-specific requirements
Any system packages
Your project code files (unless a code repository is registered or allow_download_from_artifact_store is enabled)
Make sure that Python, pip, and zenml are installed in your image, and that your code is in the /app directory, which is set as the active working directory.
Also note that the Docker settings validator will raise an error if you set skip_build=True without specifying a parent_image. A parent image is required when skipping the build, as it will be used directly to run your pipeline steps.
Custom Dockerfiles
For greater control, you can specify a custom Dockerfile and build context:
docker_settings = DockerSettings(
dockerfile="/path/to/dockerfile",
build_context_root="/path/to/build/context",
parent_image_build_config={
"build_options": {"buildargs": {"MY_ARG": "value"}},
"dockerignore": "/path/to/.dockerignore"
}
)
@pipeline(settings={"docker": docker_settings})
def my_pipeline(...):
...
Here is what the build process looks like with a custom Dockerfile:
Dockerfile specified: ZenML will first build an image based on the specified Dockerfile. If any options regarding requirements, environment variables, or copying files require an additional image built on top of that, ZenML will build a second image. Otherwise, the image built from the specified Dockerfile will be used to run the pipeline.
Managing Dependencies
ZenML offers several ways to specify dependencies for your Docker containers:
Python Dependencies
By default, ZenML automatically installs all packages required by your active ZenML stack.
In future versions, if none of the replicate_local_python_environment, pyproject_path, or requirements attributes on DockerSettings are specified, ZenML will try to automatically find a requirements.txt or pyproject.toml file inside your current source root and install packages from the first one it finds. You can disable this behavior by setting disable_automatic_requirements_detection=True. If you already want this automatic detection in current versions of ZenML, set disable_automatic_requirements_detection=False.
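A minimal sketch of opting in to this automatic detection today:

```python
from zenml.config import DockerSettings

# Opt in to automatic requirements.txt / pyproject.toml detection
docker_settings = DockerSettings(disable_automatic_requirements_detection=False)
```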
Replicate Local Environment:
from zenml import pipeline
from zenml.config import DockerSettings

docker_settings = DockerSettings(replicate_local_python_environment=True)

@pipeline(settings={"docker": docker_settings})
def my_pipeline(...):
    ...
This will run pip freeze to get a list of the installed packages in your local Python environment and will install them in the Docker image. This ensures that the same exact dependencies will be installed.
{% hint style="warning" %}
This does not work when you have a local project installed. To install local projects, check out the Install Local Projects section below.
{% endhint %}
Specify a pyproject.toml file:
from zenml import pipeline
from zenml.config import DockerSettings

docker_settings = DockerSettings(pyproject_path="/path/to/pyproject.toml")

@pipeline(settings={"docker": docker_settings})
def my_pipeline(...):
    ...
By default, ZenML will try to export the dependencies specified in the pyproject.toml by running uv export and poetry export. If neither of these commands works for your pyproject.toml file, or you want to customize the command (for example, to install certain extras), you can specify a custom command using the pyproject_export_command attribute. This command must output a list of requirements following the format of the requirements file. The command can contain a {directory} placeholder which will be replaced with the directory in which the pyproject.toml file is stored.
from zenml import pipeline
from zenml.config import DockerSettings

docker_settings = DockerSettings(pyproject_export_command=[
    "uv",
    "export",
    "--extra=train",
    "--format=requirements-txt",
    "--directory={directory}",
])

@pipeline(settings={"docker": docker_settings})
def my_pipeline(...):
    ...
Specify Requirements Directly:
from zenml.config import DockerSettings

docker_settings = DockerSettings(requirements=["torch==1.12.0", "torchvision"])
Use Requirements File:
from zenml.config import DockerSettings

docker_settings = DockerSettings(requirements="/path/to/requirements.txt")
Specify ZenML Integrations:
from zenml.integrations.constants import PYTORCH, EVIDENTLY
from zenml.config import DockerSettings

docker_settings = DockerSettings(required_integrations=[PYTORCH, EVIDENTLY])
Control Stack Requirements: By default, ZenML installs the requirements needed by your active stack. You can disable this behavior if needed:
from zenml.config import DockerSettings

docker_settings = DockerSettings(install_stack_requirements=False)
Install Local Projects: If your code requires the installation of some local code files as a Python package, you can specify a command that installs it as follows:
from zenml.config import DockerSettings

docker_settings = DockerSettings(local_project_install_command="pip install . --no-deps")
{% hint style="warning" %} Installing a local python package only works if your code files are included in the Docker image, so make sure you have
allow_including_files_in_images=True
in your Docker settings. If you want to instead use the code download functionality to avoid building new Docker images for each pipeline run, you can follow this example. {% endhint %}
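Putting the two together, here is a sketch of Docker settings for installing a local project; the install command shown is just one possibility:

```python
from zenml.config import DockerSettings

docker_settings = DockerSettings(
    # Required so that your local code files end up inside the image
    allow_including_files_in_images=True,
    # Install the local project as a package once the files are copied
    local_project_install_command="pip install . --no-deps",
)
```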
Depending on the options specified in your Docker settings, ZenML installs the requirements in the following order (each step optional):
The packages installed in your local Python environment
The packages required by the stack (unless disabled by setting install_stack_requirements=False)
The packages specified via the required_integrations
The packages defined in the pyproject.toml file specified by the pyproject_path attribute
The packages specified via the requirements attribute
System Packages
Specify apt packages to be installed in the Docker image:
from zenml.config import DockerSettings
docker_settings = DockerSettings(apt_packages=["git", "curl", "libsm6", "libxext6"])
Installation Control
Control how packages are installed:
# Use custom installer arguments
docker_settings = DockerSettings(python_package_installer_args={"timeout": 1000})
# Use uv instead of pip
from zenml.config import DockerSettings, PythonPackageInstaller
docker_settings = DockerSettings(python_package_installer=PythonPackageInstaller.UV)
# Or as a string
docker_settings = DockerSettings(python_package_installer="uv")
# Use pip (default)
docker_settings = DockerSettings(python_package_installer=PythonPackageInstaller.PIP)
The available package installers are:
pip: The default Python package installer
uv: A faster alternative to pip
In an upcoming release, ZenML will switch from pip to uv as the default package installer due to its significantly better performance. We encourage you to try it out in advance to prepare for this change:
docker_settings = DockerSettings(python_package_installer=PythonPackageInstaller.UV)
This will help ensure a smooth transition for the entire community. If you encounter any issues, please report them on our GitHub repository.
Full documentation for how uv works with PyTorch can be found on the Astral Docs website here. It covers some of the particular gotchas and details you might need to know.
Private PyPI Repositories
For packages that require authentication from private repositories:
import os
docker_settings = DockerSettings(
requirements=["my-internal-package==0.1.0"],
    environment={
        "PIP_EXTRA_INDEX_URL": f"https://{os.environ.get('PYPI_TOKEN', '')}@my-private-pypi-server.com/{os.environ.get('PYPI_USERNAME', '')}/"
    }
)
Be cautious with handling credentials. Always use secure methods to manage and distribute authentication information within your team. Consider using secrets management tools or environment variables passed securely.
Source Code Management
ZenML determines the root directory of your source files in the following order:
If you've initialized zenml (zenml init) in your current working directory or one of its parent directories, the repository root directory will be used.
Otherwise, the parent directory of the Python file you're executing will be the source root. For example, when running python /path/to/file.py, the source root would be /path/to.
You can specify how the files inside this root directory are handled:
docker_settings = DockerSettings(
# Download files from code repository if available
allow_download_from_code_repository=True,
# If no code repository, upload code to artifact store
allow_download_from_artifact_store=True,
# If neither of the above, include files in the image
allow_including_files_in_images=True
)
ZenML handles your source code in the following order:
If allow_download_from_code_repository is True, your files are inside a registered code repository, and the repository has no local changes, the files will be downloaded from the code repository and not included in the image.
If the previous option is disabled or no code repository without local changes exists for the root directory, ZenML will archive and upload your code to the artifact store if allow_download_from_artifact_store is True.
If both previous options are disabled or not possible, ZenML will include your files in the Docker image if allow_including_files_in_images is enabled. This means a new Docker image has to be built each time you modify one of your code files.
Setting all of the above attributes to False is not recommended and will most likely cause unintended and unanticipated behavior when running your pipelines. If you do this, you're responsible for ensuring that all your files are at the correct paths in the Docker images that will be used to run your pipeline steps.
Controlling Included Files
When downloading files from a code repository, use a .gitignore file to exclude files.
When including files in the image, use a .dockerignore file to exclude files and keep the image smaller:
# Have a file called .dockerignore in your source root directory
# Or explicitly specify a .dockerignore file to use:
docker_settings = DockerSettings(build_config={"dockerignore": "/path/to/.dockerignore"})
Environment Variables
You can set environment variables that will be available in the Docker container:
docker_settings = DockerSettings(
environment={
"PYTHONUNBUFFERED": "1",
"MODEL_DIR": "/models",
"API_KEY": "${GLOBAL_API_KEY}" # Reference environment variables
}
)
Environment variables can reference other environment variables by using the ${VAR_NAME} syntax. ZenML will substitute these at runtime.
Build Reuse and Optimization
ZenML automatically reuses Docker builds when possible to save time and resources:
What is a Pipeline Build?
A pipeline build is an encapsulation of a pipeline and the stack it was run on. It contains the Docker images that were built for the pipeline with all required dependencies from the stack, integrations and the user. Optionally, it also contains the pipeline code.
List all available builds for a pipeline:
zenml pipeline builds list --pipeline_id='startswith:ab53ca'
Create a build manually (useful for pre-building images):
zenml pipeline build --stack vertex-stack my_module.my_pipeline_instance
You can use options to specify the configuration file and the stack to use for the build. Learn more about the build function here.
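The same can be done from Python; here is a sketch, assuming my_pipeline is a pipeline instance (the build method may return None if no image needs to be built for the active stack):

```python
# Build the Docker image(s) for the pipeline ahead of time
build = my_pipeline.build()
if build:
    print(build.id)  # can later be passed when running the pipeline
```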
Reusing Builds
By default, when you run a pipeline, ZenML will check if a build with the same pipeline and stack exists. If it does, it will reuse that build automatically. However, you can also force using a specific build by providing its ID:
pipeline_instance.run(build="<build_id>")
You can also specify this in configuration files:
build: your-build-id-here
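Alternatively, a sketch of setting the build when configuring the pipeline in Python via with_options (the build ID is a placeholder):

```python
my_pipeline = my_pipeline.with_options(build="<build_id>")
my_pipeline()
```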
Specifying a custom build when running a pipeline will not run the code on your client machine but will use the code included in the Docker images of the build. Even if you make local code changes, reusing a build will always execute the code bundled in the Docker image, rather than the local code.
Controlling Image Repository Names
You can control where your Docker image is pushed by specifying a target repository name:
from zenml.config import DockerSettings
docker_settings = DockerSettings(target_repository="my-custom-repo-name")
The repository name will be appended to the registry URI of your container registry stack component. For example, if your container registry URI is gcr.io/my-project and you set target_repository="zenml-pipelines", the full image name would be gcr.io/my-project/zenml-pipelines.
If you don't specify a target repository, the default repository name configured in your container registry stack component settings will be used.
Decoupling Code from Builds
To reuse Docker builds while still using your latest code changes, you need to decouple your code from the build. There are two main approaches:
1. Using the Artifact Store to Upload Code
You can let ZenML use the artifact store to upload your code. This is the default behavior if no code repository is detected and the allow_download_from_artifact_store flag is not set to False in your DockerSettings.
2. Using Code Repositories for Faster Builds
Registering a code repository lets you avoid building images each time you run a pipeline and quickly iterate on your code. When running a pipeline that is part of a local code repository checkout, ZenML can instead build the Docker images without including any of your source files, and download the files inside the container before running your code.
ZenML will automatically figure out which builds match your pipeline and reuse the appropriate build id. Therefore, you do not need to explicitly pass in the build id when you have a clean repository state and a connected git repository.
In order to benefit from the advantages of having a code repository in a project, you need to make sure that the relevant integrations are installed for your ZenML installation. For instance, let's assume you are working on a project with ZenML and one of your team members has already registered a corresponding code repository of type github for it. If you run zenml code-repository list, you will also be able to see this repository. However, in order to fully use this repository, you still need to install the corresponding integration, in this example the github integration.
zenml integration install github
Detecting local code repository checkouts
Once you have registered one or more code repositories, ZenML will check whether the files you use when running a pipeline are tracked inside one of those code repositories. This happens as follows:
First, the source root is computed
Next, ZenML checks whether this source root directory is included in a local checkout of one of the registered code repositories
Tracking code versions for pipeline runs
If a local code repository checkout is detected when running a pipeline, ZenML will store a reference to the current commit for the pipeline run, so you'll be able to know exactly which code was used.
Note that this reference is only tracked if your local checkout is clean (i.e. it does not contain any untracked or uncommitted files). This is to ensure that your pipeline is actually running with the exact code stored at the specific code repository commit.
Preventing Build Reuse
There might be cases where you want to force a new build, even if a suitable existing build is available. You can do this by setting prevent_build_reuse=True:
docker_settings = DockerSettings(prevent_build_reuse=True)
This is useful in scenarios like:
When you've made changes to your image building process that aren't tracked by ZenML
When troubleshooting issues in your Docker image
When you want to ensure your Docker image uses the most up-to-date base images
Tips and Best Practices for Build Reuse
Clean Repository State: The file download is only possible if the local checkout is clean (no untracked or uncommitted files) and the latest commit has been pushed to the remote repository.
Configuration Options: If you want to disable or enforce downloading of files, check the DockerSettings for available options.
Team Collaboration: Using code repositories allows team members to reuse images that colleagues might have built for the same stack, enhancing collaboration efficiency.
Build Selection: ZenML automatically selects matching builds, but you can override this with explicit build IDs for special cases.
Image Build Location
By default, execution environments are created locally using the local Docker client. However, this requires Docker installation and permissions. ZenML offers image builders, a special stack component, allowing users to build and push Docker images in a different specialized image builder environment.
Note that even if you don't configure an image builder in your stack, ZenML still uses the local image builder to retain consistency across all builds. In this case, the image builder environment is the same as the client environment.
You don't need to directly interact with any image builder in your code. As long as the image builder that you want to use is part of your active ZenML stack, it will be used automatically by any component that needs to build container images.
Container User Permissions
By default, Docker containers often run as the root user, which can pose security risks. ZenML allows you to specify a different user to run your containers:
docker_settings = DockerSettings(user="non-root-user")
When you set the user parameter:
The specified user will become the owner of the /app directory, which contains all your code
The container entrypoint will run as this user instead of root
This can help improve security by following the principle of least privilege
Best Practices
Use code repositories to speed up builds and enable team collaboration. This approach is highly recommended for production environments.
Keep dependencies minimal to reduce build times. Only include packages you actually need.
Use fine-grained Docker settings at the step level for conflicting requirements. This prevents dependency conflicts and reduces image sizes.
Use pre-built images for common environments. This can significantly speed up your workflow.
Configure dockerignore files to reduce image size. Large Docker images take longer to build, push, and pull.
Leverage build caching by structuring your Dockerfiles and build processes to maximize cache hits.
Use environment variables for configuration instead of hardcoding values in your images.
Test your Docker builds locally before using them in production pipelines.
Keep your repository clean (no uncommitted changes) when running pipelines to ensure ZenML can correctly track code versions.
Use metadata and labels to help identify and manage your Docker images.
Run containers as non-root users when possible to improve security.
By following these practices, you can optimize your Docker builds in ZenML and create a more efficient workflow.