Train with GPUs
Train ZenML pipelines on GPUs and scale out with 🤗 Accelerate.
Need more compute than your laptop can offer? This tutorial shows how to:
- Request GPU resources for individual steps. 
- Build a CUDA‑enabled container image so the GPU is actually visible. 
- Reset the CUDA cache between steps (optional but handy for memory‑heavy jobs). 
- Scale to multiple GPUs or nodes with the 🤗 Accelerate integration. 
1 Request extra resources for a step
If your orchestrator supports it, you can reserve CPU, GPU, and RAM directly on a ZenML @step:
```python
from zenml import step
from zenml.config import ResourceSettings

@step(settings={
    "resources": ResourceSettings(cpu_count=8, gpu_count=2, memory="16GB")
})
def training_step(...):
    ...  # heavy training logic
```

👉 Check your orchestrator's docs; some (e.g. SkyPilot) expose dedicated settings instead of ResourceSettings.
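Wiring such a step into a pipeline looks like this (a minimal sketch; the pipeline name and dataset URI are illustrative):

```python
from zenml import pipeline, step
from zenml.config import ResourceSettings

@step(settings={
    "resources": ResourceSettings(cpu_count=8, gpu_count=2, memory="16GB")
})
def training_step(dataset_uri: str) -> None:
    ...  # heavy training logic

@pipeline
def gpu_training_pipeline():
    # The orchestrator reserves 8 CPUs, 2 GPUs and 16 GB of RAM for this
    # step only; other steps keep the default resources.
    training_step(dataset_uri="s3://my-bucket/train")  # hypothetical URI
```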
2 Build a CUDA‑enabled container image
Requesting a GPU is not enough—your Docker image needs the CUDA runtime, too.
```python
from zenml import pipeline
from zenml.config import DockerSettings

docker = DockerSettings(
    parent_image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    requirements=["zenml", "torchvision"]
)

@pipeline(settings={"docker": docker})
def my_gpu_pipeline(...):
    ...
```

Use the official CUDA images for TensorFlow/PyTorch or the pre-built ones offered by AWS, GCP, or Azure.
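To verify that the GPU is actually visible at runtime, a small sanity-check step helps (a sketch, assuming PyTorch is installed in the image; the check_gpu name is illustrative):

```python
import torch
from zenml import step

@step
def check_gpu() -> bool:
    # True only if the image ships the CUDA runtime *and* the
    # orchestrator scheduled this step on a GPU node.
    available = torch.cuda.is_available()
    if available:
        print(f"Found {torch.cuda.device_count()} GPU(s): "
              f"{torch.cuda.get_device_name(0)}")
    return available
```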
Optional – clear the CUDA cache
If you need to squeeze every last MB out of the GPU, consider clearing the CUDA cache at the beginning of each step:
```python
import gc

import torch

def cleanup_memory():
    while gc.collect():
        torch.cuda.empty_cache()
```

Call cleanup_memory() at the start of your GPU steps.
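For instance (a minimal sketch; the step name and body are illustrative):

```python
import gc

import torch
from zenml import step

def cleanup_memory() -> None:
    # gc.collect() returns the number of unreachable objects found;
    # loop until a pass frees nothing, emptying the CUDA cache each time.
    while gc.collect():
        torch.cuda.empty_cache()

@step
def training_step() -> None:
    cleanup_memory()  # start with a clean GPU memory pool
    ...  # memory-heavy training logic
```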
3 Multi‑GPU / multi‑node training with 🤗 Accelerate
ZenML integrates with the Hugging Face Accelerate launcher. Wrap your training step with run_with_accelerate to fan it out over multiple GPUs or machines:
```python
from zenml import step, pipeline
from zenml.integrations.huggingface.steps import run_with_accelerate

@run_with_accelerate(num_processes=4, multi_gpu=True)
@step
def training_step(...):
    ...  # your distributed training code

@pipeline
def dist_pipeline(...):
    training_step(...)
```

Common arguments:
- num_processes: total number of processes to launch (one per GPU)
- multi_gpu=True: enable multi-GPU mode
- cpu=True: force CPU training
- mixed_precision: "fp16", "bf16", or "no"
Accelerate‑decorated steps must be called with keyword arguments and cannot be wrapped a second time inside the pipeline definition.
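Putting the pieces together, a minimal sketch with mixed precision enabled (step and pipeline names are illustrative; note the keyword-only call inside the pipeline):

```python
from zenml import pipeline, step
from zenml.integrations.huggingface.steps import run_with_accelerate

@run_with_accelerate(num_processes=4, multi_gpu=True, mixed_precision="bf16")
@step
def training_step(train_uri: str, epochs: int) -> None:
    ...  # distributed training code

@pipeline
def dist_pipeline():
    # Accelerate-decorated steps must be called with keyword arguments.
    training_step(train_uri="s3://my-bucket/train", epochs=3)  # hypothetical values
```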
Prepare the container
Use the same CUDA image as above and add Accelerate to the requirements:
```python
DockerSettings(
    parent_image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    requirements=["zenml", "accelerate", "torchvision"]
)
```

4 Troubleshooting & Tips
| Symptom | Fix |
| --- | --- |
| GPU is unused | Verify the CUDA toolkit inside the container (nvcc --version) and check driver compatibility |
| OOM even after cache reset | Reduce the batch size, use gradient accumulation, or request more GPU memory |
| Accelerate hangs | Make sure ports are open between nodes; pass main_process_port explicitly |
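For the last point, a sketch, assuming run_with_accelerate forwards accelerate launch options such as main_process_port:

```python
from zenml import step
from zenml.integrations.huggingface.steps import run_with_accelerate

@run_with_accelerate(
    num_processes=8,
    multi_gpu=True,
    main_process_port=29500,  # assumed option; port must be reachable from every node
)
@step
def training_step() -> None:
    ...  # distributed training code
```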
Need help? Join us on Slack.