Train with GPUs
Train ZenML pipelines on GPUs and scale out with 🤗 Accelerate.
Need more compute than your laptop can offer? This tutorial shows how to:
Request GPU resources for individual steps.
Build a CUDA‑enabled container image so the GPU is actually visible.
Reset the CUDA cache between steps (optional but handy for memory‑heavy jobs).
Scale to multiple GPUs or nodes with the 🤗 Accelerate integration.
1 Request extra resources for a step
If your orchestrator supports it, you can reserve CPU, GPU and RAM directly on a ZenML @step:
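A minimal sketch using ResourceSettings (the resource amounts below are placeholders, adjust them to your workload):

```python
from zenml import step
from zenml.config import ResourceSettings

# Reserve 8 CPUs, 2 GPUs and 32 GB of RAM for this step,
# assuming the active orchestrator honours ResourceSettings.
@step(settings={"resources": ResourceSettings(cpu_count=8, gpu_count=2, memory="32GB")})
def training_step() -> None:
    ...  # GPU-heavy training code
```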
👉 Check your orchestrator's docs; some (e.g. SkyPilot) expose dedicated settings instead of ResourceSettings.
2 Build a CUDA‑enabled container image
Requesting a GPU is not enough—your Docker image needs the CUDA runtime, too.
Use the official CUDA images for TensorFlow/PyTorch or the pre‑built ones offered by AWS, GCP or Azure.
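A minimal sketch of pointing the pipeline's Docker build at a CUDA base image via DockerSettings (the parent image tag is only an example; pick one that matches your framework and the CUDA/driver versions on your cluster):

```python
from zenml import pipeline
from zenml.config import DockerSettings

# Build the pipeline image on top of a CUDA-enabled base image.
docker_settings = DockerSettings(
    parent_image="pytorch/pytorch:2.1.2-cuda11.8-cudnn8-runtime",  # example tag
    requirements=["torchvision"],  # extra Python packages your steps need
)

@pipeline(settings={"docker": docker_settings})
def gpu_training_pipeline():
    ...
```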
Optional – clear the CUDA cache
If you need to squeeze every last MB out of the GPU, consider clearing the CUDA cache at the beginning of each step:
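A minimal cleanup_memory helper, assuming PyTorch is the framework in use:

```python
import gc

import torch

def cleanup_memory() -> None:
    # Run Python garbage collection until nothing is left to collect,
    # then release unused cached CUDA memory back to the driver.
    while gc.collect():
        pass
    torch.cuda.empty_cache()
```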
Call cleanup_memory() at the start of your GPU steps.
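For example:

```python
from zenml import step

@step
def gpu_training_step() -> None:
    cleanup_memory()  # helper defined as above; start with an empty CUDA cache
    ...  # memory-heavy training code
```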
3 Multi‑GPU / multi‑node training with 🤗 Accelerate
ZenML integrates with the Hugging Face Accelerate launcher. Wrap your training step with run_with_accelerate to fan it out over multiple GPUs or machines:
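A minimal sketch (the step and pipeline names are placeholders; run_with_accelerate comes from ZenML's Hugging Face integration):

```python
from zenml import pipeline, step
from zenml.integrations.huggingface.steps import run_with_accelerate

# Launch this step via `accelerate launch`, one process per GPU.
@run_with_accelerate(num_processes=4, multi_gpu=True)
@step
def training_step(train_args: dict) -> None:
    ...  # your Accelerate-aware training loop

@pipeline
def training_pipeline(train_args: dict):
    # Accelerate-decorated steps must be called with keyword arguments.
    training_step(train_args=train_args)
```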
Common arguments:
num_processes: total number of processes to launch (one per GPU)
multi_gpu=True: enable multi‑GPU mode
cpu=True: force CPU training
mixed_precision: "fp16" / "bf16" / "no"
Accelerate‑decorated steps must be called with keyword arguments and cannot be wrapped a second time inside the pipeline definition.
Prepare the container
Use the same CUDA image as above and add Accelerate to the requirements:
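For example, extending the DockerSettings sketch from above (the image tag is again illustrative):

```python
from zenml.config import DockerSettings

docker_settings = DockerSettings(
    parent_image="pytorch/pytorch:2.1.2-cuda11.8-cudnn8-runtime",  # example CUDA image
    requirements=["accelerate", "torchvision"],  # Accelerate plus your training deps
)
```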
4 Troubleshooting & Tips
GPU is unused: verify the CUDA toolkit inside the container (nvcc --version) and check driver compatibility.
OOM even after cache reset: reduce the batch size, use gradient accumulation, or request more GPU memory.
Accelerate hangs: make sure ports are open between nodes and pass main_process_port explicitly.
Need help? Join us on Slack.