Enabling GPU-backed hardware
How to ensure your pipelines or steps run on GPU-backed hardware
ZenML offers multiple ways to configure the hardware on which your steps run, from step operator stack components to custom per-step or per-pipeline requirements. For steps or pipelines that must run on GPUs, it is essential to ensure that the environment has the required CUDA tools installed. This section describes what you need to do to actually get the performance boost that running your training on a GPU can give you.
Steps that run on GPU-backed hardware always run in a containerized environment, whether you're using our local Docker orchestrator or a cloud instance of Kubeflow. (See the section on configuring the Docker environment for general context on this and what follows.) For this reason, you need to make two amendments to the Docker settings of the relevant steps:
1. Specify a CUDA-enabled parent image in your DockerSettings
For full details, see the section on the containerization page where we explain how to do this. As an example, if you wanted to use a CUDA-enabled official PyTorch image for your entire pipeline run, you could include the following code:
from zenml.config import DockerSettings
docker_settings = DockerSettings(parent_image="pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime")
2. Add ZenML as an explicit pip requirement
ZenML itself must be installed in the containers that run your pipelines and steps, so you also need to state explicitly that ZenML should be installed. There are several ways to do this; as one example, you could extend the code from above:
docker_settings = DockerSettings(parent_image="pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime", requirements=["zenml==0.20.5", "torchvision"])
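If you want the pinned ZenML version in the container to track the version you have installed locally (presumably a sensible default, since the example above pins an exact version), a small helper can build the requirements list for you. This is a minimal sketch: `with_zenml_pinned` is a hypothetical helper, not part of ZenML's API, and it only understands simple requirement strings such as `zenml>=0.20` or `zenml[server]==0.20.5`.

```python
import re
from typing import List


def _normalize(name: str) -> str:
    # PEP 503 normalization: lowercase and collapse runs of "-", "_" and "." into "-".
    return re.sub(r"[-_.]+", "-", name).lower()


def _requirement_name(requirement: str) -> str:
    # Take the leading distribution name, stopping at extras ("[") or a specifier.
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return _normalize(match.group(1)) if match else ""


def with_zenml_pinned(requirements: List[str], zenml_version: str) -> List[str]:
    """Hypothetical helper: return a copy of `requirements` pinning ZenML.

    Any existing ZenML entry is dropped and replaced by an exact pin, so the
    container installs the same ZenML version that you pass in.
    """
    others = [r for r in requirements if _requirement_name(r) != "zenml"]
    return [f"zenml=={zenml_version}", *others]
```

For example, `with_zenml_pinned(["torchvision"], importlib.metadata.version("zenml"))` could be passed as the `requirements` argument of `DockerSettings`, so the version number does not have to be edited by hand after every upgrade.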
Adding these two settings should be enough to ensure that CUDA is enabled for the steps that require GPU acceleration. These configuration changes are required for the GPU hardware to be properly utilized: if you don't update the settings, your steps might still run, but they will not see any performance boost from the GPU.
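Because a misconfigured step can silently fall back to the CPU, it may help to log a few GPU-related signals at the start of a GPU step. The following is a minimal sketch, assuming PyTorch is the framework in use (as with the PyTorch parent image above); `describe_gpu_environment` is a hypothetical helper, and it only queries PyTorch if PyTorch is installed.

```python
import importlib.util
import os
import shutil
from typing import Dict, Optional


def describe_gpu_environment() -> Dict[str, Optional[object]]:
    """Hypothetical helper: collect signals about GPU visibility in this process."""
    info: Dict[str, Optional[object]] = {
        # nvidia-smi is typically available when the NVIDIA driver is exposed to the container.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # None means the variable is unset, i.e. no device restriction was applied.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_cuda_available": None,
        "torch_cuda_version": None,
    }
    # Only import PyTorch if it is installed, e.g. via a PyTorch parent image.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch_cuda_available"] = torch.cuda.is_available()
        # torch.version.cuda is None for CPU-only builds of PyTorch.
        info["torch_cuda_version"] = torch.version.cuda
    return info
```

Inside a GPU step you could log this dictionary, or raise an error early when `torch_cuda_available` is `False`, so that a missing CUDA setup fails loudly instead of producing a slow CPU run.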
Be careful with the image you choose so that switching between local and remote environments doesn't get muddled. For example, you might have one version of PyTorch installed locally with a particular CUDA version, but when you switch to your remote stack or environment, you might be forced to use a different CUDA version.
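To spot such a mismatch before a run, you could read the CUDA version off the parent image tag and compare it with your local one. This is a sketch under the assumption that the image follows the official `pytorch/pytorch` tag scheme seen above (`<version>-cuda<x.y>-cudnn<n>-<flavor>`); other images name their tags differently, and `parse_pytorch_image` and `same_cuda_minor` are hypothetical helpers.

```python
import re
from typing import NamedTuple, Optional


class ImageCudaInfo(NamedTuple):
    framework_version: str
    cuda_version: str
    cudnn_version: Optional[str]
    flavor: Optional[str]


# Assumed tag shape, e.g. "1.12.1-cuda11.3-cudnn8-runtime".
_TAG_PATTERN = re.compile(
    r"^(?P<framework>[\d.]+)-cuda(?P<cuda>[\d.]+)"
    r"(?:-cudnn(?P<cudnn>\d+))?(?:-(?P<flavor>runtime|devel))?$"
)


def parse_pytorch_image(image: str) -> ImageCudaInfo:
    """Hypothetical helper: extract version info from a pytorch/pytorch image tag."""
    tag = image.rsplit(":", 1)[-1]
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"Unrecognised tag format: {tag!r}")
    return ImageCudaInfo(match["framework"], match["cuda"], match["cudnn"], match["flavor"])


def same_cuda_minor(image: str, local_cuda: str) -> bool:
    """True if the image's CUDA version and `local_cuda` share major.minor."""
    image_cuda = parse_pytorch_image(image).cuda_version
    return image_cuda.split(".")[:2] == local_cuda.split(".")[:2]
```

The local value could come from, for example, `torch.version.cuda` in your local environment. A mismatch is not necessarily fatal, but it is a useful warning sign that local and remote runs may not behave the same way.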
The major cloud providers all offer prebuilt Docker images that fit their hardware; see each provider's documentation for the images available.
Not all of these images are available on DockerHub, so if you are using one of them, please ensure that the orchestrator environment your pipeline runs in has sufficient permissions to pull images from the relevant registry.
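To tell quickly whether a parent image comes from Docker Hub or from another registry (and therefore whether your orchestrator may need extra pull permissions), you can look at the registry host in the image reference. The sketch below is modeled on Docker's reference-parsing convention: the part before the first `/` counts as a registry host only if it contains `.` or `:`, is `localhost`, or contains uppercase letters; otherwise the image resolves to Docker Hub. `registry_host` is a hypothetical helper.

```python
DOCKER_HUB = "docker.io"


def registry_host(image: str) -> str:
    """Hypothetical helper: return the registry host an image reference points at."""
    first, sep, _ = image.partition("/")
    if not sep:
        # A single-component name such as "ubuntu:22.04" resolves to Docker Hub.
        return DOCKER_HUB
    looks_like_host = (
        "." in first or ":" in first or first == "localhost" or first != first.lower()
    )
    return first if looks_like_host else DOCKER_HUB
```

For example, `pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime` resolves to `docker.io`, while an NGC image such as `nvcr.io/nvidia/pytorch:22.08-py3` resolves to `nvcr.io`; any host other than `docker.io` is a hint to check the orchestrator's registry credentials.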