Train with GPUs
Ensuring your pipelines or steps run on GPU-backed hardware.
There are several reasons why you may want to scale your machine learning pipelines to the cloud, such as utilizing more powerful hardware or distributing tasks across multiple nodes. In order to achieve this with ZenML you'll need to run your steps on GPU-backed hardware using ResourceSettings
to allocate greater resources on an orchestrator node and/or make some adjustments to the container environment.
Specify resource requirements for steps
Some steps of your machine learning pipeline might be more resource-intensive and require special hardware to execute. In such cases, you can specify the required resources for steps as follows:
If the underlying orchestrator in your stack then supports specifying resources, this setting will attempt to secure these resources. Some orchestrators (like the Skypilot orchestrator) do not support ResourceSettings
directly, but rather use their Orchestrator
specific settings to achieve the same effect:
Please refer to the source code and documentation of each orchestrator to find out which orchestrator supports specifying resources in what way.
If you're using an orchestrator which does not support this feature or its underlying infrastructure does not cover your requirements, you can also take a look at step operators which allow you to execute individual steps of your pipeline in environments independent of your orchestrator.
Ensure your container is CUDA-enabled
To run steps or pipelines on GPUs, it's crucial to have the necessary CUDA tools installed in the environment. This section will guide you on how to configure your environment to utilize GPU capabilities effectively.
Note that these configuration changes are required for the GPU hardware to be properly utilized. If you don't update the settings, your steps might run, but they will not see any boost in performance from the custom hardware.
All steps running on GPU-backed hardware will be executed within a containerized environment, whether you're using the local Docker orchestrator or a cloud instance of Kubeflow. Therefore, you need to make two amendments to your Docker settings for the relevant steps:
1. Specify a CUDA-enabled parent image in your DockerSettings
DockerSettings
For complete details, refer to the containerization page that explains how to do this. As an example, if you want to use the latest CUDA-enabled official PyTorch image for your entire pipeline run, you can include the following code:
For TensorFlow, you might use the tensorflow/tensorflow:latest-gpu
image, as detailed in the official TensorFlow documentation or their DockerHub overview.
2. Add ZenML as an explicit pip requirement
ZenML requires that ZenML itself be installed for the containers running your pipelines and steps. Therefore, you need to explicitly state that ZenML should be installed. There are several ways to specify this, but as an example, you can update the code from above as follows:
Adding these two extra settings options will ensure that CUDA is enabled for the specific steps that require GPU acceleration. Be cautious when choosing the image to avoid confusion when switching between local and remote environments. For example, you might have one version of PyTorch installed locally with a particular CUDA version, but when you switch to your remote stack or environment, you might be forced to use a different CUDA version.
The core cloud operators offer prebuilt Docker images that fit with their hardware. You can find more information on them here:
Not all of these images are available on DockerHub, so ensure that the orchestrator environment your pipeline runs in has sufficient permissions to pull images from registries if you are using one of those.
Reset the CUDA cache in between steps
Your use case will determine whether this is necessary or makes sense to do, but we have seen that resetting the CUDA cache in between steps can help avoid issues with the GPU cache. This is particularly necessary if your training jobs are pushing the boundaries of the GPU cache. Doing so is simple; just use a helper function to reset the cache at the beginning of any GPU-enabled steps. For example, something as simple as this might suffice:
You can then call this function at the beginning of your GPU-enabled steps:
Note that resetting the memory cache will potentially affect others using the same GPU, so use this judiciously.
Train across multiple GPUs
ZenML supports training your models with multiple GPUs on a single node. This is useful if you have a large dataset and want to train your model in parallel. The most important thing that you'll have to handle is preventing multiple ZenML instances from being spawned as you split the work among multiple GPUs.
In practice this will probably involve:
creating a script / Python function that contains the logic of training your model (with the specification that this should run in parallel across multiple GPUs)
calling that script / external function from within the step, possibly with some wrapper or helper code to dynamically configure or update the external script function
We're aware that this is not the most elegant solution and we're at work to implement a better option with some inbuilt support for this task. If this is something you're struggling with and need support getting the step code working, please do connect with us on Slack and we'll do our best to help you out.
Last updated