# Distributed Training with 🤗 Accelerate

Run distributed training with Hugging Face's Accelerate library in ZenML pipelines.
There are several reasons why you might want to scale your machine learning pipelines to utilize distributed training, such as leveraging multiple GPUs or training across multiple nodes. ZenML now integrates with Hugging Face's Accelerate library to make this process seamless and efficient.
## Use 🤗 Accelerate in your steps
Some steps in your machine learning pipeline, particularly training steps, can benefit from distributed execution. You can now use the `run_with_accelerate` decorator to enable this:
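A minimal sketch of what this looks like, assuming the decorator is exposed under `zenml.integrations.huggingface.steps` (the step body and parameter names are placeholders):

```python
from zenml import step, pipeline
from zenml.integrations.huggingface.steps import run_with_accelerate


@run_with_accelerate(num_processes=4, multi_gpu=True)
@step
def training_step(num_epochs: int, learning_rate: float) -> None:
    # Your Accelerate-compatible training loop goes here.
    ...


@pipeline
def training_pipeline(num_epochs: int = 3, learning_rate: float = 1e-4):
    # Accelerated steps must be called with keyword arguments.
    training_step(num_epochs=num_epochs, learning_rate=learning_rate)
```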
The `run_with_accelerate` decorator wraps your step, enabling it to run with Accelerate's distributed training capabilities. It accepts the same arguments that are available to the `accelerate launch` CLI command.
For a complete list of available arguments and more details, refer to the Accelerate CLI documentation.
### Configuration
The `run_with_accelerate` decorator accepts various arguments to configure your distributed training environment. Some common arguments include:

* `num_processes`: the number of processes to use for distributed training.
* `cpu`: whether to force training on CPU.
* `multi_gpu`: whether to launch distributed GPU training.
* `mixed_precision`: the mixed precision training mode (`'no'`, `'fp16'`, or `'bf16'`).
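For example, a step could be configured to run with two processes, multi-GPU launching, and fp16 mixed precision (a sketch; the values are illustrative and depend on your hardware):

```python
from zenml import step
from zenml.integrations.huggingface.steps import run_with_accelerate


@run_with_accelerate(num_processes=2, multi_gpu=True, mixed_precision="fp16")
@step
def distributed_training_step(num_epochs: int) -> None:
    # Launched as two processes (one per GPU) with fp16 mixed precision.
    ...
```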
### Important Usage Notes
* The `run_with_accelerate` decorator can only be applied directly to steps using the `@` syntax. Using it as a function inside the pipeline definition is not allowed.
* Accelerated steps do not support positional arguments. Use keyword arguments when calling your steps.
* If `run_with_accelerate` is misused, it will raise a `RuntimeError` with a helpful message explaining the correct usage.
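A hypothetical sketch of what is and is not allowed (`my_step` and its parameter are placeholders):

```python
from zenml import step, pipeline
from zenml.integrations.huggingface.steps import run_with_accelerate


# Allowed: apply the decorator at step definition time using the '@' syntax.
@run_with_accelerate(num_processes=2)
@step
def my_step(batch_size: int) -> None:
    ...


@pipeline
def my_pipeline():
    # Allowed: call the accelerated step with keyword arguments only.
    my_step(batch_size=32)

    # Not allowed: positional arguments, e.g. my_step(32), or wrapping a step
    # with run_with_accelerate(...) inside the pipeline definition; both are
    # rejected with a RuntimeError.
```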
To see a full example where Accelerate is used within a ZenML pipeline, check out our `llm-lora-finetuning` project, which leverages distributed training while finetuning an LLM.
## Ensure your container is Accelerate-ready
To run steps with Accelerate, the necessary dependencies must be installed in the execution environment. This section shows how to configure your environment to use Accelerate effectively.
Note that these configuration changes are required for Accelerate to function properly. If you don't update the settings, your steps might run, but they will not leverage distributed training capabilities.
All steps using Accelerate will be executed within a containerized environment. Therefore, you need to make two amendments to your Docker settings for the relevant steps:
### 1. Specify a CUDA-enabled parent image in your `DockerSettings`
For complete details, refer to the containerization page. Here's an example using a CUDA-enabled PyTorch image:
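The snippet below is a sketch; the image tag is only an example and should be chosen to match the CUDA and PyTorch versions you need:

```python
from zenml import pipeline
from zenml.config import DockerSettings

# Use a CUDA-enabled parent image so the container can access the GPUs.
docker_settings = DockerSettings(
    parent_image="pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime"
)


@pipeline(settings={"docker": docker_settings})
def training_pipeline():
    ...
```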
### 2. Add Accelerate as an explicit pip requirement
Ensure that Accelerate is installed in your container:
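Continuing the sketch above, the `requirements` field of `DockerSettings` can be used to install Accelerate in the container (the extra package listed here is just an example of what your training code might also need):

```python
from zenml import pipeline
from zenml.config import DockerSettings

docker_settings = DockerSettings(
    parent_image="pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime",
    requirements=["accelerate", "torchvision"],
)


@pipeline(settings={"docker": docker_settings})
def training_pipeline():
    ...
```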
## Train across multiple GPUs
ZenML's Accelerate integration supports training your models with multiple GPUs on a single node or across multiple nodes. This is particularly useful for large datasets or complex models that benefit from parallelization.
In practice, using Accelerate with multiple GPUs involves:
* Wrapping your training step with the `run_with_accelerate` decorator in your pipeline definition
* Configuring the appropriate Accelerate arguments (e.g., `num_processes`, `multi_gpu`)
* Ensuring your training code is compatible with distributed training (Accelerate handles most of this automatically)
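Putting these pieces together, a multi-GPU finetuning pipeline might look like the following sketch (the parent image tag, number of processes, and parameter names are placeholders to adapt to your setup):

```python
from zenml import step, pipeline
from zenml.config import DockerSettings
from zenml.integrations.huggingface.steps import run_with_accelerate

docker_settings = DockerSettings(
    parent_image="pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime",
    requirements=["accelerate"],
)


@run_with_accelerate(num_processes=4, multi_gpu=True, mixed_precision="bf16")
@step
def finetune_step(num_epochs: int, learning_rate: float) -> None:
    # Accelerate spawns one process per GPU and handles device placement,
    # gradient synchronization, and mixed precision for your training loop.
    ...


@pipeline(settings={"docker": docker_settings})
def finetuning_pipeline(num_epochs: int = 1, learning_rate: float = 2e-5):
    finetune_step(num_epochs=num_epochs, learning_rate=learning_rate)
```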
If you're new to distributed training or encountering issues, please connect with us on Slack and we'll be happy to assist you.
By leveraging the Accelerate integration in ZenML, you can easily scale your training processes and make the most of your available hardware resources, all while maintaining the structure and benefits of your ZenML pipelines.