Finetuning LLMs with ZenML

Finetune LLMs for specific tasks or to improve performance and reduce costs.

So far in our LLMOps journey we've learned how to use RAG with ZenML, how to evaluate our RAG systems, how to use reranking to improve retrieval, and how to finetune embeddings to support and improve our RAG systems. In this section we turn to finetuning the LLM itself. Up to now we've been using hosted APIs like OpenAI and Anthropic, but there are scenarios where it makes sense to finetune an LLM on your own data. We'll get into those scenarios, and into how to finetune an LLM, in the pages that follow.

While RAG systems are excellent at retrieving and leveraging external knowledge, there are scenarios where finetuning an LLM can provide additional benefits even with a RAG system in place. For example, you might want to finetune an LLM to improve its ability to generate responses in a specific format, to better understand domain-specific terminology and concepts that appear in your retrieved content, or to reduce the length of prompts needed for consistent outputs. Finetuning can also help when you need the model to follow very specific patterns or protocols that would be cumbersome to encode in prompts, or when you want to optimize for latency by reducing the context window needed for good performance.

This guide is slightly different from the others in that it doesn't follow a single use case as the model for finetuning LLMs. The actual steps needed to finetune an LLM are not that complex; the important part is understanding when you might need to finetune an LLM, how to evaluate the results of what you do, and how to make decisions around what data to use.
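To make "not that complex" concrete, here is a minimal sketch of what a LoRA finetuning run looks like with Hugging Face transformers and peft. The base model, dataset, and hyperparameters are illustrative placeholders, not the repository's actual choices; see the llm-lora-finetuning repository for the full, tested pipeline.

```python
# Minimal LoRA finetuning sketch with transformers + peft.
# Model name, dataset, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "microsoft/phi-2"  # assumption: any small causal LM works here
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with low-rank adapters; only the adapter weights train.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Tokenize an instruction dataset (replace with your own domain data).
dataset = load_dataset("databricks/databricks-dolly-15k", split="train[:1000]")

def tokenize(example):
    text = f"{example['instruction']}\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        learning_rate=2e-4,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("finetuned-lora-adapter")  # saves adapter weights only
```

The training loop itself is a handful of lines; the decisions that determine whether it was worth doing (data selection, evaluation, deployment) are what the rest of this guide focuses on.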

To follow along with the example explained in this guide, follow the instructions in the llm-lora-finetuning repository, where the full code is also available. The code can be run locally (if you have a GPU attached to your machine) or on cloud compute, as you prefer.
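As a rough orientation before diving into the repository, here is a hedged sketch of how a finetuning flow can be wired together as a ZenML pipeline using the `@step` and `@pipeline` decorators. The step names and bodies below are illustrative placeholders, not the repository's actual code.

```python
# Illustrative ZenML pipeline wiring for an LLM finetuning flow.
# Step names and bodies are placeholders; the llm-lora-finetuning
# repository contains the real implementation.
from zenml import pipeline, step


@step
def prepare_data() -> str:
    # Placeholder: load, clean, and tokenize your instruction data.
    return "data/train_tokenized"


@step
def finetune(dataset_path: str) -> str:
    # Placeholder: run the LoRA training loop and save the adapter.
    return "finetuned-lora-adapter"


@step
def evaluate(adapter_path: str) -> dict:
    # Placeholder: score the finetuned model on a held-out set.
    return {"eval_score": 0.0}


@pipeline
def llm_finetuning_pipeline():
    dataset_path = prepare_data()
    adapter_path = finetune(dataset_path)
    evaluate(adapter_path)


if __name__ == "__main__":
    llm_finetuning_pipeline()
```

Structuring the work this way means each stage (data preparation, training, evaluation) is tracked, cached, and reproducible, which matters more for finetuning than for most workloads given how expensive the training step is to rerun.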
