Improve retrieval by finetuning embeddings
Finetune embeddings on custom synthetic data to improve retrieval performance.
We previously learned how to build a production-ready RAG pipeline with ZenML. In this section, we will explore how to optimize and maintain your embedding models through synthetic data generation and human feedback. So far, we've been using off-the-shelf embeddings, which provide a good baseline and decent performance on standard tasks. However, you can often significantly improve performance by finetuning embeddings on your own domain-specific data.
Our RAG pipeline uses a retrieval-based approach, where it first retrieves the most relevant documents from our vector database, and then uses a language model to generate a response based on those documents. By finetuning our embeddings on a dataset of technical documentation similar to our target domain, we can improve the retrieval step and overall performance of the RAG pipeline.
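To make the retrieval step concrete, here is a minimal sketch of embedding-based retrieval with the sentence-transformers library. The model name, documents, and query are placeholders rather than the guide's actual configuration; swapping in a finetuned model would be a one-line change.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder base model; after finetuning you would load your own
# model directory here instead.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Stand-in documentation chunks; in the real pipeline these live in
# the vector database.
docs = [
    "ZenML pipelines are defined by decorating a function with @pipeline.",
    "Steps are the building blocks of a ZenML pipeline.",
    "Artifacts are automatically versioned and stored by ZenML.",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

query = "How do I define a pipeline in ZenML?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 most similar chunks by cosine similarity; these
# are then passed to the LLM as context for generation.
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {docs[hit['corpus_id']]}")
```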
The work of finetuning embeddings based on synthetic data and human feedback is a multi-step process. We'll go through the following steps:

- generating synthetic queries for our documentation chunks with distilabel
- reviewing and cleaning the generated data with human feedback in Argilla
- finetuning the embedding model on the curated dataset (a minimal sketch follows this list)
- evaluating the finetuned embeddings against the off-the-shelf baseline
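As a preview of the finetuning step itself, here is a minimal sketch using the Sentence Transformers v3 trainer with a multiple-negatives ranking loss. The base model, training pairs, and output path are illustrative assumptions, not the repository's exact configuration; in practice the (anchor, positive) pairs come out of the synthetic generation and review steps above.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Placeholder base model and toy training pairs.
model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataset = Dataset.from_dict({
    "anchor": [
        "How do I define a pipeline in ZenML?",
        "Where are artifacts stored?",
    ],
    "positive": [
        "ZenML pipelines are defined by decorating a function with @pipeline.",
        "Artifacts are automatically versioned and stored by ZenML.",
    ],
})

# In-batch negatives: every other positive in a batch acts as a
# negative example for a given anchor.
loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
model.save_pretrained("finetuned-embeddings")  # hypothetical output path
```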
Besides ZenML, we will do this using two open source libraries: argilla and distilabel. Both libraries focus on optimizing model outputs through improving data quality; however, each takes a different approach to the same problem. distilabel provides a scalable and reliable approach to distilling knowledge from LLMs by generating synthetic data or providing AI feedback with LLMs as judges. argilla enables AI engineers and domain experts to collaborate on data projects by letting them organize and explore data within an interactive and engaging UI. Both libraries can be used individually, but they work better together. We'll showcase their use via ZenML pipelines.
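To give a flavor of how the two libraries slot together, here is a hedged sketch: distilabel's GenerateSentencePair task turns documentation chunks into synthetic (anchor, positive, negative) query triplets, which are then logged to Argilla for human review. It assumes the distilabel 1.x and Argilla 2.x SDKs (import paths vary slightly across releases), an OpenAI API key for the generation LLM, and a running Argilla instance; the model name, dataset name, and URLs are placeholders, not the guide's actual configuration.

```python
import argilla as rg
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import GenerateSentencePair

# Stand-in documentation chunks; the real pipeline loads these from
# the RAG pipeline's document store.
chunks = [
    "ZenML pipelines are defined by decorating a function with @pipeline.",
    "Artifacts are automatically versioned and stored by ZenML.",
]

with Pipeline(name="generate-synthetic-queries") as pipeline:
    load = LoadDataFromDicts(data=[{"anchor": chunk} for chunk in chunks])
    generate = GenerateSentencePair(
        triplet=True,    # emit (anchor, positive, negative) triplets
        action="query",  # generate queries that the anchor text answers
        llm=OpenAILLM(model="gpt-4o-mini"),  # requires OPENAI_API_KEY
    )
    load >> generate

distiset = pipeline.run(use_cache=False)

# Push the generated triplets to Argilla so domain experts can review
# them before finetuning.
client = rg.Argilla(api_url="http://localhost:6900", api_key="argilla.apikey")
settings = rg.Settings(
    fields=[
        rg.TextField(name="anchor"),
        rg.TextField(name="positive"),
        rg.TextField(name="negative"),
    ],
    questions=[rg.LabelQuestion(name="quality", labels=["good", "bad"])],
)
dataset = rg.Dataset(name="synthetic-queries", settings=settings, client=client)
dataset.create()
dataset.records.log(
    [
        {
            "anchor": row["anchor"],
            "positive": row["positive"],
            "negative": row["negative"],
        }
        for row in distiset["default"]["train"]
    ]
)
```

Once annotators have marked triplets as good or bad in the Argilla UI, the accepted records become the training set for the finetuning step sketched earlier.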
To follow along with the example explained in this guide, please follow the instructions in the llm-complete-guide repository, where the full code is also available. This section on embeddings finetuning can be run locally or on cloud compute, as you prefer.