Synthetic data generation
Generate synthetic data with distilabel to finetune embeddings.
We already have a dataset of technical documentation that was generated previously while we were working on the RAG pipeline. We'll use this dataset to generate synthetic data with `distilabel`. You can inspect the data directly on the Hugging Face dataset page.
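If you'd like to poke at the data locally first, you can load it with the `datasets` library. (The repository ID below is a placeholder; substitute the one from the dataset page.)

```python
from datasets import load_dataset

# Placeholder repo ID -- substitute the dataset linked above.
dataset = load_dataset("your-org/rag-docs-chunks", split="train")

print(dataset)                           # column names and row count
print(dataset[0]["page_content"][:300])  # peek at the first chunk
```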
As you can see, it is made up of some `page_content` (our chunks) as well as the source URL that each chunk was taken from. For embeddings, what we want to do is pair each `page_content` chunk with a question that it can answer. In a pre-LLM world we might have created a new column and manually crafted questions for each chunk; with LLMs, we can instead use the `page_content` itself to generate questions.
Our pipeline to generate synthetic data will look like this:
We'll load the Hugging Face dataset, then use `distilabel` to generate the synthetic data. To finish off, we'll push the newly generated data to a new Hugging Face dataset and also push the same data to our Argilla instance for annotation and inspection.
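In ZenML terms, that is a three-step pipeline. A minimal skeleton, with hypothetical step names and a placeholder Hugging Face repo ID, might look like this:

```python
from datasets import Dataset
from zenml import pipeline, step


@step
def generate_synthetic_queries() -> Dataset:
    """Run the distilabel pipeline and return the generated query pairs."""
    ...  # covered in the next code sketch


@step
def push_to_hf(dataset: Dataset) -> None:
    """Push the newly generated data to a new Hugging Face dataset."""
    dataset.push_to_hub("your-org/synthetic-embedding-queries")  # placeholder repo ID


@step
def push_to_argilla(dataset: Dataset) -> None:
    """Push the same data to Argilla for annotation and inspection."""
    ...  # covered at the end of this section


@pipeline
def generate_synthetic_data():
    dataset = generate_synthetic_queries()
    push_to_hf(dataset)
    push_to_argilla(dataset)
```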
`distilabel` provides a scalable and reliable approach to distilling knowledge from LLMs, either by generating synthetic data or by providing AI feedback with LLMs as judges. We'll be using it for a relatively simple use case, generating queries appropriate to our documentation chunks, but it can be used for a variety of other tasks.
We can easily set up a `distilabel` pipeline in our ZenML step to handle the dataset creation. We'll be using `gpt-4o` as the LLM to generate the synthetic data so you can follow along, but `distilabel` supports a variety of other LLM providers (including Ollama), so you can use whatever you have available.
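A sketch of the core of that step, assuming `distilabel` 1.x; the context string, step names, and dataset repo ID below are placeholders:

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import GenerateSentencePair

# Extra context given to the LLM so it knows what kind of text the chunks contain.
synthetic_generation_context = """
The text is a chunk of technical documentation for an MLOps framework.
Some chunks contain code snippets or log output.
"""

with Pipeline(name="generate_embedding_queries") as pipeline:
    load_dataset = LoadDataFromHub(
        name="load_dataset",
        output_mappings={"page_content": "anchor"},  # rename for finetuning later
    )
    generate_sentence_pair = GenerateSentencePair(
        name="generate_sentence_pair",
        triplet=True,    # generate a negative query alongside the positive one
        action="query",  # generate queries that the chunk could answer
        llm=OpenAILLM(model="gpt-4o"),
        input_batch_size=10,
        context=synthetic_generation_context,
    )
    load_dataset >> generate_sentence_pair

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={"load_dataset": {"repo_id": "your-org/rag-docs-chunks"}},
    )
```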
As you can see, we set up the LLM, create a `distilabel` pipeline, and load the dataset, mapping the `page_content` column so that it becomes `anchor`. (This column renaming will make things a bit easier later when we come to finetuning the embeddings.) Then we generate the synthetic data with the `GenerateSentencePair` step. This creates a query for each chunk in the dataset, so if the chunk was about registering a ZenML stack, the query might be "How do I register a ZenML stack?". It also creates negative queries, which are queries that would be inappropriate for the chunk. We do this so that the embeddings model can learn to distinguish between appropriate and inappropriate queries.
We add some context to the generation process to help the LLM understand the task and the data we're working with. In particular, we explain that some parts of the text are code snippets and logs. We found performance to be better when we added this context.
When this step runs within ZenML, it will handle spinning up the necessary processes to make batched LLM calls to the OpenAI API. This is really useful when working with large datasets. `distilabel` has also implemented a caching mechanism to avoid recomputing results for the same inputs, so in this case you get two layers of caching: one in the `distilabel` pipeline and one in the ZenML orchestrator. This speeds up iteration and saves you money.
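Both cache layers can be controlled explicitly. A minimal sketch, assuming ZenML's `enable_cache` step flag and `distilabel`'s `use_cache` run flag:

```python
from zenml import step


# ZenML layer: if the step's inputs haven't changed, the orchestrator reuses
# the cached output instead of re-running the step (and the LLM calls in it).
@step(enable_cache=True)
def generate_synthetic_queries() -> None:
    # distilabel layer: within the pipeline itself, previously generated
    # batches are reused for identical inputs:
    # distiset = pipeline.run(use_cache=True)
    ...
```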
Once we've let the LLM generate the synthetic data, we'll want to inspect it and make sure it looks good. We'll do this by pushing the data to an Argilla instance. We add a few extra pieces of metadata to the data to make it easier to navigate and inspect within our data annotation tool. These include:
- `parent_section`: the section of the documentation that the chunk is from.
- `token_count`: the number of tokens in the chunk.
- `similarity-positive-negative`: the cosine similarity between the positive and negative queries.
- `similarity-anchor-positive`: the cosine similarity between the anchor and the positive query.
- `similarity-anchor-negative`: the cosine similarity between the anchor and the negative query.
We'll also add the embeddings for the anchor column so that we can use these for retrieval. We'll use the base model (in our case, `Snowflake/snowflake-arctic-embed-large`) to generate the embeddings, and a mapping function to process the dataset and attach all of this metadata.
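A sketch of what that mapping function can look like, assuming the `anchor`/`positive`/`negative` column names from `GenerateSentencePair`; the `source` URL column, the URL parsing for `parent_section`, and the tokenizer-based token count are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-large")


def format_data(batch):
    """Add embeddings, similarity scores, and metadata to a batch of rows."""
    # Embed the three text columns with the base model.
    for column in ["anchor", "positive", "negative"]:
        batch[f"{column}-vector"] = [vec.tolist() for vec in model.encode(batch[column])]

    # Row-wise cosine similarities between two vector columns.
    def row_similarities(col_a, col_b):
        return [
            float(cosine_similarity([a], [b])[0][0])
            for a, b in zip(batch[f"{col_a}-vector"], batch[f"{col_b}-vector"])
        ]

    batch["similarity-positive-negative"] = row_similarities("positive", "negative")
    batch["similarity-anchor-positive"] = row_similarities("anchor", "positive")
    batch["similarity-anchor-negative"] = row_similarities("anchor", "negative")

    # Hypothetical 'source' URL column: take the second-to-last path segment
    # as the documentation section the chunk belongs to.
    batch["parent_section"] = [url.split("/")[-2] for url in batch["source"]]
    # Token counts via the base model's own tokenizer.
    batch["token_count"] = [len(model.tokenizer.tokenize(text)) for text in batch["anchor"]]
    return batch


# Apply over the generated dataset in batches.
dataset = dataset.map(format_data, batched=True, batch_size=32)
```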
The rest of the `push_to_argilla` step just sets up the Argilla dataset and pushes the data to it.
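Condensed, and assuming Argilla's 1.x `FeedbackDataset` API (the dataset name, workspace, and question setup here are illustrative), that looks something like this:

```python
import argilla as rg

rg.init(api_url="https://your-argilla-instance", api_key="...")  # placeholder credentials

argilla_dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="anchor"),
        rg.TextField(name="positive"),
        rg.TextField(name="negative"),
    ],
    questions=[
        rg.TextQuestion(name="positive-query", description="Correct the positive query if needed."),
        rg.TextQuestion(name="negative-query", description="Correct the negative query if needed."),
    ],
    metadata_properties=[
        rg.TermsMetadataProperty(name="parent_section"),
        rg.IntegerMetadataProperty(name="token_count"),
        rg.FloatMetadataProperty(name="similarity-positive-negative"),
        rg.FloatMetadataProperty(name="similarity-anchor-positive"),
        rg.FloatMetadataProperty(name="similarity-anchor-negative"),
    ],
    # The anchor embeddings, so we can search by vector in the UI.
    vectors_settings=[rg.VectorSettings(name="anchor-vector", dimensions=1024)],
)

argilla_dataset.add_records(
    [
        rg.FeedbackRecord(
            fields={k: row[k] for k in ("anchor", "positive", "negative")},
            metadata={
                k: row[k]
                for k in (
                    "parent_section",
                    "token_count",
                    "similarity-positive-negative",
                    "similarity-anchor-positive",
                    "similarity-anchor-negative",
                )
            },
            vectors={"anchor-vector": row["anchor-vector"]},
        )
        for row in dataset  # the mapped dataset from the previous sketch
    ]
)
argilla_dataset.push_to_argilla(name="synthetic-embedding-queries", workspace="admin")
```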
At this point you'd move over to Argilla to view the data and see which examples make sense and which don't. You can update the questions (positive and negative) that were generated by the LLM. If you want, you can do some data cleaning and exploration to improve the data quality, perhaps using the similarity metrics we calculated earlier.
We'll next move to actually finetuning the embeddings, assuming you've done some data exploration and annotation. The code will work even without the annotation, however, since we'll just use the full generated dataset and assume that the quality is good enough.