Evaluating finetuned embeddings
Evaluate finetuned embeddings and compare to original base embeddings.
Now that we've finetuned our embeddings, we can evaluate them and compare them to the base embeddings. All of the data is already saved and versioned, and we will reuse the same MatryoshkaLoss function for evaluation.
In code, our evaluation steps are easy to comprehend. The base model evaluation, for example, amounts to loading the base model, running the retrieval evaluation across the Matryoshka dimensions, and recording the scores.
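A minimal sketch of what such a step could look like, assuming a sentence-transformers InformationRetrievalEvaluator, a test split with hypothetical anchor and positive columns, placeholder values for the base model ID and Matryoshka dimensions, and ZenML's log_model_metadata helper (the actual project code may differ):

```python
from typing import Annotated, Dict

from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from zenml import log_model_metadata, step

# Placeholder values; the pipeline defines its own base model ID and dimensions.
BASE_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
MATRYOSHKA_DIMS = [384, 256, 128, 64]


@step
def evaluate_base_model(
    test_dataset: Dataset,
) -> Annotated[Dict[str, float], "base_model_evaluation_results"]:
    """Evaluate the non-finetuned base model on the test split."""
    # Build an information-retrieval evaluation set: each anchor (query)
    # should retrieve its own positive (document chunk).
    queries = {str(i): row["anchor"] for i, row in enumerate(test_dataset)}
    corpus = {str(i): row["positive"] for i, row in enumerate(test_dataset)}
    relevant_docs = {qid: {qid} for qid in queries}

    results: Dict[str, float] = {}
    for dim in MATRYOSHKA_DIMS:
        # Truncate embeddings to the current Matryoshka dimension
        # (truncate_dim requires a recent sentence-transformers release).
        model = SentenceTransformer(BASE_MODEL_ID, truncate_dim=dim)
        evaluator = InformationRetrievalEvaluator(
            queries=queries,
            corpus=corpus,
            relevant_docs=relevant_docs,
            name=f"dim_{dim}",
        )
        scores = evaluator(model)
        # Recent sentence-transformers versions return a dict of metrics.
        results[f"dim_{dim}_cosine_ndcg@10"] = float(
            scores[f"dim_{dim}_cosine_ndcg@10"]
        )

    # Attach the scores to the model version so they show up in the
    # Model Control Plane next to the finetuned model's results.
    log_model_metadata(metadata={"base_model_eval": results})
    return results
```

The finetuned model evaluation step would look the same apart from loading the finetuned model instead of the base one.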
Visualizing results
We can see that our finetuned embeddings have improved the recall of our retrieval system across all of the dimensions, but the results are still not amazing. In a production setting, we would likely want to focus on improving the data used for the embeddings training. In particular, we could consider stripping out some of the log output from the documentation, and perhaps omitting some pages that offer low signal for the retrieval task. This embeddings finetuning was run purely on the full set of synthetic data generated by distilabel and gpt-4o, so we wouldn't necessarily expect to see huge improvements out of the box, especially when the underlying data chunks are complex and contain multiple topics.
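To eyeball the comparison, a small helper along these lines could plot the per-dimension scores side by side. The dictionaries passed in at the bottom are dummy placeholders standing in for the metrics returned by the two evaluation steps:

```python
from typing import Dict

import matplotlib.pyplot as plt


def plot_eval_comparison(
    base: Dict[int, float],
    finetuned: Dict[int, float],
    metric_name: str = "retrieval score",
) -> None:
    """Bar chart of base vs. finetuned scores per embedding dimension."""
    dims = sorted(base, reverse=True)
    positions = range(len(dims))
    width = 0.35

    plt.bar(
        [p - width / 2 for p in positions],
        [base[d] for d in dims],
        width,
        label="base",
    )
    plt.bar(
        [p + width / 2 for p in positions],
        [finetuned[d] for d in dims],
        width,
        label="finetuned",
    )
    plt.xticks(list(positions), [str(d) for d in dims])
    plt.xlabel("embedding dimension")
    plt.ylabel(metric_name)
    plt.legend()
    plt.show()


# Dummy values purely for illustration; real numbers come from the
# evaluation steps described above.
plot_eval_comparison(
    base={384: 0.61, 256: 0.59, 128: 0.55, 64: 0.48},
    finetuned={384: 0.70, 256: 0.68, 128: 0.64, 64: 0.57},
    metric_name="recall",
)
```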
Model Control Plane as unified interface
Once all our pipelines are finished running, the best place to inspect our results as well as the artifacts and models we generated is the Model Control Plane.
The interface is split into sections that correspond to:
the artifacts generated by our steps
the models generated by our steps
the metadata logged by our steps
(potentially) any deployments of models made, though we haven't used deployments in this guide so far
any pipeline runs associated with this 'Model'
We can easily see the latest artifact and technical model versions, compare the actual values of our evals, and inspect the hardware and hyperparameters used for training.
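The same information is also available programmatically through the ZenML Python client. A quick sketch, where "finetuned-embeddings" is a placeholder for whatever model name your pipelines registered (exact attributes may vary slightly between ZenML versions):

```python
from zenml.client import Client
from zenml.enums import ModelStages

client = Client()

# "finetuned-embeddings" is a placeholder; use the model name your
# pipelines registered in the Model Control Plane.
model_version = client.get_model_version(
    "finetuned-embeddings", ModelStages.LATEST
)

print(f"Model version: {model_version.name}")

# Metadata logged by the steps (e.g. base vs. finetuned eval scores)
# is attached to the model version.
for key, value in model_version.run_metadata.items():
    print(key, value)
```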
Next Steps