Evaluating finetuned embeddings

Evaluate finetuned embeddings and compare to original base embeddings.

Now that we've finetuned our embeddings, we can evaluate them and compare to the base embeddings. We have all the data saved and versioned already, and we will reuse the same MatryoshkaLoss function for evaluation.

In code, our evaluation steps are easy to comprehend. Here, for example, is the base model evaluation step:

from typing import Annotated, Dict

import torch
from datasets import DatasetDict
from sentence_transformers import SentenceTransformer
from zenml import log_model_metadata, step

# get_evaluator, EMBEDDINGS_MODEL_ID_BASELINE and
# EMBEDDINGS_MODEL_MATRYOSHKA_DIMS come from this project's own helper
# and constants modules (not shown here).

def evaluate_model(
    dataset: DatasetDict, model: SentenceTransformer
) -> Dict[str, float]:
    """Evaluate the given model on the dataset."""
    evaluator = get_evaluator(
        dataset=dataset,
        model=model,
    )
    return evaluator(model)

@step
def evaluate_base_model(
    dataset: DatasetDict,
) -> Annotated[Dict[str, float], "base_model_evaluation_results"]:
    """Evaluate the base model on the given dataset."""
    model = SentenceTransformer(
        EMBEDDINGS_MODEL_ID_BASELINE,
        device="cuda" if torch.cuda.is_available() else "cpu",
    )

    results = evaluate_model(
        dataset=dataset,
        model=model,
    )

    # Convert numpy.float64 values to regular Python floats
    # (needed for serialization)
    base_model_eval = {
        f"dim_{dim}_cosine_ndcg@10": float(
            results[f"dim_{dim}_cosine_ndcg@10"]
        )
        for dim in EMBEDDINGS_MODEL_MATRYOSHKA_DIMS
    }

    log_model_metadata(
        metadata={"base_model_eval": base_model_eval},
    )

    return results
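
The get_evaluator helper used above isn't shown in the excerpt. As a rough sketch of the kind of thing it does (not the repository's exact implementation, and with illustrative dataset column names), it builds one InformationRetrievalEvaluator per Matryoshka dimension and chains them together with sentence-transformers' SequentialEvaluator:

from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim


def get_evaluator(
    dataset: DatasetDict, model: SentenceTransformer
) -> SequentialEvaluator:
    """Build one IR evaluator per Matryoshka dimension and chain them.

    The model argument is accepted to mirror the call above; evaluation
    itself happens later, when the returned evaluator is called with
    the model.
    """
    # Column names here are illustrative; the real dataset schema
    # (generated by distilabel) may use different names.
    queries = {str(i): q for i, q in enumerate(dataset["test"]["anchor"])}
    corpus = {str(i): d for i, d in enumerate(dataset["test"]["positive"])}
    # Each query is relevant to the document generated from the same chunk.
    relevant_docs = {qid: {qid} for qid in queries}

    evaluators = [
        InformationRetrievalEvaluator(
            queries=queries,
            corpus=corpus,
            relevant_docs=relevant_docs,
            name=f"dim_{dim}",  # yields metrics like dim_384_cosine_ndcg@10
            truncate_dim=dim,   # evaluate on the truncated embedding
            score_functions={"cosine": cos_sim},
        )
        for dim in EMBEDDINGS_MODEL_MATRYOSHKA_DIMS
    ]
    return SequentialEvaluator(evaluators)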

We log the results for our core Matryoshka dimensions as model metadata to ZenML within our evaluation step. This allows us to inspect these results from within the Model Control Plane (see below for more details). The results come back as a dictionary of string keys and float values which, like all step inputs and outputs, is versioned, tracked, and saved in your artifact store.
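
If you want to pull these logged values back out programmatically rather than through the dashboard, you can read them off the model version via the ZenML Client. This is a rough sketch: the model name is a placeholder for whatever your pipeline's Model configuration declares, and the exact shape of run_metadata values can differ slightly between ZenML versions:

from zenml.client import Client

# Placeholder model name; use the name declared in your pipeline's
# Model configuration. With no version specified, this should resolve
# to the latest model version.
model_version = Client().get_model_version("finetuned-embeddings")

# Depending on your ZenML version, metadata values may be plain values
# or wrapper objects exposing a `.value` attribute.
print(model_version.run_metadata["base_model_eval"])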

Visualizing results

It's possible to visualize results in a few different ways in ZenML, but one easy option is to output your chart as a PIL.Image object. (See our documentation for more ways to visualize your results.) The rest of the implementation of our visualize_results step is just simple matplotlib code to plot the base model evaluation against the finetuned model evaluation. We represent the results as percentage values and horizontally stack the two sets to make comparison a little easier.
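
As a rough sketch of the core of such a step (not the repository's exact implementation), it takes both evaluation dictionaries as inputs and returns a PIL.Image for ZenML to render:

import io
from typing import Dict

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from zenml import step


@step(enable_cache=False)
def visualize_results(
    base_results: Dict[str, float],
    finetuned_results: Dict[str, float],
) -> Image.Image:
    """Plot base vs. finetuned cosine NDCG@10 per Matryoshka dimension."""
    # Keep only the NDCG@10 metrics both runs share, in a stable order.
    keys = sorted(
        k
        for k in base_results
        if k.endswith("cosine_ndcg@10") and k in finetuned_results
    )
    base = [base_results[k] * 100 for k in keys]  # as percentages
    finetuned = [finetuned_results[k] * 100 for k in keys]

    y = np.arange(len(keys))
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.barh(y - 0.2, base, height=0.4, label="base")
    ax.barh(y + 0.2, finetuned, height=0.4, label="finetuned")
    ax.set_yticks(y)
    ax.set_yticklabels(keys)
    ax.set_xlabel("cosine NDCG@10 (%)")
    ax.legend()
    fig.tight_layout()

    # Render the matplotlib figure into a PIL.Image that ZenML can display.
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    buf.seek(0)
    image = Image.open(buf)
    image.load()
    return image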

We can see that our finetuned embeddings have improved the recall of our retrieval system across all of the dimensions, but the results are still not amazing. In a production setting, we would likely want to focus on improving the data used to train the embeddings. In particular, we could consider stripping out some of the log output included in the documentation, and perhaps omitting some pages which offer low signal for the retrieval task. This embeddings finetuning was run purely on the full set of synthetic data generated by distilabel and gpt-4o, so we wouldn't necessarily expect to see huge improvements out of the box, especially when the underlying data chunks are complex and contain multiple topics.

Model Control Plane as unified interface

Once all our pipelines are finished running, the best place to inspect our results as well as the artifacts and models we generated is the Model Control Plane.

The interface is split into sections that correspond to:

  • the artifacts generated by our steps

  • the models generated by our steps

  • the metadata logged by our steps

  • (potentially) any deployments of models made, though we haven't used deployments in this guide so far

  • any pipeline runs associated with this 'Model'

We can easily see which are the latest artifact or technical model versions, as well as compare the actual values of our evals or inspect the hardware or hyperparameters used for training.

This one-stop-shop interface is available on ZenML Pro and you can learn more about it in the Model Control Plane documentation.

Next Steps

Now that we've finetuned our embeddings and evaluated them, once they're in good enough shape we could bring them back into the original RAG pipeline, regenerate embeddings for our data, and then rerun our RAG retrieval evaluations to see how the scores improve in both our hand-crafted and LLM-powered evaluations.
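
Concretely, wiring the finetuned model back in mostly means pointing the embedding code at the new weights. A minimal sketch, where the model path/ID is a placeholder for wherever you saved or pushed your finetuned model:

from sentence_transformers import SentenceTransformer

# Placeholder path or Hub ID: wherever the finetuned model was saved/pushed.
model = SentenceTransformer("path/or/hub-id/of-your-finetuned-model")

# Re-embed the documentation chunks with the finetuned model, then rerun
# the retrieval evaluation pipelines from the earlier sections.
chunks = ["ZenML steps are decorated Python functions...", "..."]
chunk_embeddings = model.encode(chunks)
query_embedding = model.encode("How do I create a ZenML step?")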

The next section will cover LLM finetuning and deployment as the final part of our LLMOps guide. (This section is currently still a work in progress, but if you're eager to try out LLM finetuning with ZenML, you can use our LoRA project to get started. We also have a blog post guide which takes you through all the steps you need to finetune Llama 3.1 using GCP's Vertex AI with ZenML, including one-click stack creation!)

To try out the two pipelines, please follow the instructions in the project repository README, and you can find the full code in that same directory.
