# Evaluating finetuned embeddings

Now that we've finetuned our embeddings, we can evaluate them and compare to the base embeddings. We have all the data saved and versioned already, and we will reuse the same MatryoshkaLoss function for evaluation.

In code, our evaluation steps are easy to comprehend. Here, for example, is the base model evaluation step:

```python
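from typing import Dict

import torch
from datasets import DatasetDict
from sentence_transformers import SentenceTransformer
from typing_extensions import Annotated
from zenml import log_model_metadata, step

# `get_evaluator`, `EMBEDDINGS_MODEL_ID_BASELINE` and
# `EMBEDDINGS_MODEL_MATRYOSHKA_DIMS` are defined elsewhere in the project.
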
def evaluate_model(
    dataset: DatasetDict, model: SentenceTransformer
) -> Dict[str, float]:
    """Evaluate the given model on the dataset."""
    evaluator = get_evaluator(
        dataset=dataset,
        model=model,
    )
    return evaluator(model)

@step
def evaluate_base_model(
    dataset: DatasetDict,
) -> Annotated[Dict[str, float], "base_model_evaluation_results"]:
    """Evaluate the base model on the given dataset."""
    model = SentenceTransformer(
        EMBEDDINGS_MODEL_ID_BASELINE,
        device="cuda" if torch.cuda.is_available() else "cpu",
    )

    results = evaluate_model(
        dataset=dataset,
        model=model,
    )

    # Convert numpy.float64 values to regular Python floats
    # (needed for serialization)
    base_model_eval = {
        f"dim_{dim}_cosine_ndcg@10": float(
            results[f"dim_{dim}_cosine_ndcg@10"]
        )
        for dim in EMBEDDINGS_MODEL_MATRYOSHKA_DIMS
    }

    log_model_metadata(
        metadata={"base_model_eval": base_model_eval},
    )

    return results
```

We log the results for our core Matryoshka dimensions as model metadata to ZenML within our evaluation step. This will allow us to inspect these results from within [the Model Control Plane](https://docs.zenml.io/how-to/model-management-metrics/model-control-plane/) (see below for more details). Our results come in the form of a dictionary of string keys and float values which will, like all step inputs and outputs, be versioned, tracked and saved in your artifact store.
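To build some intuition for what a key such as `dim_64_cosine_ndcg@10` measures, here is a minimal, dependency-free sketch. It assumes the score is obtained by truncating each embedding to its first `dim` components, ranking documents by cosine similarity, and computing NDCG@10 with binary relevance. The helper names (`truncate_and_normalize`, `cosine_rank`, `ndcg_at_k`) and the toy vectors are hypothetical and are not taken from the project or from `sentence-transformers`, whose evaluator may differ in details.

```python
import math
from typing import Dict, List, Set


def truncate_and_normalize(vector: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale them to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine_rank(
    query: List[float], corpus: Dict[str, List[float]], dim: int
) -> List[str]:
    """Return corpus ids sorted by cosine similarity to the query (best first)."""
    q = truncate_and_normalize(query, dim)
    scores = {
        doc_id: sum(a * b for a, b in zip(q, truncate_and_normalize(vec, dim)))
        for doc_id, vec in corpus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def ndcg_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance: 1 if a document is relevant, else 0."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data (made up for illustration): one query, two documents, one relevant.
query = [0.6, 0.0, 0.8, 0.0]
corpus = {
    "doc_a": [0.5, 0.1, 0.8, 0.0],
    "doc_b": [0.9, 0.0, -0.4, 0.0],
}
relevant = {"doc_a"}

# Score the same embeddings at two truncation sizes.
scores = {
    f"dim_{dim}_cosine_ndcg@10": ndcg_at_k(cosine_rank(query, corpus, dim), relevant)
    for dim in (2, 4)
}
print(scores)
```

In this toy setup the relevant document only ranks first once the third component is included, which is the kind of difference between dimensions that the per-dimension scores are meant to surface.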

### Visualizing results

It's possible to visualize results in a few different ways in ZenML, but one easy option is just to output your chart as a `PIL.Image` object. (See our [documentation on more ways to visualize your results](https://docs.zenml.io/how-to/data-artifact-management/visualize-artifacts).) The rest of the implementation of our `visualize_results` step is just simple `matplotlib` code that plots the base model evaluation against the finetuned model evaluation. We represent the results as percentage values and horizontally stack the two sets to make comparison a little easier.
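As a rough sketch of what such plotting code could look like (this is not the project's actual `visualize_results` implementation), the snippet below converts the NDCG@10 scores to percentages and draws grouped horizontal bars, then returns the chart as a `PIL.Image`. The function names `to_percentages` and `plot_comparison`, the grouped-bar layout, and the sample numbers are assumptions made for illustration. The plotting libraries are imported lazily so that the data-preparation helper works without them.

```python
import io
from typing import Dict, List


def to_percentages(results: Dict[str, float], dims: List[int]) -> List[float]:
    """Pick the NDCG@10 score for each dimension and scale it to a percentage."""
    return [round(results[f"dim_{dim}_cosine_ndcg@10"] * 100, 2) for dim in dims]


def plot_comparison(
    base: Dict[str, float], finetuned: Dict[str, float], dims: List[int]
):
    """Plot base vs. finetuned scores and return the chart as a PIL image."""
    # Imported here so `to_percentages` has no plotting dependency.
    import matplotlib

    matplotlib.use("Agg")  # headless backend, no display needed
    import matplotlib.pyplot as plt
    from PIL import Image

    base_pct = to_percentages(base, dims)
    tuned_pct = to_percentages(finetuned, dims)
    positions = list(range(len(dims)))
    height = 0.35  # thickness of each bar

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.barh([p - height / 2 for p in positions], base_pct, height, label="Base")
    ax.barh(
        [p + height / 2 for p in positions], tuned_pct, height, label="Finetuned"
    )
    ax.set_yticks(positions)
    ax.set_yticklabels([str(dim) for dim in dims])
    ax.set_xlabel("NDCG@10 (%)")
    ax.set_ylabel("Embedding dimension")
    ax.legend()

    # Render the figure into an in-memory PNG and load it as a PIL image.
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png", bbox_inches="tight")
    plt.close(fig)
    buffer.seek(0)
    return Image.open(buffer)


# Made-up scores, only to show the expected input shape.
sample_dims = [64, 128]
sample_base = {"dim_64_cosine_ndcg@10": 0.5, "dim_128_cosine_ndcg@10": 0.25}
sample_pct = to_percentages(sample_base, sample_dims)
```

Inside a ZenML step, returning the image from `plot_comparison` would be the equivalent of the "output your chart as a `PIL.Image`" approach described above.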

![Visualizing finetuned embeddings evaluation results](/files/0zzYU9cCjzCiqDSYD7Cp)

We can see that our finetuned embeddings have improved the recall of our retrieval system across all of the dimensions, but the results are still not amazing. In a production setting, we would likely want to focus on improving the data being used for the embeddings training. In particular, we could consider stripping out some of the log output from the documentation, and perhaps omitting some pages which offer low signal for the retrieval task. This embeddings finetuning was run purely on the full set of synthetic data generated by `distilabel` and `gpt-4o`, so we wouldn't necessarily expect to see huge improvements out of the box, especially when the underlying data chunks are complex and contain multiple topics.

### Model Control Plane as unified interface

Once all our pipelines are finished running, the best place to inspect our results as well as the artifacts and models we generated is the Model Control Plane.

![Model Control Plane](/files/lUcj2sHi3mVOqqaE5IUp)

The interface is split into sections that correspond to:

* the artifacts generated by our steps
* the models generated by our steps
* the metadata logged by our steps
* (potentially) any model deployments, though we haven't used these in this guide so far
* any pipeline runs associated with this 'Model'

We can easily see which are the latest artifact or technical model versions, as well as compare the actual values of our evals or inspect the hardware or hyperparameters used for training.

This one-stop-shop interface is available on ZenML Pro and you can learn more about it in the [Model Control Plane documentation](https://docs.zenml.io/how-to/model-management-metrics/model-control-plane/).

### Next Steps

Now that we've finetuned and evaluated our embeddings, once they are in good enough shape for use we could bring them into [the original RAG pipeline](/user-guides/llmops-guide/rag-with-zenml/basic-rag-inference-pipeline.md), regenerate the embeddings for our data and then rerun our RAG retrieval evaluations to see how they've improved in our hand-crafted and LLM-powered evaluations.

The next section will cover [LLM finetuning and deployment](/user-guides/llmops-guide/finetuning-llms.md) as the final part of our LLMOps guide. (This section is currently still a work in progress, but if you're eager to try out LLM finetuning with ZenML, you can use [our LoRA project](https://github.com/zenml-io/zenml-projects/blob/main/gamesense/README.md) to get started. We also have [a blogpost](https://www.zenml.io/blog/how-to-finetune-llama-3-1-with-zenml) guide which takes you through [all the steps you need to finetune Llama 3.1](https://www.zenml.io/blog/how-to-finetune-llama-3-1-with-zenml) using GCP's Vertex AI with ZenML, including one-click stack creation!)

To try out the two pipelines, please follow the instructions in [the project repository README](https://github.com/zenml-io/zenml-projects/blob/main/llm-complete-guide/README.md), and you can find the full code in that same directory.



