Perform Drift Detection
Guard against data drift with our Evidently integration.
This is an older version of the ZenML documentation. To check the latest version please visit https://docs.zenml.io
Data drift is something you often want to guard against in your pipelines. Machine learning pipelines are built on top of data inputs, so if your model was trained on a certain distribution of data, it is worth checking whether newer data has drifted away from that distribution. What follows is an example of how to use a drift detection tool that ZenML currently integrates with. It takes the form of a standard step that makes the relevant calculations for you.
🗺 Overview
Evidently is an open-source library for checking data drift (among other features). At its core, Evidently's drift detection takes in a reference dataset and compares it against a comparison dataset. Both are passed in as pandas DataFrames, though CSV inputs are also possible. You can receive the results as a standard dictionary object containing all the relevant information, or as a visualization. ZenML supports both outputs.
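To make the reference-versus-comparison idea concrete, here is a minimal sketch in plain Python. It is not Evidently's implementation: it assumes columns are given as a dictionary of lists rather than DataFrames, uses a hand-written two-sample Kolmogorov-Smirnov statistic as the drift score, and applies an arbitrary threshold of 0.5. The names `ks_statistic` and `detect_drift` are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, comparison):
    """Return the two-sample Kolmogorov-Smirnov statistic.

    This is the largest gap between the two empirical CDFs.
    """
    ref = sorted(reference)
    comp = sorted(comparison)
    if not ref or not comp:
        raise ValueError("both datasets must be non-empty")
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for value in ref + comp:
        ref_cdf = bisect_right(ref, value) / len(ref)
        comp_cdf = bisect_right(comp, value) / len(comp)
        largest_gap = max(largest_gap, abs(ref_cdf - comp_cdf))
    return largest_gap


def detect_drift(reference, comparison, threshold=0.5):
    """Compare each shared column and report the result as a dictionary.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    report = {}
    for column in sorted(set(reference) & set(comparison)):
        score = ks_statistic(reference[column], comparison[column])
        report[column] = {
            "ks_statistic": score,
            "drift_detected": score > threshold,
        }
    return report


# Toy data: "age" shifts completely, "height" barely moves.
reference = {"age": [21, 25, 30, 35, 40], "height": [160, 165, 170, 175, 180]}
comparison = {"age": [51, 55, 60, 65, 70], "height": [161, 166, 171, 174, 179]}
report = detect_drift(reference, comparison)
```

In this toy data the `age` column is flagged as drifted while `height` is not. A real profile from Evidently contains far more detail than this dictionary.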
ZenML implements this functionality in the form of several standardized steps. You select which of the profile sections you want to use in your step by passing a string into the EvidentlyProfileConfig. Possible options supported by Evidently are:
"datadrift"
"categoricaltargetdrift"
"numericaltargetdrift"
"classificationmodelperformance"
"regressionmodelperformance"
"probabilisticmodelperformance"
"dataquality" (NOT CURRENTLY IMPLEMENTED)
🧰 How to validate data inside a ZenML step
With Evidently, we compare two separate DataFrames. ZenML provides custom steps that you can set up for drift detection.
Defining the step takes very little code with our class-based interface; you then just pass in the two DataFrames for the comparison to take place.
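When configuring the step, a typo in a profile section name is easy to make. The sketch below checks section names up front against the options listed in the overview. It is a hypothetical helper written in plain Python, not part of ZenML or Evidently; the name `validate_profile_sections` and the choice of exceptions are assumptions.

```python
# Section names taken from the list in the overview.
SUPPORTED_SECTIONS = {
    "datadrift",
    "categoricaltargetdrift",
    "numericaltargetdrift",
    "classificationmodelperformance",
    "regressionmodelperformance",
    "probabilisticmodelperformance",
}
# Listed in the overview as not currently implemented.
NOT_YET_IMPLEMENTED = {"dataquality"}


def validate_profile_sections(sections):
    """Return the sections as a list if every one of them is usable."""
    for section in sections:
        if section in NOT_YET_IMPLEMENTED:
            raise NotImplementedError(
                f"{section!r} is not currently implemented"
            )
        if section not in SUPPORTED_SECTIONS:
            raise ValueError(f"unknown profile section: {section!r}")
    return list(sections)


checked = validate_profile_sections(["datadrift", "numericaltargetdrift"])
```

A validated list like `checked` could then be handed to whatever configuration object your version of the integration expects.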
You can pass the two DataFrames in at the point where you define your pipeline.
For the full context, please see our drift_detection example. The key part of the pipeline definition is where the datasets produced by the data_splitter step (i.e. function) are passed in as arguments to the drift_detector function.
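The data flow described above can be sketched in plain Python. This is only an illustration of how the outputs of `data_splitter` feed `drift_detector`: ZenML's step and pipeline machinery is deliberately left out, the datasets are lists of rows rather than DataFrames, and the drift check is a simple mean shift instead of an Evidently profile. The names `data_loader`, `drift_detection_pipeline`, `reference_dataset`, and `comparison_dataset` are assumptions.

```python
def data_loader():
    """Return a toy dataset as a list of rows (a stand-in for a DataFrame)."""
    return [{"feature": float(i)} for i in range(10)]


def data_splitter(rows):
    """Split the rows into a reference half and a comparison half."""
    midpoint = len(rows) // 2
    return rows[:midpoint], rows[midpoint:]


def drift_detector(reference_dataset, comparison_dataset):
    """Report the shift in the mean of `feature` between the two datasets."""

    def mean(rows):
        return sum(row["feature"] for row in rows) / len(rows)

    return {"mean_shift": mean(comparison_dataset) - mean(reference_dataset)}


def drift_detection_pipeline(data_loader, data_splitter, drift_detector):
    """Wire the steps together: load, split, then compare the two halves."""
    data = data_loader()
    reference_dataset, comparison_dataset = data_splitter(data)
    return drift_detector(
        reference_dataset=reference_dataset,
        comparison_dataset=comparison_dataset,
    )


result = drift_detection_pipeline(data_loader, data_splitter, drift_detector)
```

Passing the steps in as arguments keeps the pipeline definition independent of any particular loader or detector, so either can be swapped without touching the wiring.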
We also make it easy to use the Evidently visualization tool to display data drift diagrams in your browser or within a Jupyter notebook.
A few lines of code give you access to the Evidently visualizer based on the completed pipeline run.