Evidently
How to keep your data quality in check and guard against data and model drift with Evidently profiling
The Evidently Data Validator flavor provided with the ZenML integration uses Evidently to perform data quality, data drift, model drift and model performance analyses, to generate reports and to run checks. The reports and check results can be used to implement automated corrective actions in your pipelines or to render interactive representations for further visual interpretation, evaluation and documentation.
When would you want to use it?
Evidently is an open-source library that you can use to monitor and debug machine learning models by analyzing the data that they use. It offers a powerful set of data profiling and visualization features, and it can run a variety of data and model validation reports and tests: data integrity tests that work with a single dataset, model evaluation tests, data drift analysis and model performance comparison tests. All this can be done with minimal configuration input from the user, or customized with specialized conditions that the validation tests should check.
Evidently currently works with tabular data in pandas.DataFrame or CSV file formats and can handle both regression and classification tasks.
You should use the Evidently Data Validator when you need the following data and/or model validation features that are possible with Evidently:
Data Quality reports and tests: provides detailed feature statistics and a feature behavior overview for a single dataset. It can also compare any two datasets. E.g. you can use it to compare train and test data, reference and current data, or two subgroups of one dataset.
Data Drift reports and tests: helps detect and explore feature distribution changes in the input data by comparing two datasets with identical schema.
Target Drift reports and tests: helps detect and explore changes in the target function and/or model predictions by comparing two datasets where the target and/or prediction columns are available.
Regression Performance or Classification Performance reports and tests: evaluate the performance of a model by analyzing a single dataset where both the target and prediction columns are available. When you provide a second dataset, they can also compare that performance to the past performance of the same model or to the performance of an alternative model.
You should consider one of the other Data Validator flavors if you need a different set of data validation features.
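To make the idea behind the Data Drift reports more concrete, the following is a minimal, library-independent sketch of comparing two datasets with an identical schema. It uses a hand-written two-sample Kolmogorov-Smirnov statistic; Evidently selects and runs its own statistical tests, so this is not its implementation. The function names, the dict-of-lists data layout and the 0.3 threshold are all assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two numeric samples."""
    ref = sorted(reference)
    cur = sorted(current)
    # The gap can only be maximal at an observed value, so check each one.
    points = sorted(set(ref) | set(cur))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )


def drifted_columns(reference, current, threshold=0.3):
    """Return the columns whose KS statistic exceeds a hypothetical threshold.

    Both datasets are dicts mapping column name -> list of numbers and must
    share the same columns, mirroring the identical-schema requirement.
    """
    if reference.keys() != current.keys():
        raise ValueError("datasets must have identical columns")
    scores = {col: ks_statistic(reference[col], current[col]) for col in reference}
    return {col: score for col, score in scores.items() if score > threshold}


# Toy data: "age" barely moves, "income" shifts completely.
reference = {"age": [21, 25, 30, 35, 40, 45], "income": [30, 32, 35, 38, 40, 41]}
current = {"age": [22, 26, 29, 36, 41, 44], "income": [60, 62, 65, 68, 70, 71]}

print(drifted_columns(reference, current))
```

In this toy data only the income column is flagged, because its reference and current values do not overlap at all.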
How do you deploy it?
The Evidently Data Validator flavor is included in the Evidently ZenML integration. You need to install it on your local machine to be able to register an Evidently Data Validator and add it to your stack:
The Data Validator stack component does not have any configuration parameters. Adding it to a stack is as simple as running e.g.:
How do you use it?
Data Profiling
Evidently's profiling functions take in a pandas.DataFrame dataset or a pair of datasets and generate results in the form of a Report object.
One of Evidently's notable characteristics is that it only requires datasets as input. Even when running model performance comparison analyses, no model needs to be present. However, that does mean that the input data needs to include additional target and prediction columns for some profiling reports, and you have to include additional information about the dataset columns in the form of column mappings. Depending on how your data is structured, you may also need to include additional steps in your pipeline before the data validation step to insert the additional target and prediction columns into your data. This may also require interacting with one or more models.
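The following sketch illustrates why no model is needed at analysis time once the target and prediction columns are present. It is plain Python rather than Evidently or pandas code: the rows are dicts, the column mapping is an ordinary dict standing in for a real column mapping object, and the "model" is a stand-in lambda. All names here are hypothetical.

```python
def add_predictions(rows, predict, feature, prediction_column):
    """Return copies of the rows with a prediction column filled in.

    `predict` is any callable; in a real pipeline this is the point where a
    model would be involved, before the validation step runs.
    """
    return [{**row, prediction_column: predict(row[feature])} for row in rows]


def mean_absolute_error(rows, column_mapping):
    """Compute MAE from rows that already carry target and prediction values."""
    target = column_mapping["target"]
    prediction = column_mapping["prediction"]
    errors = [abs(row[target] - row[prediction]) for row in rows]
    return sum(errors) / len(errors)


# Hypothetical mapping: tells the analysis which column plays which role.
column_mapping = {"target": "price", "prediction": "predicted_price"}

raw_rows = [
    {"size": 50, "price": 100.0},
    {"size": 80, "price": 170.0},
    {"size": 120, "price": 230.0},
]

# Stand-in "model": the price is assumed to be twice the size.
rows = add_predictions(raw_rows, lambda size: 2.0 * size, "size", "predicted_price")

# From here on, only the data and the mapping are needed, not the model.
print(mean_absolute_error(rows, column_mapping))
```

The evaluation function never sees the model, only the two columns named in the mapping, which is the same separation the paragraph above describes.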
There are three ways you can use Evidently to generate data reports in your ZenML pipelines that allow different levels of flexibility:
instantiate, configure and insert the standard Evidently report step shipped with ZenML into your pipelines. This is the easiest way and the recommended approach.
call the data validation methods provided by the Evidently Data Validator in your custom step implementation. This method allows for more flexibility concerning what can happen in the pipeline step.
use the Evidently library directly in your custom step implementation. This gives you complete freedom in how you are using Evidently's features.
You can visualize Evidently reports in Jupyter notebooks or view them directly in the ZenML dashboard by clicking on the respective artifact in the pipeline run DAG.
The Evidently Report step
ZenML wraps the Evidently data profiling functionality in the form of a standard Evidently report pipeline step that you can simply instantiate and insert in your pipeline. Here you can see how instantiating and configuring the standard Evidently report step can be done:
The configuration shown in the example is the equivalent of running the following Evidently code inside the step:
Let's break this down...
We configure the evidently_report_step using parameters that you would normally pass to the Evidently Report object to configure and run an Evidently report. It consists of the following fields:
column_mapping: This is an EvidentlyColumnMapping object that is the exact equivalent of the ColumnMapping object in Evidently. It is used to describe the columns in the dataset and how they should be treated (e.g. as categorical, numerical, or text features).
metrics: This is a list of EvidentlyMetricConfig objects that are used to configure the metrics that should be used to generate the report in a declarative way. This is the same as configuring the metrics that go in the Evidently Report.
download_nltk_data: This is a boolean that is used to indicate whether the NLTK data should be downloaded. This is only needed if you are using Evidently reports that handle text data, which require the NLTK data to be downloaded ahead of time.
There are several ways you can reference the Evidently metrics when configuring EvidentlyMetricConfig items:
by class name: this is the easiest way to reference an Evidently metric. You can use the name of a metric or metric preset class as it appears in the Evidently documentation (e.g. "DataQualityPreset", "DatasetDriftMetric").
by full class path: you can also use the full Python class path of the metric or metric preset class (e.g. "evidently.metric_preset.DataQualityPreset", "evidently.metrics.DatasetDriftMetric"). This is useful if you want to use metrics or metric presets that are not included in the Evidently library.
by passing in the class itself: you can also import and pass in an Evidently metric or metric preset class itself, e.g.:
As can be seen in the example, there are two basic ways of adding metrics to your Evidently report step configuration:
to add a single metric or metric preset: call EvidentlyMetricConfig.metric with an Evidently metric or metric preset class name (or class path or class). The rest of the parameters are the same ones that you would usually pass to the Evidently metric or metric preset class constructor.
to generate multiple metrics, similar to calling the Evidently column metric generator: call EvidentlyMetricConfig.metric_generator with an Evidently metric or metric preset class name (or class path or class) and a list of column names. The rest of the parameters are the same ones that you would usually pass to the Evidently metric or metric preset class constructor.
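The two ideas above, referencing a class by name, by path or directly, and expanding one entry into one entry per column, can be sketched in plain Python. This is only an illustration of the pattern, not ZenML's implementation: resolve_class, config and config_generator are hypothetical helpers, the default module is an assumption, and a standard-library class stands in for an Evidently metric so the sketch runs without extra dependencies.

```python
import importlib
from collections import OrderedDict


def resolve_class(reference, default_module="collections"):
    """Resolve a class given as a class, a dotted path, or a bare name."""
    if isinstance(reference, type):
        return reference  # already a class, nothing to look up
    module_name, _, class_name = reference.rpartition(".")
    # A bare name has no module part, so fall back to the default module.
    module = importlib.import_module(module_name or default_module)
    return getattr(module, class_name)


def config(reference, **parameters):
    """Describe one object to build later: the class plus constructor kwargs."""
    return {"cls": resolve_class(reference), "parameters": parameters}


def config_generator(reference, columns, **parameters):
    """Expand one reference into one config per column name."""
    return [config(reference, column_name=col, **parameters) for col in columns]


# All three reference styles end up pointing at the same class.
references = ["OrderedDict", "collections.OrderedDict", OrderedDict]
resolved = [resolve_class(ref) for ref in references]
print(all(cls is OrderedDict for cls in resolved))

# One declaration becomes one config per column, sharing the other parameters.
configs = config_generator("OrderedDict", columns=["age", "income"], threshold=0.1)
print([c["parameters"] for c in configs])
```

Keeping the configuration declarative in this way means the actual objects are only constructed when the step runs, which is the design the two configuration calls described above suggest.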
The ZenML Evidently report step can then be inserted into your pipeline, where it takes in two datasets and outputs the Evidently report generated in both JSON and HTML formats, e.g.:
For a version of the same step that works with a single dataset, simply don't pass any comparison dataset:
You should consult the official Evidently documentation for more information on what each metric is useful for and what data columns it requires as input.
The evidently_report_step also allows for additional Report options to be passed to the Report constructor, e.g.:
You can view the complete list of configuration parameters in the SDK docs.
Data Validation
Aside from data profiling, Evidently can also be used to configure and run automated data validation tests on your data.
Similar to using Evidently through ZenML to run data profiling, there are three ways you can use Evidently to run data validation tests in your ZenML pipelines that allow different levels of flexibility:
instantiate, configure and insert the standard Evidently test step shipped with ZenML into your pipelines. This is the easiest way and the recommended approach.
call the data validation methods provided by the Evidently Data Validator in your custom step implementation. This method allows for more flexibility concerning what can happen in the pipeline step.
use the Evidently library directly in your custom step implementation. This gives you complete freedom in how you are using Evidently's features.
You can visualize Evidently reports in Jupyter notebooks or view them directly in the ZenML dashboard by clicking on the respective artifact in the pipeline run DAG.
ZenML wraps the Evidently data validation functionality in the form of a standard Evidently test pipeline step that you can simply instantiate and insert in your pipeline. Here you can see how instantiating and configuring the standard Evidently test step can be done using our included evidently_test_step utility function:
The configuration shown in the example is the equivalent of running the following Evidently code inside the step:
Let's break this down...
We configure the evidently_test_step using parameters that you would normally pass to the Evidently TestSuite object to configure and run an Evidently test suite. It consists of the following fields:
column_mapping: This is an EvidentlyColumnMapping object that is the exact equivalent of the ColumnMapping object in Evidently. It is used to describe the columns in the dataset and how they should be treated (e.g. as categorical, numerical, or text features).
tests: This is a list of EvidentlyTestConfig objects that are used to configure the tests that will be run as part of your test suite in a declarative way. This is the same as configuring the tests that go in the Evidently TestSuite.
download_nltk_data: This is a boolean that is used to indicate whether the NLTK data should be downloaded. This is only needed if you are using Evidently tests or test presets that handle text data, which require the NLTK data to be downloaded ahead of time.
There are several ways you can reference the Evidently tests when configuring EvidentlyTestConfig items, similar to how you reference them in an EvidentlyMetricConfig object:
by class name: this is the easiest way to reference an Evidently test. You can use the name of a test or test preset class as it appears in the Evidently documentation (e.g. "DataQualityTestPreset", "TestColumnRegExp").
by full class path: you can also use the full Python class path of the test or test preset class (e.g. "evidently.test_preset.DataQualityTestPreset", "evidently.tests.TestColumnRegExp"). This is useful if you want to use tests or test presets that are not included in the Evidently library.
by passing in the class itself: you can also import and pass in an Evidently test or test preset class itself, e.g.:
As can be seen in the example, there are two basic ways of adding tests to your Evidently test step configuration:
to add a single test or test preset: call EvidentlyTestConfig.test with an Evidently test or test preset class name (or class path or class). The rest of the parameters are the same ones that you would usually pass to the Evidently test or test preset class constructor.
to generate multiple tests, similar to calling the Evidently column test generator: call EvidentlyTestConfig.test_generator with an Evidently test or test preset class name (or class path or class) and a list of column names. The rest of the parameters are the same ones that you would usually pass to the Evidently test or test preset class constructor.
The ZenML Evidently test step can then be inserted into your pipeline, where it takes in two datasets and outputs the Evidently test suite results generated in both JSON and HTML formats, e.g.:
For a version of the same step that works with a single dataset, simply don't pass any comparison dataset:
You should consult the official Evidently documentation for more information on what each test is useful for and what data columns it requires as input.
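Because the test results are available as JSON, a downstream step can parse them and decide what to do next, which is one way to implement the automated corrective actions mentioned at the top of this page. The sketch below assumes a payload with a "tests" list whose entries carry "name" and "status" fields; that layout, the sample payload and the action names are assumptions for illustration and should be checked against the JSON your own test step produces.

```python
import json


def failed_tests(results_json):
    """Return the names of tests whose status is not SUCCESS.

    Assumes a payload shaped like {"tests": [{"name": ..., "status": ...}]};
    verify this layout against the JSON your own test step produces.
    """
    results = json.loads(results_json)
    return [t["name"] for t in results["tests"] if t["status"] != "SUCCESS"]


def choose_action(results_json):
    """Pick a hypothetical corrective action based on the test outcomes."""
    failures = failed_tests(results_json)
    if not failures:
        return "deploy"
    return "retrain: " + ", ".join(failures)


# Hand-written sample payload, not real Evidently output.
sample = json.dumps(
    {
        "tests": [
            {"name": "Share of Missing Values", "status": "SUCCESS"},
            {"name": "Share of Drifted Columns", "status": "FAIL"},
        ]
    }
)

print(choose_action(sample))
```

In a ZenML pipeline, logic like choose_action would live in a custom step that consumes the JSON output of the test step.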
The evidently_test_step also allows for additional Test options to be passed to the TestSuite constructor, e.g.:
You can view the complete list of configuration parameters in the SDK docs.
The Evidently Data Validator
The Evidently Data Validator implements the same interface as all other Data Validators. Calling its methods from a custom step therefore keeps you compatible with the overall Data Validator abstraction, which makes migration easier if you decide to switch to another Data Validator.
All you have to do is call the Evidently Data Validator methods when you need to interact with Evidently to generate data reports or to run test suites, e.g.:
Have a look at the complete list of methods and parameters available in the EvidentlyDataValidator API in the SDK docs.
Call Evidently directly
You can use the Evidently library directly in your custom pipeline steps, e.g.:
Visualizing Evidently Reports
You can view visualizations of the Evidently reports generated by your pipeline steps directly in the ZenML dashboard by clicking on the respective artifact in the pipeline run DAG.
Alternatively, if you are running inside a Jupyter notebook, you can load and render the reports using the artifact.visualize() method, e.g.: