Data Validators

How to enhance and maintain the quality of your data and the performance of your models with data profiling and validation

This is an older version of the ZenML documentation. For the latest version, please visit the current ZenML documentation.

Without good data, even the best machine learning models will yield questionable results. A lot of effort goes into ensuring and maintaining data quality, not only in the initial stages of model development but throughout the entire machine learning project lifecycle. Data Validators are a category of ML libraries, tools and frameworks that provide a wide range of features and best practices for keeping data quality in check in your ML pipelines and for monitoring model performance so that it does not degrade over time.

Data profiling, data integrity testing, and data and model drift detection are all ways of applying data validation techniques at the points in your ML pipelines where data is concerned: data ingestion, model training and evaluation, and online or batch inference. Data profiles and model performance evaluation results can be visualized and analyzed to detect problems and take preventive or corrective actions.
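To make the idea of a data profile concrete, here is a minimal sketch using only the Python standard library. It is not how any of the Data Validator libraries compute their profiles; the function names (`profile_column`, `profile_table`) and the toy data are hypothetical and exist only for illustration.

```python
from statistics import mean
from typing import Any, Dict, List, Optional


def profile_column(values: List[Optional[float]]) -> Dict[str, Any]:
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": mean(present) if present else None,
    }


def profile_table(
    table: Dict[str, List[Optional[float]]]
) -> Dict[str, Dict[str, Any]]:
    """Build a per-column profile for a column-oriented table."""
    return {name: profile_column(column) for name, column in table.items()}


# Hypothetical toy data, for illustration only.
ingested = {
    "age": [34, 51, None, 29],
    "income": [52000.0, 61000.0, 58000.0, None],
}
profile = profile_table(ingested)
```

A profile like this, recorded at ingestion time, gives later pipeline stages a baseline to compare against. The real libraries produce much richer profiles (distributions, types, correlations) along with visualizations.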

Related concepts:

The Data Validator is an optional type of Stack Component that needs to be registered as part of your ZenML Stack.
Data Validators used in ZenML pipelines usually generate data profiles and data quality check reports that are versioned and stored in the Artifact Store. They can be retrieved and inspected using the post-execution workflow API.

When to use it

Data-centric AI practices are quickly becoming mainstream, and using Data Validators is an easy way to incorporate them into your workflow. These are some common cases where you may consider using Data Validators in your pipelines:

early on, even if it's just to keep a log of the quality state of your data and the performance of your models at different stages of development.
if you have pipelines that regularly ingest new data, you should use data validation to run regular data integrity checks to signal problems before they are propagated downstream.
in continuous training pipelines, you should use data validation techniques to compare new training data against a data reference and to compare the performance of newly trained models against previous ones.
when you have pipelines that automate batch inference or if you regularly collect data used as input in online inference, you should use data validation to run data drift analyses and detect training-serving skew, data drift and model drift.
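As a rough illustration of comparing new data against a reference, the sketch below flags drift when the mean of a new batch moves too far from the reference mean, measured in reference standard deviations. This is a deliberately simple heuristic, not the statistical tests that the Data Validator libraries implement; the function names, the threshold of 2.0, and the sample values are all assumptions made for the example.

```python
from statistics import mean, stdev
from typing import List


def mean_shift(reference: List[float], current: List[float]) -> float:
    """Distance between the two means, in reference standard deviations."""
    ref_std = stdev(reference)
    if ref_std == 0:
        raise ValueError("reference column is constant; shift is undefined")
    return abs(mean(current) - mean(reference)) / ref_std


def has_drifted(
    reference: List[float], current: List[float], threshold: float = 2.0
) -> bool:
    """Flag drift when the mean shift exceeds the (assumed) threshold."""
    return mean_shift(reference, current) > threshold


# Hypothetical feature values, for illustration only.
reference_ages = [30.0, 32.0, 34.0, 36.0, 38.0]
similar_batch = [31.0, 33.0, 35.0, 37.0]
shifted_batch = [50.0, 52.0, 54.0, 56.0]

similar_drifted = has_drifted(reference_ages, similar_batch)
shifted_drifted = has_drifted(reference_ages, shifted_batch)
```

A mean-based check misses changes in shape or variance, which is one reason to prefer the drift tests that ship with a dedicated Data Validator.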

Data Validator Flavors

Data Validators are optional stack components provided by integrations. The following table lists the currently available Data Validators and summarizes their features and the data types and model types that they can be used with in ZenML pipelines:

Data Validator	Validation Features	Data Types	Model Types	Notes	Flavor/Integration
Deepchecks	data quality, data drift, model drift, model performance	tabular: `pandas.DataFrame`; CV: `torch.utils.data.dataloader.DataLoader`	tabular: `sklearn.base.ClassifierMixin`; CV: `torch.nn.Module`	Add Deepchecks data and model validation tests to your pipelines	`deepchecks`
Evidently	data quality, data drift, model drift, model performance	tabular: `pandas.DataFrame`	N/A	Use Evidently to generate a variety of data quality and data/model drift reports and visualizations	`evidently`
Great Expectations	data profiling, data quality	tabular: `pandas.DataFrame`	N/A	Perform data testing, documentation and profiling with Great Expectations	`great_expectations`
Whylogs/WhyLabs	data drift	tabular: `pandas.DataFrame`	N/A	Generate data profiles with whylogs and upload them to WhyLabs	`whylogs`

If you would like to see the available flavors of Data Validator, you can use the command:

zenml data-validator flavor list

How to use it

Every Data Validator has different data profiling and testing capabilities and analyzes your data and models in a slightly different way, but the general workflow is the same:

first, you have to configure and add a Data Validator to your ZenML stack
every integration includes one or more built-in data validation steps that you can add to your pipelines. Of course, you can also use the libraries directly in your own custom pipeline steps and simply return the results (e.g. data profiles, test reports) as artifacts that are versioned and stored by ZenML in its Artifact Store.
you can access the data validation artifacts in subsequent pipeline steps or you can load them in the post-execution workflow to process them or visualize them as needed.
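The custom-step route described above can be sketched in plain Python. The two functions below stand in for the bodies of a validation step and a downstream step that consumes its report. They are hypothetical and deliberately leave out the ZenML step and pipeline decorators, because the exact imports depend on the ZenML version you are running. In a real pipeline, the returned report would be the artifact that ZenML versions and stores.

```python
from typing import Any, Dict, List


def validate_rows(
    rows: List[Dict[str, Any]], required: List[str]
) -> Dict[str, Any]:
    """Body of a hypothetical validation step: build a small test report."""
    # Count rows where each required column is absent or None.
    missing_counts = {
        column: sum(1 for row in rows if row.get(column) is None)
        for column in required
    }
    return {
        "rows_checked": len(rows),
        "missing_counts": missing_counts,
        "passed": all(count == 0 for count in missing_counts.values()),
    }


def gate_on_report(report: Dict[str, Any]) -> None:
    """Body of a hypothetical downstream step: stop when validation failed."""
    if not report["passed"]:
        raise ValueError(f"Data validation failed: {report['missing_counts']}")


# Hypothetical records, for illustration only.
clean = [{"id": 1, "label": "a"}, {"id": 2, "label": "b"}]
dirty = [{"id": 1, "label": None}, {"id": None, "label": "b"}]

clean_report = validate_rows(clean, ["id", "label"])
dirty_report = validate_rows(dirty, ["id", "label"])
gate_on_report(clean_report)  # passes silently; dirty_report would raise
```

Returning the report as a value, rather than raising inside the validation function, keeps the result available for inspection even when a later step decides to halt the run.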

Consult the documentation for the particular Data Validator flavor that you use or plan to use in your stack for detailed information about how to use it in your ZenML pipelines.