Deepchecks
How to test the data and models used in your pipelines with Deepchecks test suites
When would you want to use it?
Deepchecks works with both tabular data and computer vision data (currently in beta). For tabular data, the supported dataset format is `pandas.DataFrame` and the supported model format is `sklearn.base.ClassifierMixin`. For computer vision, the supported dataset format is `torch.utils.data.dataloader.DataLoader` and the supported model format is `torch.nn.Module`.
You should use the Deepchecks Data Validator when you need the data and/or model validation features that Deepchecks can provide, such as data integrity checks, data drift checks, model validation checks and model drift checks.
How do you deploy it?
The Deepchecks Data Validator flavor is included in the Deepchecks ZenML integration. You need to install the integration on your local machine to be able to register a Deepchecks Data Validator and add it to your stack:
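```shell
zenml integration install deepchecks -y
```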
The Data Validator stack component does not have any configuration parameters. Adding it to a stack is as simple as running e.g.:
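For example (the data validator and stack names below are placeholders; `...` stands for the other components your stack needs):

```shell
# Register the Deepchecks data validator
zenml data-validator register deepchecks_data_validator --flavor=deepchecks

# Register and set a stack with the new data validator
zenml stack register custom_stack -dv deepchecks_data_validator ... --set
```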
How do you use it?
The ZenML integration organizes the Deepchecks validation checks into four categories, based on the type and number of input parameters that they expect. This makes it easier to reason about them when you decide which tests to use in your pipeline steps:

* data integrity checks expect a single dataset as input
* data drift checks require two datasets as input: target and reference
* model validation checks require a single dataset and a mandatory model as input
* model drift checks require two datasets and a mandatory model as input
A notable characteristic of Deepchecks is that you don't need to customize the set of Deepchecks tests that are part of a test suite. Both ZenML and Deepchecks provide sane defaults that will run all available Deepchecks tests in a given category with their default conditions if a custom list of tests and conditions is not provided.
There are three ways you can use Deepchecks in your ZenML pipelines, each allowing a different level of flexibility:

* instantiate, configure and insert one of the standard Deepchecks steps shipped with ZenML into your pipelines; this is the easiest and recommended approach
* call the data validation methods provided by the Deepchecks Data Validator in your custom pipeline steps
* use the Deepchecks library directly in your custom pipeline steps and let ZenML store the results
You can visualize Deepchecks results in Jupyter notebooks or view them directly in the ZenML dashboard.
Warning! Usage in remote orchestrators
While the system binaries that Deepchecks depends on might be available on most operating systems out of the box (and are therefore not a problem with the default local orchestrator), we need to tell ZenML to add them during the containerization step when running in remote settings. Here is how:
First, create a file called `deepchecks-zenml.Dockerfile` and place it at the same level as your runner script (commonly called `run.py`). The contents of the Dockerfile are as follows:
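A sketch of the Dockerfile, assuming the official `zenmldocker/zenml` base image; pin `ZENML_VERSION` to the version you are using, and note that the system packages shown (`ffmpeg`, `libsm6`, `libxext6`, commonly needed by OpenCV) may vary for your environment:

```dockerfile
ARG ZENML_VERSION=0.20.0

# Use the ZenML base image matching your ZenML version.
FROM zenmldocker/zenml:${ZENML_VERSION} AS base

# Install the system binaries that Deepchecks depends on (via OpenCV).
RUN apt-get update
RUN apt-get install -y ffmpeg libsm6 libxext6
```

Then, point ZenML at this Dockerfile when building the pipeline's container image. A minimal sketch using ZenML's `DockerSettings` (the `buildargs` entry assumes the `ZENML_VERSION` build argument defined above):

```python
import zenml
from zenml import pipeline
from zenml.config import DockerSettings

docker_settings = DockerSettings(
    dockerfile="deepchecks-zenml.Dockerfile",
    build_options={
        "buildargs": {
            "ZENML_VERSION": f"{zenml.__version__}",
        },
    },
)

@pipeline(settings={"docker": docker_settings})
def my_pipeline() -> None:
    ...
```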
From here on, you can continue to use the Deepchecks integration as explained below.
The Deepchecks standard steps
ZenML wraps the Deepchecks functionality for tabular data in the form of four standard steps:

* a data integrity check step that runs data integrity tests on a single dataset
* a data drift check step that runs data drift tests on two datasets: target and reference
* a model validation check step that runs model performance tests on a single dataset and a mandatory model
* a model drift check step that runs model comparison/drift tests on a mandatory model and two datasets
All four standard steps behave similarly regarding the configuration parameters and returned artifacts; the most notable difference is the type and number of input artifacts they expect, as mentioned above.
This section only covers the data integrity step; usage for the other three steps can easily be inferred from it.
You can instantiate a data integrity step that runs all available Deepchecks data integrity tests with their default configuration, e.g.:
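A minimal sketch, assuming the standard step is exposed as `deepchecks_data_integrity_check_step` (the import path and parameter names may differ between ZenML versions) and that the dataset's label column is named `target`:

```python
from zenml.integrations.deepchecks.steps import (
    deepchecks_data_integrity_check_step,
)

# With no custom `check_list`, all available Deepchecks data integrity
# tests run with their default conditions.
data_integrity_check = deepchecks_data_integrity_check_step.with_options(
    parameters=dict(
        # `label` and `cat_features` describe how the data is structured;
        # the values here are assumptions for this example.
        dataset_kwargs=dict(label="target", cat_features=[]),
    ),
)
```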
The step can then be inserted into your pipeline where it can take in a dataset, e.g.:
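For example, a sketch where `data_loader` is a hypothetical upstream step that returns training and test dataframes:

```python
from zenml import pipeline

@pipeline
def data_validation_pipeline():
    df_train, df_test = data_loader()  # hypothetical data loading step
    data_integrity_check(dataset=df_train)

data_validation_pipeline()
```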
If needed, you can specify a custom list of data integrity Deepchecks tests to be executed by supplying a `check_list` argument:
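A sketch, assuming the integration exposes a `DeepchecksDataIntegrityCheck` enum (the specific enum values below are illustrative):

```python
from zenml.integrations.deepchecks.steps import (
    deepchecks_data_integrity_check_step,
)
from zenml.integrations.deepchecks.validation_checks import (
    DeepchecksDataIntegrityCheck,
)

# Inside a @pipeline function: run only the listed data integrity tests.
deepchecks_data_integrity_check_step(
    check_list=[
        DeepchecksDataIntegrityCheck.TABULAR_MIXED_DATA_TYPES,
        DeepchecksDataIntegrityCheck.TABULAR_DATA_DUPLICATES,
        DeepchecksDataIntegrityCheck.TABULAR_CONFLICTING_LABELS,
    ],
    dataset=df_train,
)
```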
For more customization, the data integrity step also allows additional keyword arguments to be supplied, which are passed transparently to the Deepchecks library (see the example after this list):

* `dataset_kwargs`: additional keyword arguments to be passed to the Deepchecks `tabular.Dataset` or `vision.VisionData` constructor. This is used to pass additional information about how the data is structured, e.g. the label column or the list of categorical features.
* `check_kwargs`: additional keyword arguments to be passed to the Deepchecks check object constructors. Arguments are grouped for each check and indexed using the full check class name or check enum value as dictionary keys.
* `run_kwargs`: additional keyword arguments to be passed to the Deepchecks Suite `run` method.
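For example, a sketch combining these arguments (column names, check choices and parameter values are illustrative):

```python
# Same imports as in the previous examples.
deepchecks_data_integrity_check_step(
    check_list=[
        DeepchecksDataIntegrityCheck.TABULAR_OUTLIER_SAMPLE_DETECTION,
        DeepchecksDataIntegrityCheck.TABULAR_STRING_LENGTH_OUT_OF_BOUNDS,
    ],
    # Passed to the Deepchecks tabular.Dataset constructor.
    dataset_kwargs=dict(label="class", cat_features=["country", "state"]),
    # Passed to the check constructors, grouped per check.
    check_kwargs={
        DeepchecksDataIntegrityCheck.TABULAR_OUTLIER_SAMPLE_DETECTION: dict(
            nearest_neighbors_percent=0.01,
            extent_parameter=3,
            condition_outlier_ratio_less_or_equal=dict(
                max_outliers_ratio=0.007,
                outlier_score_threshold=0.5,
            ),
        ),
        DeepchecksDataIntegrityCheck.TABULAR_STRING_LENGTH_OUT_OF_BOUNDS: dict(
            num_percentiles=1000,
            min_unique_values=3,
            condition_number_of_outliers_less_or_equal=dict(
                max_outliers=3,
            ),
        ),
    },
    dataset=df_train,
)
```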
The `check_kwargs` attribute can also be used to customize the conditions configured for each Deepchecks test: ZenML attaches a special meaning to check arguments that start with `condition_`, converting them into calls to the `add_condition_...` methods of the same name on the check object. For example, the step configuration shown above is equivalent to running the following Deepchecks tests:
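A sketch of the equivalent direct Deepchecks calls (same illustrative parameters as above):

```python
from deepchecks.tabular.checks import (
    OutlierSampleDetection,
    StringLengthOutOfBounds,
)

OutlierSampleDetection(
    nearest_neighbors_percent=0.01,
    extent_parameter=3,
).add_condition_outlier_ratio_less_or_equal(
    max_outliers_ratio=0.007,
    outlier_score_threshold=0.5,
)

StringLengthOutOfBounds(
    num_percentiles=1000,
    min_unique_values=3,
).add_condition_number_of_outliers_less_or_equal(
    max_outliers=3,
)
```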
The Deepchecks Data Validator
The Deepchecks Data Validator implements the same interface as all other Data Validators, so this method forces you to maintain some level of compatibility with the overall Data Validator abstraction, which guarantees an easier migration should you decide to switch to another Data Validator.
All you have to do is call the Deepchecks Data Validator methods when you need to interact with Deepchecks to run tests, e.g.:
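A sketch of a custom step, assuming the Data Validator exposes a `get_active_data_validator` classmethod and a `data_validation` method (names follow the ZenML Data Validator base interface, but may vary by version):

```python
import pandas as pd
from deepchecks.core.suite import SuiteResult
from zenml import step
from zenml.integrations.deepchecks.data_validators import (
    DeepchecksDataValidator,
)
from zenml.integrations.deepchecks.validation_checks import (
    DeepchecksDataIntegrityCheck,
)

@step
def data_integrity_check(dataset: pd.DataFrame) -> SuiteResult:
    """Custom data integrity check step using the Data Validator."""
    # Validation pre-processing (e.g. dataset preparation) can take place here.

    data_validator = DeepchecksDataValidator.get_active_data_validator()
    suite = data_validator.data_validation(
        dataset=dataset,
        check_list=[
            DeepchecksDataIntegrityCheck.TABULAR_OUTLIER_SAMPLE_DETECTION,
            DeepchecksDataIntegrityCheck.TABULAR_STRING_LENGTH_OUT_OF_BOUNDS,
        ],
    )

    # Validation post-processing (e.g. interpreting results, taking actions)
    # can happen here.

    return suite
```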
Call Deepchecks directly
You can use the Deepchecks library directly in your custom pipeline steps, and only leverage ZenML's capability of serializing, versioning and storing the `SuiteResult` objects in its Artifact Store, e.g.:
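A minimal sketch of such a step, building a custom Deepchecks suite by hand (the label column and parameter values are illustrative):

```python
import pandas as pd
from deepchecks.core.suite import SuiteResult
from deepchecks.tabular import Dataset, Suite
from deepchecks.tabular.checks import OutlierSampleDetection
from zenml import step

@step
def data_integrity_check(dataset: pd.DataFrame) -> SuiteResult:
    """Custom step that calls Deepchecks directly."""
    train_dataset = Dataset(dataset, label="class", cat_features=[])

    # Assemble a custom suite from individual checks and conditions.
    check = OutlierSampleDetection(nearest_neighbors_percent=0.01)
    check.add_condition_outlier_ratio_less_or_equal(max_outliers_ratio=0.007)
    suite = Suite("custom_data_integrity_suite", check)

    # ZenML serializes, versions and stores the returned SuiteResult
    # in the active Artifact Store.
    return suite.run(train_dataset=train_dataset)
```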
Visualizing Deepchecks Suite Results
You can view visualizations of the suites and results generated by your pipeline steps directly in the ZenML dashboard by clicking on the respective artifact in the pipeline run DAG.
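If you are running in a Jupyter notebook instead, a sketch for loading and rendering the results via the ZenML client, assuming a `visualize()` method on the step view (available in recent ZenML versions):

```python
from zenml.client import Client

def visualize_results(pipeline_name: str, step_name: str) -> None:
    """Render the Deepchecks visualization of a step's output in a notebook."""
    pipeline = Client().get_pipeline(pipeline_name)
    last_run = pipeline.last_run
    step = last_run.steps[step_name]
    step.visualize()

visualize_results("data_validation_pipeline", "data_integrity_check")
```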