Great Expectations
How to use Great Expectations to run data quality checks in your pipelines and document the results
The Great Expectations Data Validator flavor provided with the ZenML integration uses Great Expectations to run data profiling and data quality tests on the data circulated through your pipelines. The test results can be used to implement automated corrective actions in your pipelines. They are also automatically rendered into documentation for further visual interpretation and evaluation.
When would you want to use it?
Great Expectations is an open-source library that helps you keep the quality of your data in check through data testing, documentation, and profiling, and improves communication and observability. Great Expectations works with tabular data in a variety of formats and data sources, of which ZenML currently supports only pandas.DataFrame as part of its pipelines.
You should use the Great Expectations Data Validator when you need the following data validation features that are possible with Great Expectations:
Data Profiling: generates a set of validation rules (Expectations) automatically by inferring them from the properties of an input dataset.
Data Quality: runs a set of predefined or inferred validation rules (Expectations) against an in-memory dataset.
Data Docs: generate and maintain human-readable documentation of all your data validation rules, data quality checks and their results.
You should consider one of the other Data Validator flavors if you need a different set of data validation features.
How do you deploy it?
The Great Expectations Data Validator flavor is included in the Great Expectations ZenML integration. You need to install it on your local machine to be able to register a Great Expectations Data Validator and add it to your stack:
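For example, using the ZenML CLI (the exact syntax may differ between ZenML versions):

```shell
zenml integration install great_expectations -y
```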
Depending on how you configure the Great Expectations Data Validator, it can reduce or even completely eliminate the complexity associated with setting up the store backends for Great Expectations. If you're only looking for a quick and easy way of adding Great Expectations to your stack and are not concerned with the configuration details, you can simply run:
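The following sketch registers a Data Validator with the default configuration and adds it to the active stack (the component name is illustrative and the CLI flags may differ between ZenML versions):

```shell
# Register the Great Expectations data validator with default settings
zenml data-validator register ge_data_validator --flavor=great_expectations

# Add it to the active stack
zenml stack update -dv ge_data_validator
```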
If you already have a Great Expectations deployment, you can configure the Great Expectations Data Validator to reuse or even replace your current configuration. You should consider the pros and cons of every deployment use-case and choose the one that best fits your needs:
1. Let ZenML initialize and manage the Great Expectations configuration. The Artifact Store will serve as a storage backend for all the information that Great Expectations needs to persist (e.g. Expectation Suites, Validation Results). However, you will not be able to set up new Data Sources, Metadata Stores or Data Docs sites. Any changes you try to make to the configuration through code will not be persisted and will be lost when your pipeline completes or your local process exits.
2. Use ZenML with your existing Great Expectations configuration. You can tell ZenML to replace your existing Metadata Stores with the active ZenML Artifact Store by setting the configure_zenml_stores attribute in the Data Validator. The downside is that you will only be able to run pipelines locally with this setup, given that the Great Expectations configuration is a file on your local machine.
3. Migrate your existing Great Expectations configuration to ZenML. This is a compromise between 1. and 2. that allows you to continue to use your existing Data Sources, Metadata Stores and Data Docs sites even when running pipelines remotely.
Some Great Expectations CLI commands will not work well with the deployment methods that put ZenML in charge of your Great Expectations configuration (i.e. 1. and 3.). You will be required to use Python code to manage your Expectations, and you will have to edit the Jupyter notebooks generated by the Great Expectations CLI to connect them to your ZenML-managed configuration.
The default Data Validator setup plugs Great Expectations directly into the Artifact Store component that is part of the same stack. As a result, the Expectation Suites, Validation Results and Data Docs are stored in the ZenML Artifact Store and you don't have to configure Great Expectations at all; ZenML takes care of that for you:
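A minimal sketch of such a stack, assuming the default orchestrator and artifact store and illustrative component names:

```shell
# Register the data validator; ZenML manages the Great Expectations configuration
zenml data-validator register ge_data_validator --flavor=great_expectations

# Register and activate a stack that combines it with the other components
zenml stack register ge_stack -o default -a default -dv ge_data_validator --set
```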
Advanced Configuration
The Great Expectations Data Validator has a few advanced configuration attributes that might be useful for your particular use-case:
configure_zenml_stores: if set, ZenML will automatically update the Great Expectations configuration to include Metadata Stores that use the Artifact Store as a backend. If neither context_root_dir nor context_config are set, this is the default behavior. You can set this flag to use the ZenML Artifact Store as a backend for Great Expectations with any of the deployment methods described above. Note that ZenML will not copy the information in your existing Great Expectations stores (e.g. Expectation Suites, Validation Results) to the ZenML Artifact Store. This is something that you will have to do yourself.
configure_local_docs: set this flag to configure a local Data Docs site where Great Expectations docs are generated and can be visualized locally. Use this in case you don't already have a local Data Docs site in your existing Great Expectations configuration. An example of setting these attributes at registration time is sketched below.
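As a sketch, and assuming that these attributes can be passed as --key=value options when registering the component (as with other ZenML stack component settings), a registration could look like this:

```shell
zenml data-validator register ge_data_validator --flavor=great_expectations \
    --configure_zenml_stores=true \
    --configure_local_docs=true
```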
For more up-to-date information on the Great Expectations Data Validator configuration, you can have a look at the SDK docs.
How do you use it?
The core Great Expectations concepts that you should be aware of when using it within ZenML pipelines are Expectations / Expectation Suites, Validations and Data Docs.
ZenML wraps the Great Expectations' functionality in the form of two standard steps:
a Great Expectations data profiler that can be used to automatically generate Expectation Suites from an input pandas.DataFrame dataset
a Great Expectations data validator that uses an existing Expectation Suite to validate an input pandas.DataFrame dataset
You can visualize Great Expectations Suites and Results in Jupyter notebooks or view them directly in the ZenML dashboard.
The Great Expectations data profiler step
The standard Great Expectations data profiler step builds an Expectation Suite automatically by running a UserConfigurableProfiler on an input pandas.DataFrame dataset. The generated Expectation Suite is saved in the Great Expectations Expectation Store, but also returned as an ExpectationSuite artifact that is versioned and saved in the ZenML Artifact Store. The step automatically rebuilds the Data Docs.
At a minimum, the step configuration expects a name to be used for the Expectation Suite:
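A minimal sketch, assuming a recent ZenML release where the standard step is exposed as great_expectations_profiler_step (names and parameters may differ in your version):

```python
from zenml.integrations.great_expectations.steps import (
    great_expectations_profiler_step,
)

# Configure the standard profiler step with the name of the Expectation
# Suite to generate (the data_asset_name value is illustrative)
ge_profiler_step = great_expectations_profiler_step.with_options(
    parameters={
        "expectation_suite_name": "my_suite",
        "data_asset_name": "my_dataset",
    }
)
```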
The step can then be inserted into your pipeline where it can take in a pandas dataframe, e.g.:
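For illustration, assuming a hypothetical importer step that returns a pandas.DataFrame:

```python
from zenml import pipeline

@pipeline
def profiling_pipeline():
    """Sketch of a data profiling pipeline."""
    dataset = importer()  # hypothetical step returning a pandas.DataFrame
    ge_profiler_step(dataset)

profiling_pipeline()
```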
As can be seen from the step definition, the step takes in a pandas.DataFrame dataset, and it returns a Great Expectations ExpectationSuite object:
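Roughly, the signature has the following shape (a sketch; exact parameter names and defaults may differ between ZenML versions):

```python
from typing import Any, Dict, Optional

import pandas as pd
from great_expectations.core import ExpectationSuite
from zenml import step

@step
def great_expectations_profiler_step(
    dataset: pd.DataFrame,
    expectation_suite_name: str,
    data_asset_name: Optional[str] = None,
    profiler_kwargs: Optional[Dict[str, Any]] = None,
) -> ExpectationSuite:
    """Infer an Expectation Suite from the input dataset."""
    ...
```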
You can view the complete list of configuration parameters in the SDK docs.
The Great Expectations data validator step
The standard Great Expectations data validator step validates an input pandas.DataFrame dataset by running an existing Expectation Suite on it. The validation results are saved in the Great Expectations Validation Store, but also returned as a CheckpointResult artifact that is versioned and saved in the ZenML Artifact Store. The step automatically rebuilds the Data Docs.
At a minimum, the step configuration expects the name of the Expectation Suite to be used for the validation:
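A minimal sketch, mirroring the profiler step configuration above (the step name follows recent ZenML releases and may differ in your version):

```python
from zenml.integrations.great_expectations.steps import (
    great_expectations_validator_step,
)

# Configure the standard validator step with the name of an existing
# Expectation Suite (e.g. one generated by the profiler step)
ge_validator_step = great_expectations_validator_step.with_options(
    parameters={
        "expectation_suite_name": "my_suite",
        "data_asset_name": "my_dataset",
    }
)
```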
The step can then be inserted into your pipeline where it can take in a pandas dataframe and a bool flag used solely to enforce step ordering, e.g.:
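For illustration, again assuming a hypothetical importer step that returns the dataframe together with a bool flag:

```python
from zenml import pipeline

@pipeline
def validation_pipeline():
    """Sketch of a data validation pipeline."""
    dataset, condition = importer()  # hypothetical step: DataFrame + bool flag
    ge_validator_step(dataset, condition)

validation_pipeline()
```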
As can be seen from the step definition, the step takes in a pandas.DataFrame dataset and a boolean condition, and it returns a Great Expectations CheckpointResult object. The boolean condition is only used as a means of ordering steps in a pipeline (e.g. if you must force it to run only after the data profiling step generates an Expectation Suite):
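Roughly, the signature has the following shape (a sketch; exact parameter names, order and defaults may differ between ZenML and Great Expectations versions):

```python
from typing import Optional

import pandas as pd
from great_expectations.checkpoint.types.checkpoint_result import CheckpointResult
from zenml import step

@step
def great_expectations_validator_step(
    dataset: pd.DataFrame,
    condition: bool,  # only used to order this step after an upstream step
    expectation_suite_name: str = "my_suite",
    data_asset_name: Optional[str] = None,
) -> CheckpointResult:
    """Validate the dataset against an existing Expectation Suite."""
    ...
```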
You can view the complete list of configuration parameters in the SDK docs.
Call Great Expectations directly
You can use the Great Expectations library directly in your custom pipeline steps, while leveraging ZenML's capability of serializing, versioning and storing the ExpectationSuite and CheckpointResult objects in its Artifact Store. To use the Great Expectations configuration managed by ZenML while interacting with the Great Expectations library directly, you need to use the Data Context managed by ZenML instead of the default one provided by Great Expectations, e.g.:
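A minimal sketch of a custom step that uses the ZenML-managed Data Context (the get_data_context class method comes from the ZenML Great Expectations integration; the Great Expectations calls shown are illustrative and depend on your Great Expectations version):

```python
from zenml import step
from zenml.integrations.great_expectations.data_validators import (
    GreatExpectationsDataValidator,
)

@step
def custom_great_expectations_step() -> list:
    """Use the Great Expectations library directly inside a ZenML step."""
    # Use the Data Context managed by ZenML...
    context = GreatExpectationsDataValidator.get_data_context()
    # ...instead of the default one, i.e. NOT: context = ge.get_context()

    # From here on, use the Great Expectations API as usual, e.g.:
    context.build_data_docs()
    return context.list_expectation_suite_names()
```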
The same approach must be used if you are using a Great Expectations configuration managed by ZenML and are using the Jupyter notebooks generated by the Great Expectations CLI.
Visualizing Great Expectations Suites and Results
You can view visualizations of the suites and results generated by your pipeline steps directly in the ZenML dashboard by clicking on the respective artifact in the pipeline run DAG.
Alternatively, if you are running inside a Jupyter notebook, you can load and render the suites and results using the artifact.visualize() method, e.g.:
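For example, assuming a recent ZenML client API (the pipeline and step names below are illustrative):

```python
from zenml.client import Client

def visualize_results(pipeline_name: str, step_name: str) -> None:
    """Render the Expectation Suite or Checkpoint Result of the last run."""
    pipeline = Client().get_pipeline(pipeline_name)
    last_run = pipeline.last_run
    step = last_run.steps[step_name]
    # The step's output artifact holds the ExpectationSuite / CheckpointResult
    step.output.visualize()

visualize_results("validation_pipeline", "great_expectations_validator_step")
```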