Great Expectations
How to use Great Expectations to run data quality checks in your pipelines and document the results
When would you want to use it?
You should use the Great Expectations Data Validator when you need the following data validation features that are possible with Great Expectations:

Data Profiling: automatically generating validation rules (Expectation Suites) from the properties of an input dataset

Data Quality: validating a dataset against a pre-defined set of rules (an Expectation Suite)

Data Docs: generating and maintaining human-readable documentation of your validation rules and validation results
How do you deploy it?
The Great Expectations Data Validator flavor is included in the Great Expectations ZenML integration. You need to install the integration on your local machine to be able to register a Great Expectations Data Validator and add it to your stack:
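The integration can be installed with the ZenML CLI:

```shell
# Install the Great Expectations integration and its requirements
zenml integration install great_expectations -y
```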
Depending on how you configure the Great Expectations Data Validator, it can reduce or even completely eliminate the complexity associated with setting up the store backends for Great Expectations. If you're only looking for a quick and easy way of adding Great Expectations to your stack and are not concerned with the configuration details, you can simply run:
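As a sketch, the quick setup might look like the following (the component and stack names are illustrative):

```shell
# Register a Great Expectations data validator with default settings;
# ZenML will manage the Great Expectations configuration and use the
# active Artifact Store as the storage backend
zenml data-validator register ge_data_validator --flavor=great_expectations

# Add it to a stack (other component names here are illustrative)
zenml stack register custom_stack -o default -a default -dv ge_data_validator --set
```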
If you already have a Great Expectations deployment, you can configure the Great Expectations Data Validator to reuse or even replace your current configuration. You should consider the pros and cons of every deployment use-case and choose the one that best fits your needs:
1. let ZenML initialize and manage the Great Expectations configuration. The Artifact Store will serve as a storage backend for all the information that Great Expectations needs to persist (e.g. Expectation Suites, Validation Results). However, you will not be able to set up new Data Sources, Metadata Stores or Data Docs sites. Any changes you try to make to the configuration through code will not be persisted and will be lost when your pipeline completes or your local process exits.

2. use ZenML with your existing Great Expectations configuration. You can tell ZenML to replace your existing Metadata Stores with the active ZenML Artifact Store by setting the configure_zenml_stores attribute in the Data Validator. The downside is that you will only be able to run pipelines locally with this setup, given that the Great Expectations configuration is a file on your local machine.

3. migrate your existing Great Expectations configuration to ZenML. This is a compromise between 1. and 2. that allows you to continue to use your existing Data Sources, Metadata Stores and Data Docs sites even when running pipelines remotely.
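For example, deployment option 2. can be sketched by pointing the Data Validator at an existing local Data Context when registering it (the path below is illustrative):

```shell
# Reuse an existing local Great Expectations configuration by passing
# the root directory of its Data Context
zenml data-validator register ge_data_validator --flavor=great_expectations \
    --context_root_dir=/path/to/my/great_expectations
```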
Some Great Expectations CLI commands will not work well with the deployment methods that put ZenML in charge of your Great Expectations configuration (i.e. 1. and 3.). You will be required to use Python code to manage your Expectations, and you will have to edit the Jupyter notebooks generated by the Great Expectations CLI to connect them to your ZenML managed configuration.
Advanced Configuration
The Great Expectations Data Validator has a few advanced configuration attributes that might be useful for your particular use-case:
configure_zenml_stores: if set, ZenML will automatically update the Great Expectations configuration to include Metadata Stores that use the Artifact Store as a backend. If neither context_root_dir nor context_config is set, this is the default behavior. You can set this flag to use the ZenML Artifact Store as a backend for Great Expectations with any of the deployment methods described above. Note that ZenML will not copy the information in your existing Great Expectations stores (e.g. Expectation Suites, Validation Results) into the ZenML Artifact Store. This is something that you will have to do yourself.

configure_local_docs: set this flag to configure a local Data Docs site where Great Expectations docs are generated and can be visualized locally. Use this in case you don't already have a local Data Docs site in your existing Great Expectations configuration.
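As a sketch, these attributes can be passed as flags when registering the Data Validator (the component name is illustrative):

```shell
zenml data-validator register ge_data_validator --flavor=great_expectations \
    --configure_zenml_stores=true \
    --configure_local_docs=true
```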
How do you use it?
The core Great Expectations concepts that you should be aware of when using it within ZenML pipelines are Expectations / Expectation Suites, Validations and Data Docs.
ZenML wraps Great Expectations' functionality in the form of two standard steps:

a Great Expectations data profiler that can be used to automatically generate Expectation Suites from an input pandas.DataFrame dataset

a Great Expectations data validator that uses an existing Expectation Suite to validate an input pandas.DataFrame dataset
You can visualize Great Expectations Suites and Results in Jupyter notebooks or view them directly in the ZenML dashboard.
The Great Expectations data profiler step
At a minimum, the step configuration expects a name to be used for the Expectation Suite:
The step can then be inserted into your pipeline where it can take in a pandas dataframe, e.g.:
The Great Expectations data validator step
The standard Great Expectations data validator step validates an input pandas.DataFrame dataset by running an existing Expectation Suite on it. The validation results are saved in the Great Expectations Validation Store, but also returned as a CheckpointResult artifact that is versioned and saved in the ZenML Artifact Store. The step automatically rebuilds the Data Docs.
At a minimum, the step configuration expects the name of the Expectation Suite to be used for the validation:
The step can then be inserted into your pipeline where it can take in a pandas dataframe and a boolean flag used solely to enforce the step execution order (e.g. to make sure the validator only runs after an upstream step has produced the Expectation Suite), e.g.:
Call Great Expectations directly
You can use the Great Expectations library directly in your custom pipeline steps, while leveraging ZenML's capability of serializing, versioning and storing the ExpectationSuite and CheckpointResult objects in its Artifact Store. To use the Great Expectations configuration managed by ZenML while interacting with the Great Expectations library directly, you need to use the Data Context managed by ZenML instead of the default one provided by Great Expectations, e.g.:
The same approach must be used if you are working with a Great Expectations configuration managed by ZenML and the Jupyter notebooks generated by the Great Expectations CLI.
Visualizing Great Expectations Suites and Results
You can view visualizations of the suites and results generated by your pipeline steps directly in the ZenML dashboard by clicking on the respective artifact in the pipeline run DAG.