Great Expectations

How to use Great Expectations to run data quality checks in your pipelines and document the results

The Great Expectations Data Validator flavor provided with the ZenML integration uses Great Expectations to run data profiling and data quality tests on the data circulated through your pipelines. The test results can be used to implement automated corrective actions in your pipelines. They are also automatically rendered into documentation for further visual interpretation and evaluation.

When would you want to use it?

Great Expectations is an open-source library that helps you keep the quality of your data in check through data testing, documentation, and profiling, and improves communication and observability. Great Expectations works with tabular data in a variety of formats and data sources, of which ZenML currently supports only pandas.DataFrame as part of its pipelines.

You should use the Great Expectations Data Validator when you need the following data validation features that are possible with Great Expectations:

  • Data Profiling: generates a set of validation rules (Expectations) automatically by inferring them from the properties of an input dataset.

  • Data Quality: runs a set of predefined or inferred validation rules (Expectations) against an in-memory dataset.

  • Data Docs: generates and maintains human-readable documentation of all your data validation rules, data quality checks and their results.

You should consider one of the other Data Validator flavors if you need a different set of data validation features.

How do you deploy it?

The Great Expectations Data Validator flavor is included in the Great Expectations ZenML integration. You need to install it on your local machine to be able to register a Great Expectations Data Validator and add it to your stack:

zenml integration install great_expectations -y

Depending on how you configure the Great Expectations Data Validator, it can reduce or even completely eliminate the complexity associated with setting up the store backends for Great Expectations. If you're only looking for a quick and easy way of adding Great Expectations to your stack and are not concerned with the configuration details, you can simply run:

# Register the Great Expectations data validator
zenml data-validator register ge_data_validator --flavor=great_expectations

# Register and set a stack with the new data validator
zenml stack register custom_stack -dv ge_data_validator ... --set

If you already have a Great Expectations deployment, you can configure the Great Expectations Data Validator to reuse or even replace your current configuration. You should consider the pros and cons of each deployment use case and choose the one that best fits your needs:

  1. let ZenML initialize and manage the Great Expectations configuration. The Artifact Store will serve as a storage backend for all the information that Great Expectations needs to persist (e.g. Expectation Suites, Validation Results). However, you will not be able to set up new Data Sources, Metadata Stores or Data Docs sites. Any changes you try to make to the configuration through code will not be persisted and will be lost when your pipeline completes or your local process exits.

  2. use ZenML with your existing Great Expectations configuration. Point the Data Validator at your existing Great Expectations project by setting its context_root_dir attribute to the local project path. You can also tell ZenML to replace your existing Metadata Stores with the active ZenML Artifact Store by setting the configure_zenml_stores attribute in the Data Validator. The downside is that you will only be able to run pipelines locally with this setup, given that the Great Expectations configuration is a file on your local machine.

  3. migrate your existing Great Expectations configuration to ZenML, e.g. by copying it into the Data Validator's context_config attribute. This is a compromise between 1. and 2. that allows you to continue to use your existing Data Sources, Metadata Stores and Data Docs sites even when running pipelines remotely.

Some Great Expectations CLI commands will not work well with the deployment methods that put ZenML in charge of your Great Expectations configuration (i.e. 1. and 3.). You will be required to use Python code to manage your Expectations, and you will have to edit the Jupyter notebooks generated by the Great Expectations CLI to connect them to your ZenML-managed configuration.

The default Data Validator setup plugs Great Expectations directly into the Artifact Store component that is part of the same stack. As a result, the Expectation Suites, Validation Results and Data Docs are stored in the ZenML Artifact Store and you don't have to configure Great Expectations at all; ZenML takes care of that for you:

# Register the Great Expectations data validator
zenml data-validator register ge_data_validator --flavor=great_expectations

# Register and set a stack with the new data validator
zenml stack register custom_stack -dv ge_data_validator ... --set

Advanced Configuration

The Great Expectations Data Validator has a few advanced configuration attributes that might be useful for your particular use-case:

  • configure_zenml_stores: if set, ZenML will automatically update the Great Expectations configuration to include Metadata Stores that use the Artifact Store as a backend. If neither context_root_dir nor context_config are set, this is the default behavior. You can set this flag to use the ZenML Artifact Store as a backend for Great Expectations with any of the deployment methods described above. Note that ZenML will not copy the information in your existing Great Expectations stores (e.g. Expectation Suites, Validation Results) into the ZenML Artifact Store. This is something that you will have to do yourself.

  • configure_local_docs: set this flag to configure a local Data Docs site where Great Expectations docs are generated and can be visualized locally. Use this in case you don't already have a local Data Docs site in your existing Great Expectations configuration.

For more up-to-date information on the Great Expectations Data Validator configuration, you can have a look at the API docs.

How do you use it?

The core Great Expectations concepts that you should be aware of when using it within ZenML pipelines are Expectations / Expectation Suites, Validations and Data Docs.

ZenML wraps the Great Expectations functionality in the form of two standard steps:

  • a Great Expectations data profiler that can be used to automatically generate Expectation Suites from an input pandas.DataFrame dataset

  • a Great Expectations data validator that uses an existing Expectation Suite to validate an input pandas.DataFrame dataset

Outside of the pipeline workflow, you can use the ZenML Great Expectations visualizer to display the Great Expectations Data Docs pages describing the Expectation Suites and Validation Results generated by your pipelines.

You can also check out our examples pages for working examples that use the Great Expectations standard steps.

The Great Expectations data profiler step

The standard Great Expectations data profiler step builds an Expectation Suite automatically by running a UserConfigurableProfiler on an input pandas.DataFrame dataset. The generated Expectation Suite is saved in the Great Expectations Expectation Store, but also returned as an ExpectationSuite artifact that is versioned and saved in the ZenML Artifact Store. The step automatically rebuilds the Data Docs.

At a minimum, the step configuration expects a name to be used for the Expectation Suite:

from zenml.integrations.great_expectations.steps import (
    GreatExpectationsProfilerConfig,
    great_expectations_profiler_step,
)

# instantiate a builtin Great Expectations data profiling step
ge_profiler_config = GreatExpectationsProfilerConfig(
    expectation_suite_name="breast_cancer_suite",
    data_asset_name="breast_cancer_ref_df",
)
ge_profiler_step = great_expectations_profiler_step(
    step_name="ge_profiler_step",
    config=ge_profiler_config,
)

The step can then be inserted into your pipeline where it can take in a pandas dataframe, e.g.:

from zenml.integrations.constants import GREAT_EXPECTATIONS, SKLEARN
from zenml.pipelines import pipeline

@pipeline(required_integrations=[SKLEARN, GREAT_EXPECTATIONS])
def profiling_pipeline(
    importer, profiler
):
    """Data profiling pipeline for Great Expectations.

    The pipeline imports a reference dataset from a source then uses the builtin
    Great Expectations profiler step to generate an expectation suite (i.e.
    validation rules) inferred from the schema and statistical properties of the
    reference dataset.

    Args:
        importer: reference data importer step
        profiler: data profiler step
    """
    dataset, _ = importer()
    profiler(dataset)

profiling_pipeline(
    importer=importer(),
    profiler=ge_profiler_step,
).run()

As can be seen from the step definition, the step takes in a pandas.DataFrame dataset and it returns a Great Expectations ExpectationSuite object:

class GreatExpectationsProfilerStep(BaseStep):
    """Standard Great Expectations profiling step implementation.
    """

    def entrypoint(
        self,
        dataset: pd.DataFrame,
        config: GreatExpectationsProfilerConfig,
    ) -> ExpectationSuite:
        ...

You can view the complete list of configuration parameters in the API docs.
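
The importer step referenced in the profiling pipeline above is not part of the integration and has to be supplied by you. The following is a minimal sketch, assuming the older Output-based step API and the scikit-learn breast cancer dataset hinted at by the suite names used in the example; the step and output names are illustrative:

import pandas as pd
from sklearn.datasets import load_breast_cancer

from zenml.steps import Output, step

@step
def importer() -> Output(dataset=pd.DataFrame, condition=bool):
    """Illustrative importer step that loads the breast cancer dataset.

    The boolean output carries no data; it only exists so that downstream
    steps can be ordered after the importer (see the data validator step
    below).
    """
    df = load_breast_cancer(as_frame=True).frame
    return df, True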

The Great Expectations data validator step

The standard Great Expectations data validator step validates an input pandas.DataFrame dataset by running an existing Expectation Suite on it. The validation results are saved in the Great Expectations Validation Store, but also returned as a CheckpointResult artifact that is versioned and saved in the ZenML Artifact Store. The step automatically rebuilds the Data Docs.

At a minimum, the step configuration expects the name of the Expectation Suite to be used for the validation:

from zenml.integrations.great_expectations.steps import (
    GreatExpectationsValidatorConfig,
    great_expectations_validator_step,
)

# instantiate a builtin Great Expectations data validation step
ge_validator_config = GreatExpectationsValidatorConfig(
    expectation_suite_name="breast_cancer_suite",
    data_asset_name="breast_cancer_test_df",
)
ge_validator_step = great_expectations_validator_step(
    step_name="ge_validator_step",
    config=ge_validator_config,
)

The step can then be inserted into your pipeline where it can take in a pandas dataframe and a boolean flag used solely to enforce step ordering, e.g.:

@pipeline(required_integrations=[SKLEARN, GREAT_EXPECTATIONS])
def validation_pipeline(
    importer, validator, checker
):
    """Data validation pipeline for Great Expectations.

    The pipeline imports a test data from a source, then uses the builtin
    Great Expectations data validation step to validate the dataset against
    the expectation suite generated in the profiling pipeline.

    Args:
        importer: test data importer step
        validator: dataset validation step
        checker: checks the validation results
    """
    dataset, condition = importer()
    results = validator(dataset, condition)
    message = checker(results)

validation_pipeline(
    importer=importer(),
    validator=ge_validator_step,
    checker=analyze_result(),
).run()

As can be seen from the step definition, the step takes in a pandas.DataFrame dataset and a boolean condition and it returns a Great Expectations CheckpointResult object. The boolean condition is only used as a means of ordering steps in a pipeline (e.g. if you must force it to run only after the data profiling step generates an Expectation Suite):

class GreatExpectationsValidatorStep(BaseStep):
    """Standard Great Expectations data validation step implementation.
    """

    def entrypoint(
        self,
        dataset: pd.DataFrame,
        condition: bool,
        config: GreatExpectationsValidatorConfig,
    ) -> CheckpointResult:
        ...

You can view the complete list of configuration parameters in the API docs.
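
The checker step (analyze_result) used in the validation pipeline above is also not one of the standard steps and must be written by you. Below is a minimal sketch, assuming a Great Expectations version in which CheckpointResult exposes an aggregated success flag; the step name and messages are illustrative:

from great_expectations.checkpoint.types.checkpoint_result import CheckpointResult

from zenml.steps import step

@step
def analyze_result(result: CheckpointResult) -> str:
    """Illustrative checker step that reacts to the validation outcome."""
    # result.success aggregates the outcome of all validations run by the
    # checkpoint; use it to trigger automated corrective actions if needed.
    if result.success:
        message = "Data validation passed."
    else:
        message = "Data validation failed."
    print(message)
    return message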

Call Great Expectations directly

You can use the Great Expectations library directly in your custom pipeline steps, while leveraging ZenML's capability of serializing, versioning and storing the ExpectationSuite and CheckpointResult objects in its Artifact Store. To use the Great Expectations configuration managed by ZenML while interacting with the Great Expectations library directly, you need to use the Data Context managed by ZenML instead of the default one provided by Great Expectations, e.g.:

import great_expectations as ge
from zenml.integrations.great_expectations.data_validators import (
    GreatExpectationsDataValidator
)

import pandas as pd
from great_expectations.core import ExpectationSuite
from great_expectations.core.expectation_configuration import ExpectationConfiguration
from zenml.steps import step

@step
def create_custom_expectation_suite(
) -> ExpectationSuite:
    """Custom step that creates an Expectation Suite

    Returns:
        An Expectation Suite
    """
    context = GreatExpectationsDataValidator.get_data_context()
    # instead of:
    # context = ge.get_context()

    expectation_suite_name = "custom_suite"
    suite = context.create_expectation_suite(
        expectation_suite_name=expectation_suite_name
    )
    expectation_configuration = ExpectationConfiguration(...)
    suite.add_expectation(expectation_configuration=expectation_configuration)
    ...
    context.save_expectation_suite(
        expectation_suite=suite,
        expectation_suite_name=expectation_suite_name,
    )
    context.build_data_docs()
    return suite
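
For reference, each individual rule added to the suite in the example above is an ExpectationConfiguration. A hypothetical configuration that requires a column to contain no null values could look like the following; the column name is an assumption:

from great_expectations.core.expectation_configuration import ExpectationConfiguration

# Hypothetical expectation: the "mean radius" column (an assumed column name)
# must not contain any null values.
expectation_configuration = ExpectationConfiguration(
    expectation_type="expect_column_values_to_not_be_null",
    kwargs={"column": "mean radius"},
)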

The same approach must be used if you are using a Great Expectations configuration managed by ZenML and are using the Jupyter notebooks generated by the Great Expectations CLI.

The Great Expectations ZenML Visualizer

In the post-execution workflow, you can view the Expectation Suites and Validation Results generated and returned by your pipeline steps in the Great Expectations Data Docs by means of the ZenML Great Expectations Visualizer, e.g.:

from zenml.integrations.great_expectations.visualizers.ge_visualizer import (
    GreatExpectationsVisualizer,
)
from zenml.repository import Repository

def visualize_results(pipeline_name: str, step_name: str) -> None:
    repo = Repository()
    pipeline = repo.get_pipeline(pipeline_name)
    last_run = pipeline.runs[-1]
    validation_step = last_run.get_step(step=step_name)
    GreatExpectationsVisualizer().visualize(validation_step)

if __name__ == "__main__":
    visualize_results("validation_pipeline", "profiler")
    visualize_results("validation_pipeline", "train_validator")
    visualize_results("validation_pipeline", "test_validator")

The Data Docs pages will be opened as tabs in your browser.
