Whylogs
How to collect and visualize statistics to track changes in your pipelines' data with whylogs/WhyLabs profiling.
When would you want to use it?
You should use the whylogs/WhyLabs Data Validator when you need the following data validation features that are possible with whylogs and WhyLabs:
Data Quality: validate data quality in model inputs or in a data pipeline
Data Drift: detect data drift in model input features
Model Drift: detect training-serving skew, concept drift, and model performance degradation
How do you deploy it?
The whylogs Data Validator flavor is included in the whylogs ZenML integration. You need to install this integration on your local machine to be able to register a whylogs Data Validator and add it to your stack:
If you don't need to connect to the WhyLabs platform to upload and store the generated whylogs data profiles, the Data Validator stack component does not require any configuration parameters. Adding it to a stack is as simple as running e.g.:
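The install and registration commands can also be scripted from Python. The sketch below only builds and runs zenml CLI invocations. The component name whylogs_data_validator, the stack name custom_stack, and the choice of the default orchestrator and artifact store are illustrative assumptions, so adapt them to your own setup.

```python
import subprocess

# Assumed names: adjust them to your own setup.
VALIDATOR_NAME = "whylogs_data_validator"
STACK_NAME = "custom_stack"


def setup_commands(validator=VALIDATOR_NAME, stack=STACK_NAME):
    """Return the zenml CLI invocations for a whylogs setup without WhyLabs."""
    return [
        # Install the whylogs integration and its dependencies.
        ["zenml", "integration", "install", "whylogs", "-y"],
        # Register a whylogs Data Validator (no configuration parameters).
        ["zenml", "data-validator", "register", validator, "--flavor=whylogs"],
        # Register a stack that contains the validator and make it active.
        # Assumes the built-in "default" orchestrator and artifact store.
        ["zenml", "stack", "register", stack,
         "-o", "default", "-a", "default", "-dv", validator, "--set"],
    ]


def run_setup():
    """Run each command in order, stopping at the first failure."""
    for command in setup_commands():
        subprocess.run(command, check=True)
```

Calling run_setup() once from a setup script has the same effect as typing the three commands in a terminal.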
If you do want to upload profiles to WhyLabs, you can register the whylogs Data Validator with WhyLabs logging capabilities as follows:
You'll also need to enable WhyLabs logging for your custom pipeline steps if you want to upload the whylogs data profiles that they return as artifacts to the WhyLabs platform. This is enabled by default for the standard whylogs step. For custom steps, you can enable WhyLabs logging by setting the upload_to_whylabs parameter to True in the step configuration, e.g.:
How do you use it?
The whylogs profiling functions take in a pandas.DataFrame dataset and generate a DatasetProfileView object containing all the relevant information extracted from the dataset.
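As a minimal sketch of that flow outside of any pipeline, the helpers below profile a DataFrame with whylogs and turn the resulting view into a table. The function names profile_dataframe and summarize are hypothetical, and the import is deferred so the snippet loads even where whylogs is not installed.

```python
def profile_dataframe(df):
    """Profile a pandas.DataFrame and return a whylogs DatasetProfileView."""
    # Deferred import: requires the whylogs package to be installed.
    import whylogs as why

    results = why.log(pandas=df)  # compute statistics for every column
    return results.view()  # DatasetProfileView


def summarize(df):
    """Return the profile's summary metrics as a pandas.DataFrame."""
    return profile_dataframe(df).to_pandas()
```

With whylogs and pandas installed, summarize(my_dataframe) gives a quick look at what a profile contains before wiring it into a ZenML step.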
There are three ways you can use whylogs in your ZenML pipelines that allow different levels of flexibility:
The whylogs standard step
ZenML wraps the whylogs/WhyLabs functionality in a standard WhylogsProfilerStep step. The only field in the step config is a dataset_timestamp attribute, which is relevant only when you upload profiles to WhyLabs: WhyLabs uses this field to group and merge profiles belonging to the same dataset. The helper function get_whylogs_profiler_step, used to create an instance of this standard step, takes an optional dataset_id parameter. It is likewise used only for WhyLabs uploads, where it identifies the model in whose context the profile is uploaded, e.g.:
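A minimal sketch of creating two instances of the standard step, one per dataset. The wrapper function make_profiler_steps is hypothetical, and the dataset_id values "model-2" and "model-3" are placeholders for your own WhyLabs model IDs. The import is deferred so the snippet loads without the integration installed.

```python
def make_profiler_steps():
    """Create one standard whylogs profiler step per dataset."""
    # Deferred import: requires the ZenML whylogs integration.
    from zenml.integrations.whylogs.steps import get_whylogs_profiler_step

    # Placeholder IDs: replace them with your WhyLabs model IDs.
    train_data_profiler = get_whylogs_profiler_step(dataset_id="model-2")
    test_data_profiler = get_whylogs_profiler_step(dataset_id="model-3")
    return train_data_profiler, test_data_profiler
```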
The step can then be inserted into your pipeline, where it can take in a pandas.DataFrame dataset, e.g.:
The whylogs Data Validator
The whylogs Data Validator implements the same interface as all other Data Validators. Using it keeps your code compatible with the overall Data Validator abstraction, which makes migration easier if you decide to switch to another Data Validator.
All you have to do is call the whylogs Data Validator methods when you need to interact with whylogs to generate data profiles. You may optionally enable WhyLabs logging to automatically upload the returned whylogs profile to WhyLabs, e.g.:
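A minimal sketch of the profiling call itself, assuming the active stack contains a whylogs Data Validator. The wrapper function profile_with_data_validator is hypothetical. In a pipeline you would call it from inside a custom step, and enabling WhyLabs upload is configured separately on that step, which is not shown here.

```python
def profile_with_data_validator(dataset):
    """Profile a pandas.DataFrame through the active whylogs Data Validator."""
    # Deferred import: requires the ZenML whylogs integration.
    from zenml.integrations.whylogs.data_validators.whylogs_data_validator import (
        WhylogsDataValidator,
    )

    # Fails if the active stack has no whylogs Data Validator registered.
    data_validator = WhylogsDataValidator.get_active_data_validator()
    # Returns a whylogs DatasetProfileView for the given dataset.
    return data_validator.data_profiling(dataset)
```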
Call whylogs directly
You can use the whylogs library directly in your custom pipeline steps, and only leverage ZenML's capability of serializing, versioning and storing the DatasetProfileView objects in its Artifact Store. You may optionally enable WhyLabs logging to automatically upload the returned whylogs profile to WhyLabs, e.g.:
Visualizing whylogs Profiles
You can view visualizations of the whylogs profiles generated by your pipeline steps directly in the ZenML dashboard by clicking on the respective artifact in the pipeline run DAG.