What is a pipeline?¶
A ZenML pipeline is represented by a standard config written in YAML. A sample config looks like this:
version: '1'
datasource:
  id: acaca6c4-bbbe-49a9-bf0f-4b183ee2d9fc
  name: My CSV Datasource 11379
  source: zenml.core.datasources.csv_datasource.CSVDatasource@zenml_0.0.1rc2
environment:
  artifact_store: /path/to/local_store
  backends:
    orchestrator:
      args: {}
      type: local
    processing:
      args: {}
      type: local
    training:
      args: {}
      type: local
  enable_cache: true
  metadata:
    args:
      uri: /path/to/zenml/.zenml/local_store/metadata.db
    type: sqlite
  pipeline_name: training_Experiment 1350_4c99bb8f-098d-49f9-a20d-1d37aee67d6c
steps:
  data:
    args:
      path: ...
      schema: null
    source: zenml.core.steps.data.csv_data_step.CSVDataStep@zenml_0.0.1rc2
  preprocessing:
    args:
      ...
    source: zenml.core.steps.preprocesser.standard_preprocesser.standard_preprocesser.StandardPreprocesser@zenml_0.0.1rc2
  split:
    args:
      ...
    source: zenml.core.steps.split.random_split.RandomSplit@zenml_0.0.1rc2
  training:
    args:
      batch_size: 8
      dropout_chance: 0.2
      epochs: 20
      hidden_activation: relu
      hidden_layers: null
      last_activation: sigmoid
      loss: binary_crossentropy
      lr: 0.001
      metrics:
      - accuracy
      output_units: 1
    source: zenml.core.steps.trainer.feedforward_trainer.FeedForwardTrainer@zenml_0.0.1rc2
  evaluation:
    args:
      ...
    source: zenml.core.steps.evaluator.tfma_evaluator.TFMAEvaluator@zenml_0.0.1rc2
The config above can be split into four distinct keys:
version
: The version of the YAML standard, to maintain backwards compatibility.
datasource
: The datasource details, representing the data flowing through the pipeline.
environment
: The configuration of the environment in which the pipeline is executed, including details like the Artifact Store, the Metadata Store, and the Backends used.
steps
: The steps that define what processing the data underwent as it moved through the pipeline.
The most interesting key is perhaps the last one, i.e., steps. Each Step contains a source sub-key that points to a git-sha-pinned version of the file in which it resides. It also contains all the parameters used in the constructor of that class, so they are tracked and can be used later for comparability and repeatability. Read more about steps here.
Standard Pipelines¶
Currently, there are three standard pipeline types, each meant for a different production ML use-case:
TrainingPipeline: To train an ML model and deploy it.
DataPipeline: To create an immutable snapshot/version of a datasource.
BatchInferencePipeline: To run batch inference with new data on a trained ML model.
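As a rough sketch of how these are used, the snippet below assembles a TrainingPipeline from standard steps in the style of the ZenML 0.x core API. The datasource path, feature names, and parameter values are placeholders, not a definitive recipe; running such a pipeline is what produces a config like the YAML shown above.
from zenml.core.datasources.csv_datasource import CSVDatasource
from zenml.core.pipelines.training_pipeline import TrainingPipeline
from zenml.core.steps.split.random_split import RandomSplit
from zenml.core.steps.preprocesser.standard_preprocesser.standard_preprocesser import StandardPreprocesser
from zenml.core.steps.trainer.feedforward_trainer import FeedForwardTrainer
from zenml.core.steps.evaluator.tfma_evaluator import TFMAEvaluator

training_pipeline = TrainingPipeline(name='Experiment 1350')

# Register a datasource (path and name are placeholders)
training_pipeline.add_datasource(
    CSVDatasource(name='My CSV Datasource', path='/path/to/data.csv'))

# Add the standard steps that end up under the `steps` key of the config
training_pipeline.add_split(RandomSplit(split_map={'train': 0.7, 'eval': 0.3}))
training_pipeline.add_preprocesser(
    StandardPreprocesser(features=['feature_1'], labels=['label']))
training_pipeline.add_trainer(FeedForwardTrainer(
    loss='binary_crossentropy', last_activation='sigmoid', output_units=1,
    metrics=['accuracy'], epochs=20))
training_pipeline.add_evaluator(TFMAEvaluator(
    slices=[['label']],
    metrics={'label': ['binary_crossentropy', 'binary_accuracy']}))

# Run locally
training_pipeline.run()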
Immutability and Reusing Pipeline Logic¶
After pipelines are run, they are marked as immutable. This means that the internal Steps of these pipelines can no longer be changed.
However, a common pattern in Machine Learning is to re-use logical components across the entire lifecycle. That is, after all, the whole purpose of creating steps in the first place.
In order to re-use logic from another pipeline in ZenML, it is as simple as executing:
from zenml.core.repo.repo import Repository
# Get a reference in code to the current repo
repo = Repository()
pipeline_a = repo.get_pipeline_by_name('Pipeline A')
# Make a copy with a unique name
pipeline_b = pipeline_a.copy(new_name='Pipeline B')
# Change steps, metadata store, artifact store, backends etc freely.
Ensuring that pipelines are immutable once they have run is crucial to maintaining reproducibility in the ZenML design. The copy() paradigm gives you the freedom to re-use steps with ease while keeping reproducibility intact.
Creating custom pipelines¶
ZenML is designed so that the starting point is to create custom Steps and use them in the Standard Pipelines defined above. However, there will always be use-cases that do not match these opinionated Standard Pipelines, so one can always create custom pipelines with arbitrary Steps.
The mechanism to create a custom Pipeline will be published in more detail soon in this space. As a teaser, it will involve overriding the BasePipeline class. However, the details of this are currently being worked out and will be made available in future releases.
If you need this functionality earlier, then ping us on our Slack or create an issue on GitHub so that we know about it!
Caching¶
The copy() paradigm also helps with re-using code across pipelines. For example, if only the TrainerStep is changed in pipeline_b above, then the corresponding pipeline_b run will skip splitting and preprocessing and re-use all the artifacts already produced by pipeline_a, as the sketch below illustrates.
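This is a hypothetical continuation of the copy() example above, assuming the 0.x TrainingPipeline API; only the trainer is swapped, so the earlier artifacts can be served from the cache.
from zenml.core.steps.trainer.feedforward_trainer import FeedForwardTrainer

# Hypothetical continuation: only the trainer changes, so the data,
# split and preprocessing artifacts of pipeline_a are re-used from the cache
pipeline_b.add_trainer(FeedForwardTrainer(
    loss='binary_crossentropy', last_activation='sigmoid', output_units=1,
    metrics=['accuracy'], epochs=50, lr=0.0001))
pipeline_b.run()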
Read more about caching here.
Repository functionalities¶
You can get all your pipelines using the Repository class:
from zenml.core.repo.repo import Repository
repo: Repository = Repository.get_instance()
# Get all names of pipelines in repo
pipeline_names = repo.get_pipeline_names()
# Load previously run pipelines
pipelines = repo.get_pipelines()
# Get pipelines by datasource (ds is a previously obtained datasource object)
pipelines = repo.get_pipelines_by_datasource(ds)
# Get pipelines by type
train_pipelines = repo.get_pipelines_by_type(type_filter=['train'])
Relation to Tensorflow Extended (TFX) pipelines¶
A ZenML pipeline in the current version is a higher-level abstraction of an opinionated TFX pipeline. ZenML Steps are in turn higher-level abstractions of TFX components.
To be clear, currently ZenML is an easier way of defining and running TFX pipelines. However, unlike TFX, ZenML treats pipelines as first-class citizens. We will elaborate more on the difference in this space, but for now, if you are coming from writing your own TFX pipelines, our quickstart illustrates the difference well.