Creating your first pipeline

This is the notebook version of our Quickstart! Our goal here is to give you a first hands-on experience with ZenML and a brief overview of some of its basic functionality.

In this example, we will create a simple pipeline featuring a local CSV dataset and a basic feedforward neural network, and run it in our local environment. If you want to run this notebook in an interactive environment, feel free to run it in Google Colab.

First things first…

You can install ZenML through:

pip install zenml

Once the installation is completed, you can go ahead and create your first ZenML repository for your project. As ZenML repositories are built on top of Git repositories, you can create yours in an empty directory of your choice through:

git init
zenml init

Now, the setup is completed. For the next steps, just make sure that you are executing the code within your ZenML repository.
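
If you are unsure whether your notebook or script is actually running inside the repository, a quick check can help. The snippet below is a minimal sketch that only looks for the .git folder created by git init. The helper name find_git_root is our own and is not part of ZenML.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `.git`.

    Hypothetical helper for illustration; it does not inspect any ZenML files.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / ".git").exists():
            return candidate
    return None


if __name__ == "__main__":
    root = find_git_root(Path.cwd())
    if root is None:
        print("No Git repository found; run `git init` and `zenml init` first.")
    else:
        print(f"Working inside the Git repository at {root}")
```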

Creating the pipeline

Once you set everything up, we can start our tutorial. The first step is to create an instance of a pipeline. ZenML comes equipped with different types of pipelines, but for this example we will be using the most classic one, namely a TrainingPipeline.

While creating your pipeline, you can give it a name and use that name to reference the pipeline later.

from zenml.core.pipelines.training_pipeline import TrainingPipeline

training_pipeline = TrainingPipeline(name='QuickstartPipeline')

In a ZenML TrainingPipeline, there is a fixed set of steps representing the processes that can be found in any machine learning workflow. These steps include:

  1. Split: responsible for splitting your dataset into smaller datasets such as train, eval, etc.

  2. Transform: responsible for the preprocessing of your data

  3. Train: responsible for the model creation and training process

  4. Evaluate: responsible for the evaluation of your results
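
To get a feeling for what a fixed set of steps means, here is a toy sketch in plain Python. ToyPipeline, add_step and STEP_ORDER are hypothetical names made up for this illustration and do not reflect how ZenML implements its TrainingPipeline. The point is only that steps can be configured in any order while execution always follows the same sequence.

```python
from typing import Any, Callable, Dict, List

# The four stages listed above, in the order they are executed.
STEP_ORDER = ("split", "transform", "train", "evaluate")


class ToyPipeline:
    """Hypothetical stand-in for a pipeline with a fixed step order."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.steps: Dict[str, Callable[[Any], Any]] = {}

    def add_step(self, kind: str, fn: Callable[[Any], Any]) -> None:
        if kind not in STEP_ORDER:
            raise ValueError(f"unknown step kind: {kind!r}")
        self.steps[kind] = fn

    def run(self, data: Any) -> List[str]:
        missing = [kind for kind in STEP_ORDER if kind not in self.steps]
        if missing:
            raise RuntimeError(f"missing steps: {missing}")
        executed = []
        for kind in STEP_ORDER:
            # Each step receives the output of the previous one.
            data = self.steps[kind](data)
            executed.append(kind)
        return executed


pipeline = ToyPipeline(name="toy")
# Steps are registered out of order on purpose; run() still follows STEP_ORDER.
for kind in ("evaluate", "train", "split", "transform"):
    pipeline.add_step(kind, lambda data: data)
print(pipeline.run(data=[1, 2, 3]))
```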

Creating a datasource

However, before we dive into the aforementioned steps, let’s briefly talk about our dataset.

For this quickstart, we will be using the Pima Indians Diabetes Dataset. We will train a model on it that aims to predict whether a person has diabetes based on diagnostic measures.

In order to be able to use this dataset (which is currently in CSV format) in your ZenML pipeline, we first need to create a datasource. ZenML has built-in support for various types of datasources and for this example you can use the CSVDatasource. All you need to provide is a name for the datasource and the path to the CSV file.

from zenml.core.datasources.csv_datasource import CSVDatasource

ds = CSVDatasource(name='Pima Indians Diabetes Dataset', 
                   path='gs://zenml_quickstart/diabetes.csv')

Once this is done, you have a tracked and versioned datasource that you can use in any pipeline. Go ahead and add it to your pipeline.

training_pipeline.add_datasource(ds)
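
Conceptually, a CSV datasource boils down to a header row plus data rows that become named columns. The sketch below shows that idea with the standard library only. The column names are the ones used later in this tutorial, the two rows are made up for illustration and are not taken from the real dataset, and load_rows is a hypothetical helper, not a ZenML function.

```python
import csv
import io

# Made-up rows for illustration only; they are NOT real records from the dataset.
SAMPLE_CSV = """times_pregnant,pgc,dbp,tst,insulin,bmi,pedigree,age,has_diabetes
1,100,70,20,80,25.0,0.30,30,0
2,150,80,30,120,33.5,0.60,45,1
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header names."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


rows = load_rows(SAMPLE_CSV)
print(len(rows), "rows with columns:", list(rows[0].keys()))
```

Note that the csv module returns every value as a string, so a real ingestion step would still need to infer or apply a type for each column.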

Configuring the split

Now, let us get back to the four essential steps, the first of which is Split.

For the sake of simplicity in this tutorial, we will be using a completely random 70-30 split into a train and evaluation dataset.

from zenml.core.steps.split.random_split import RandomSplit

training_pipeline.add_split(RandomSplit(split_map={'train': 0.7, 
                                                   'eval': 0.3}))

Keep in mind that a more complicated example might require a different splitting strategy. For these cases, you can use the other built-in split configurations ZenML offers or even implement your own custom logic in the split step.
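
If you are curious what a completely random 70-30 split amounts to, here is a minimal sketch in plain Python. It shuffles the rows and cuts them according to a split_map like the one above. The function random_split and its seed parameter are our own assumptions for illustration; the built-in RandomSplit step may work differently internally.

```python
import random
from typing import Dict, List, Sequence


def random_split(rows: Sequence, split_map: Dict[str, float], seed: int = 0) -> Dict[str, List]:
    """Shuffle `rows` and cut them into named parts according to `split_map`."""
    if abs(sum(split_map.values()) - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    splits = {}
    names = list(split_map)
    start = 0
    for index, name in enumerate(names):
        if index == len(names) - 1:
            end = len(shuffled)  # the last split takes whatever is left
        else:
            end = start + round(split_map[name] * len(shuffled))
        splits[name] = shuffled[start:end]
        start = end
    return splits


splits = random_split(range(1000), {"train": 0.7, "eval": 0.3})
print({name: len(part) for name, part in splits.items()})
# prints {'train': 700, 'eval': 300}
```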

Handling data preprocessing

The next step is to configure Transform, the data preprocessing step.

For this example, we will use the built-in StandardPreprocesser. It handles the feature selection and has sane default preprocessing behaviour for each data type, such as standardization for numerical features or vocabularization for non-numerical features.

In order to use it, you need to provide a list of feature names and a list of label names. Moreover, if you do not want it to use the default transformation for a feature, you can overwrite it with a different preprocessing method, as we do for the label in this example.

from zenml.core.steps.preprocesser.standard_preprocesser.standard_preprocesser import StandardPreprocesser

training_pipeline.add_preprocesser(
    StandardPreprocesser(
        features=['times_pregnant', 
                  'pgc', 
                  'dbp', 
                  'tst', 
                  'insulin', 
                  'bmi',
                  'pedigree', 
                  'age'],
        labels=['has_diabetes'],
        overwrite={'has_diabetes': {
            'transform': [{'method': 'no_transform', 
                           'parameters': {}}]}}))

Much like the splitting process, you might run into cases where the capabilities of the StandardPreprocesser do not match your task at hand. In this case, you can create your own custom preprocessing step, but we will go into that topic in a different tutorial.
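
To make the two default behaviours mentioned above a bit more tangible, the sketch below shows standardization (z-scoring) of a numerical feature and a simple vocabulary lookup for a non-numerical one. It is plain Python under our own assumptions: standardize and build_vocabulary are hypothetical helpers, the values are made up, and the built-in step may differ in details such as how the vocabulary is ordered.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant feature carries no signal
    return [(value - mean) / std for value in values]


def build_vocabulary(tokens):
    """Map each distinct token to an integer id in order of first appearance."""
    vocabulary = {}
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
    return vocabulary


ages = [20.0, 30.0, 40.0]                      # made-up numerical feature
levels = ["low", "high", "low", "medium"]      # made-up non-numerical feature

scaled = standardize(ages)
vocabulary = build_vocabulary(levels)
encoded = [vocabulary[token] for token in levels]

print(scaled)
print(vocabulary, encoded)
```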

Training your model

As the data is now ready, we can move on to Train, the model creation and training step.

For this quickstart, we will be using the simple built-in FeedForwardTrainer step. As the name suggests, it represents a feedforward neural network, which is configurable through a set of variables.

from zenml.core.steps.trainer.tensorflow_trainers.tf_ff_trainer import FeedForwardTrainer

training_pipeline.add_trainer(FeedForwardTrainer(loss='binary_crossentropy',
                                                 last_activation='sigmoid',
                                                 output_units=1,
                                                 metrics=['accuracy'],
                                                 epochs=20))

Of course, not every machine learning problem is solvable by a simple feedforward neural network, and most of the time a problem will require a model tailored to it. That is why we created an interface where users can implement their own custom models and integrate them in a trainer step. However, this approach is not within the scope of this tutorial; you can learn more about it in our docs and the upcoming tutorials.
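
Coming back to the configuration used above: with last_activation='sigmoid' and loss='binary_crossentropy', the single output unit produces a probability, and the loss penalises confident wrong predictions. The plain-Python sketch below spells out those two formulas. The function names, the eps clipping guard and the sample numbers are ours for illustration; this is not the trainer's actual code.

```python
import math


def sigmoid(z):
    """Squash a raw score into (0, 1), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


raw_scores = [2.0, -1.0, 0.0]   # made-up outputs of the last layer before activation
labels = [1, 0, 1]              # made-up ground truth

probabilities = [sigmoid(score) for score in raw_scores]
print(probabilities)
print(binary_crossentropy(labels, probabilities))
```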

Evaluation of the results

The last step to configure in our pipeline is Evaluate.

For this example, we will be using the built-in TFMAEvaluator, which uses TensorFlow Model Analysis to compute metrics based on your results (possibly within slices).

from zenml.core.steps.evaluator.tfma_evaluator import TFMAEvaluator

training_pipeline.add_evaluator(
    TFMAEvaluator(slices=[['has_diabetes']],
                  metrics={'has_diabetes': ['binary_crossentropy',
                                            'binary_accuracy']}))
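
If the idea of computing metrics within slices is new to you, the following sketch may help. It groups made-up prediction records by the value of a slicing column and computes the binary accuracy per group. This is plain Python under our own assumptions: sliced_accuracy is a hypothetical helper and is not how TensorFlow Model Analysis is implemented.

```python
from collections import defaultdict


def sliced_accuracy(records, slicing_column, label_column, threshold=0.5):
    """Return {slice value: accuracy} for records with a 'prediction' probability."""
    hits_per_slice = defaultdict(list)
    for record in records:
        predicted = 1 if record["prediction"] >= threshold else 0
        hits_per_slice[record[slicing_column]].append(predicted == record[label_column])
    return {value: sum(hits) / len(hits) for value, hits in hits_per_slice.items()}


# Made-up records for illustration only.
records = [
    {"has_diabetes": 0, "prediction": 0.2},
    {"has_diabetes": 0, "prediction": 0.7},
    {"has_diabetes": 1, "prediction": 0.9},
    {"has_diabetes": 1, "prediction": 0.6},
]

result = sliced_accuracy(records, slicing_column="has_diabetes", label_column="has_diabetes")
print(result)
# prints {0: 0.5, 1: 1.0}
```

Slicing by the label itself, as in the configuration above, tells you how well the model does on each class separately, which an overall accuracy number would hide.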

Running your pipeline

Now that everything is set, go ahead and run the pipeline, which executes each of the configured steps.

training_pipeline.run()

As the pipeline executes, you should see logs informing you about each step along the way. In more detail, you should first see that your dataset is ingested through the component DataGen and then split by the component SplitGen. Afterwards, data preprocessing takes place with the component Transform, which leads to the main training component Trainer. Ultimately, the results are evaluated by the component Evaluator.

Post-training functionalities

Once the training pipeline is finished, you can check the outputs of your pipeline in different ways.

Dataset

As the data is now ingested, you can go ahead and take a peek into your dataset. You can achieve this by simply getting the datasources registered to your repository and calling the method sample_data.

from zenml.core.repo.repo import Repository

repo = Repository.get_instance()
datasources = repo.get_datasources()

datasources[0].sample_data()

Statistics

Furthermore, you can check the statistics which are yielded by your datasource and split configuration through the method view_statistics. By using the magic flag, we can even achieve this right here in this notebook.

training_pipeline.view_statistics(magic=True)
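
As a rough idea of what per-split statistics look like, the sketch below computes a few summary numbers for one numerical column in each split. It is a plain-Python illustration with made-up values and a hypothetical describe helper; the statistics view in ZenML is much richer than this.

```python
import statistics


def describe(values):
    """Return a few basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


# Made-up 'age' values per split, for illustration only.
age_by_split = {"train": [22, 35, 41, 58], "eval": [29, 47]}

summary = {name: describe(values) for name, values in age_by_split.items()}
for name, stats in summary.items():
    print(name, stats)
```

Comparing such numbers across train and eval is a quick way to spot a split that is not representative of the whole dataset.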

Evaluate

On the other hand, if you want to evaluate the results of your training process, you can use the evaluate method of your pipeline.

Much like view_statistics, if you execute evaluate with the magic flag, it will help you continue in this notebook and generate two new cells, each set up with a different evaluation tool:

  1. TensorBoard can help you to understand the behaviour of your model during the training session

  2. TFMA or tensorflow_model_analysis can help you assess your already trained model based on given metrics and slices on the evaluation dataset

Note: if you want to see the sliced results, uncomment the last line of the generated cell and adjust it according to the slicing column. In the end it should look like this:

tfma.view.render_slicing_metrics(evaluation, slicing_column='has_diabetes')

training_pipeline.evaluate(magic=True)

… and that is it for the quickstart. If you made it here without a hiccup, you have successfully installed ZenML, set up a ZenML repo, registered a new datasource, configured a training pipeline, executed it locally and evaluated the results. And this is just the tip of the iceberg of ZenML's capabilities.

However, if you did hit a hiccup or you have suggestions or questions regarding our framework, you can always check our docs or our GitHub or, even better, join us on our Slack channel.

Cheers!