Integration with Git

Versioning custom code

We are not looking to reinvent the wheel, and we’re not trying to interfere too much with established workflows. When it comes to versioning of code, that means a solid integration into Git.

In short: ZenML optionally uses Git SHAs to resolve your version-pinned pipeline code.

When you add custom code to ZenML, you have the ability to specify a specific Git SHA for your code. ZenML ties into your local Git history and will automatically try to resolve the SHA into usable code. Every pipeline configuration will persist the combination of the class used, and the related SHA in the pipeline config.

Hint

The format used is: class@git_sha, where:

  • class: a fully-qualified python import path of a ZenML-compatible class, e.g. my_module.my_class.MyClassName

  • git_sha (optional): a 40-digit string representing the commit git sha at which the class exists

You can, of course, run your code as-is and maintain version control via your own logic and your own automation. This is why the git_sha above is optional: If you run a pipeline where the class is not committed (i.e. unstaged or staged but not committed), then no git_sha is added to the config. In this case, each time the pipeline is run or loaded the class is loaded as is from the class path, directly from the working tree’s current state.

Warning

While it is faster to just keep running pipelines with un-pinned classes, each un-pinned class adds a technical debt to the ZenML repository. This is because there are no guarantees of reproducibility once a pipeline has a class that is un-pinned. We strongly advice to always commit all code before running pipelines.

Versioning built-in methods

Since ZenML comes with a lot of batteries included, and as ZenML is undergoing rapid development, we’re providing a way to version built-in methods, too.

Specifying the version of a built-in method will be persisted in the pipeline config as step.path@zenml_0.1.0. E.g. zenml.core.steps.data.bq_data_step.BQDataStep@zenml_0.1.4

Under the hood

When running a version-pinned piece of code, ZenML loads all SHA-pinned classes from your git history into memory. This is done via an - immediately reversed - in-memory checkout of the specified SHA.

Safe-guards with in-memory loading

In order to ensure this is not a destructive operation, ZenML does not allow the in-memory checkout if any of the files in the module folder where the class resides is un-committed. E.g. Attempting to load my_module.step.MyStepClass@sha1 will fail if the my_module.step has any uncommitted files.

Organizing code

It is important to understand that when a pipeline is run, all custom classes used, whether they be Steps, Datasources, or Backends under-go a so-called git-resolution process. This means that wherever there is a custom class referenced in a Pipeline, all files within the module are checked to see if they are committed or not. If they are committed, then the class is successfully pinned with the relevant sha. If they are not, then a warning is thrown but the class is not pinned in the corresponding config. Therefore, it is important to consider not only the file where custom logic resides, but the entire module. This is also the reason that upwards relative imports are not permitted within these class files.

We recommend that users follow our recommendationto structure their ZenML repositories, to avoid any potential Git-related issues.

A concrete example

Let’s say we created a custom TrainerStep like and placed it in our ZenML repository here:

repository
│   requirements.txt
│   pipeline_run.py
│
└───trainers
    │   __init__.py
    |
    └───my_awesome_trainer
        │   my_trainer_step.py
        |   __init__.py

where the contents of my_trainer_step.py are:

from zenml.core.steps.trainer.base_trainer import BaseTrainerStep

class MyAwesomeTrainer(BaseTrainerStep):
    def run_fn(self, *args, **kwargs):
        a = 1
        # create a great trainer here.

If we commit everything and then run a pipeline like so:

from zenml.core.pipelines.training_pipeline import TrainingPipeline
from trainers.my_awesome_trainer.my_trainer_step import MyAwesomeTrainer

training_pipeline = TrainingPipeline(name='My Awesome Pipeline')

# Fill in other steps

# Add a trainer
training_pipeline.add_trainer(FeedForwardTrainer(
    loss='binary_crossentropy',
    last_activation='sigmoid',
    output_units=1,
    metrics=['accuracy'],
    epochs=20))

training_pipeline.run()

Then the corresponding pipeline YAML may look like:

version: '1'

datasource:
  ...

environment:
  ...

steps:
  training:
    args: {}
    source: trainers.my_awesome_trainer.my_trainer_step.MyAwesomeTrainer@e9448e0abbc6f03252578ca877bc80c94f137edd
  ...

Notice the source key is tagged with the full path to trainer class and the sha e9448e0abbc6f03252578ca877bc80c94f137edd. If we ever load this pipeline or step using e.g. repo.get_pipeline_by_name() then the following would happen:

Hint

We change e9448e0abbc6f03252578ca877bc80c94f137edd to e9448e0a for readability purposes.

  • All files within the directory trainers/my_awesome_trainer/ would be checked to see if committed or not. Only if all files are committed properly would ZenML allow for loading the pipeline.

  • The directory repository/trainers/my_awesome_trainer/ would be checked out to sha e9448e0a. This is achieved by executing git checkout e9448e0a -- trainers/my_awesome_trainer/

  • The module is loaded using the standard Python importlib library.

  • The git checkout is reverted.

This way, all ZenML custom classes can be used in different environments, and reproducibility is ensured.