Connect your Git repository

Track your code and avoid unnecessary Docker builds by connecting your Git repository.

A code repository in ZenML refers to a remote storage location for your code. Some commonly known code repository platforms include GitHub and GitLab.

Code repositories enable ZenML to keep track of the code version that you use for your pipeline runs. Additionally, running a pipeline that is tracked in a registered code repository can speed up the Docker image building for containerized stack components by eliminating the need to rebuild Docker images each time you change one of your source code files.

Speeding up Docker builds for containerized components

As discussed before, when your stack contains containerized components, ZenML needs to build Docker images to execute your code remotely. If you're not using a code repository, your code is included in the Docker images that ZenML builds. This means that a new Docker image has to be built and pushed whenever you change any of your source files. When you run a pipeline that is part of a local code repository checkout, ZenML can instead build the Docker images without your source files and download the files inside the container before running your code. This greatly speeds up the build process and also allows you to reuse images that one of your colleagues might have built for the same stack.

It is also important to take some additional points into consideration:

  • The file download is only possible if the local checkout is clean (i.e. it does not contain any untracked or uncommitted files) and the latest commit has been pushed to the remote repository. Otherwise, the file download inside the Docker container will fail.

  • If you want to disable or enforce the downloading of files, check out this docs page for the available options.
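If you want to verify the first condition yourself before running a pipeline, the following is a minimal sketch that uses plain git commands. It is not how ZenML performs the check internally; the function names are hypothetical, and it assumes that the current branch has an upstream configured and that your remote-tracking refs are up to date.

```python
import subprocess
from typing import List


def dirty_paths(porcelain_output: str) -> List[str]:
    """Return the paths that `git status --porcelain` reports as changed or untracked."""
    # Each porcelain line has two status characters, a space, then the path.
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def unpushed_commit_count(rev_list_output: str) -> int:
    """Parse the output of `git rev-list --count @{upstream}..HEAD`."""
    return int(rev_list_output.strip() or 0)


def run_git(*args: str, cwd: str = ".") -> str:
    """Run a git command in `cwd` and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def checkout_allows_download(cwd: str = ".") -> bool:
    """Return True if the checkout is clean and HEAD has been pushed upstream."""
    # Raises subprocess.CalledProcessError if `cwd` is not a git checkout
    # or if the current branch has no upstream branch.
    dirty = dirty_paths(run_git("status", "--porcelain", cwd=cwd))
    ahead = unpushed_commit_count(
        run_git("rev-list", "--count", "@{upstream}..HEAD", cwd=cwd)
    )
    return not dirty and ahead == 0
```

Calling checkout_allows_download() from the root of your checkout returns False as soon as there is an untracked file, an uncommitted change, or a local commit that has not been pushed.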

In order to benefit from the advantages of having a code repository in a project, you need to make sure that the relevant integrations are installed for your ZenML installation.

For instance, let's assume you are working on a project with ZenML and one of your team members has already registered a corresponding code repository of type github for it. If you run zenml code-repository list, you will see this repository. However, to fully use it, you still need to install the corresponding integration, in this example the github integration.

zenml integration install github

Registering a code repository

If you are planning to use one of the available implementations of code repositories, first, you need to install the corresponding ZenML integration:

zenml integration install <INTEGRATION_NAME>

Afterward, code repositories can be registered using the CLI:

zenml code-repository register <NAME> --type=<TYPE> [--CODE_REPOSITORY_OPTIONS]

For concrete options, check out the section on the GitHubCodeRepository, the GitLabCodeRepository or how to develop and register a custom code repository implementation.

Detecting local code repository checkouts

Once you have registered one or more code repositories, ZenML will check whether the files you use when running a pipeline are tracked inside one of those code repositories. This happens as follows:

  • First, the source root is computed

  • Next, ZenML checks whether this source root directory is included in a local checkout of one of the registered code repositories
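The second step is essentially a path containment check. As an illustration only (ZenML's actual detection logic is not shown here, and the function name and the list of checkout roots are hypothetical), it could be sketched like this:

```python
from pathlib import Path
from typing import Iterable, Optional


def find_containing_checkout(
    source_root: str, checkout_roots: Iterable[str]
) -> Optional[str]:
    """Return the first checkout root that contains `source_root`, if any."""
    resolved = Path(source_root).resolve()
    for root in checkout_roots:
        root_path = Path(root).resolve()
        # Compare path components rather than string prefixes, so that
        # "/work/repository" is not mistaken for a child of "/work/repo".
        if resolved == root_path or root_path in resolved.parents:
            return root
    return None
```

If the function returns None, the source root is not part of any of the given checkouts, which corresponds to the case where no code repository is detected for the run.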

Tracking code version for pipeline runs

If a local code repository checkout is detected when running a pipeline, ZenML will store a reference to the current commit for the pipeline run so you'll be able to know exactly which code was used. Note that this reference is only tracked if your local checkout is clean (i.e. it does not contain any untracked or uncommitted files). This is to ensure that your pipeline is actually running with the exact code stored at the specific code repository commit.

Available implementations

ZenML comes with built-in implementations of the code repository abstraction for the GitHub and GitLab platforms, but it's also possible to use a custom code repository implementation.

GitHub

ZenML provides built-in support for using GitHub as a code repository for your ZenML pipelines. You can register a GitHub code repository by providing the URL of the GitHub instance, the owner of the repository, the name of the repository, and a GitHub Personal Access Token (PAT) with access to the repository.

Before registering the code repository, you have to install the corresponding integration:

zenml integration install github

Afterward, you can register a GitHub code repository by running the following CLI command:

zenml code-repository register <NAME> --type=github \
--url=<GITHUB_URL> --owner=<OWNER> --repository=<REPOSITORY> \
--token=<GITHUB_TOKEN>

where <NAME> is the name of the code repository you are registering, <OWNER> is the owner of the repository, <REPOSITORY> is the name of the repository, <GITHUB_TOKEN> is your GitHub Personal Access Token, and <GITHUB_URL> is the URL of the GitHub instance, which defaults to https://github.com. You will need to set a URL if you are using GitHub Enterprise.

After registering the GitHub code repository, ZenML will automatically detect if your source files are being tracked by GitHub and store the commit hash for each pipeline run.
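If you prefer to script the registration, for example to keep the token out of your shell history, the command can be assembled in Python. This is a minimal sketch that only uses the CLI flags shown above. The function names and the GITHUB_TOKEN environment variable are assumptions made for illustration, not part of ZenML.

```python
import os
import subprocess
from typing import Dict, List


def build_register_command(
    name: str, repo_type: str, options: Dict[str, str]
) -> List[str]:
    """Build the argument list for `zenml code-repository register`."""
    command = ["zenml", "code-repository", "register", name, f"--type={repo_type}"]
    command += [f"--{key}={value}" for key, value in options.items()]
    return command


def register_github_repository(name: str, owner: str, repository: str) -> None:
    """Register a GitHub code repository using a token from the environment."""
    # Assumption: the token was exported as GITHUB_TOKEN beforehand.
    options = {
        "owner": owner,
        "repository": repository,
        "token": os.environ["GITHUB_TOKEN"],
    }
    subprocess.run(build_register_command(name, "github", options), check=True)
```

The same helper works for the other code repository types as long as you pass the options that the respective type expects.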

How to get a token for GitHub
  1. Go to your GitHub account settings and click on Developer settings.

  2. Select "Personal access tokens" and click on "Generate new token".

  3. Give your token a name and a description.

  4. We recommend selecting the specific repository and then giving contents read-only access.

  5. Click on "Generate token" and copy the token to a safe place.

GitLab

ZenML also provides built-in support for using GitLab as a code repository for your ZenML pipelines. You can register a GitLab code repository by providing the URL of the GitLab project, the group of the project, the name of the project, and a GitLab Personal Access Token (PAT) with access to the project.

Before registering the code repository, you have to install the corresponding integration:

zenml integration install gitlab

Afterward, you can register a GitLab code repository by running the following CLI command:

zenml code-repository register <NAME> --type=gitlab \
--url=<GITLAB_URL> --group=<GROUP> --project=<PROJECT> \
--token=<GITLAB_TOKEN>

where <NAME> is the name of the code repository you are registering, <GROUP> is the group of the project, <PROJECT> is the name of the project, <GITLAB_TOKEN> is your GitLab Personal Access Token, and <GITLAB_URL> is the URL of the GitLab instance, which defaults to https://gitlab.com. You will need to set a URL if you have a self-hosted GitLab instance.

After registering the GitLab code repository, ZenML will automatically detect if your source files are being tracked by GitLab and store the commit hash for each pipeline run.

How to get a token for GitLab
  1. Go to your GitLab account settings and click on Access Tokens.

  2. Name the token and select the scopes that you need (e.g. read_repository, read_user, read_api).

  3. Click on "Create personal access token" and copy the token to a safe place.

Developing a custom code repository

If you're using some other platform to store your code and you still want to use a code repository in ZenML, you can implement and register a custom code repository.

First, you'll need to subclass and implement the abstract methods of the zenml.code_repositories.BaseCodeRepository class:

from abc import ABC, abstractmethod
from typing import Optional


class BaseCodeRepository(ABC):
    """Base class for code repositories."""

    @abstractmethod
    def login(self) -> None:
        """Logs into the code repository."""

    @abstractmethod
    def download_files(
            self, commit: str, directory: str, repo_sub_directory: Optional[str]
    ) -> None:
        """Downloads files from the code repository to a local directory.

        Args:
            commit: The commit hash to download files from.
            directory: The directory to download files to.
            repo_sub_directory: The subdirectory in the repository to
                download files from.
        """

    @abstractmethod
    def get_local_context(
            self, path: str
    ) -> Optional["LocalRepositoryContext"]:
        """Gets a local repository context from a path.

        Args:
            path: The path to the local repository.

        Returns:
            The local repository context object.
        """

After you're finished implementing this, you can register it as follows:

# The `CODE_REPOSITORY_OPTIONS` are key-value pairs that your implementation receives
# as configuration in its __init__ method. They usually include the username and other
# credentials needed to authenticate with the code repository platform.
zenml code-repository register <NAME> --type=custom --source=my_module.MyRepositoryClass \
    [--CODE_REPOSITORY_OPTIONS]
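To make the subclassing step more concrete, here is a rough sketch of what my_module.MyRepositoryClass could look like. So that the example runs without ZenML installed, it defines a stand-in BaseCodeRepository that mirrors only the three abstract methods listed above. In a real project you would import the base class from zenml.code_repositories instead, and its constructor may differ from the one assumed here. The "one folder per commit" storage layout and the root option are hypothetical.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, Optional


class BaseCodeRepository(ABC):
    """Stand-in that mirrors only the three abstract methods listed above."""

    @abstractmethod
    def login(self) -> None:
        """Logs into the code repository."""

    @abstractmethod
    def download_files(
        self, commit: str, directory: str, repo_sub_directory: Optional[str]
    ) -> None:
        """Downloads files from the code repository to a local directory."""

    @abstractmethod
    def get_local_context(self, path: str) -> Optional[Any]:
        """Gets a local repository context from a path."""


class MyRepositoryClass(BaseCodeRepository):
    """Hypothetical repository that stores one folder per commit under `root`."""

    def __init__(self, config: Dict[str, str]) -> None:
        # Assumption: the registered options arrive as a plain dictionary.
        self.root = Path(config["root"])
        self.logged_in = False

    def login(self) -> None:
        # A real implementation would authenticate with the platform here.
        if not self.root.is_dir():
            raise RuntimeError(f"Snapshot root {self.root} does not exist")
        self.logged_in = True

    def download_files(
        self, commit: str, directory: str, repo_sub_directory: Optional[str]
    ) -> None:
        # Copy the requested commit (or one of its subdirectories) to `directory`.
        source = self.root / commit
        if repo_sub_directory:
            source = source / repo_sub_directory
        shutil.copytree(source, directory, dirs_exist_ok=True)

    def get_local_context(self, path: str) -> Optional[Any]:
        # This sketch does not detect local checkouts, so it always returns None.
        return None
```

A complete implementation would also return a local repository context from get_local_context when the given path belongs to a checkout of the repository, so that the commit of a pipeline run can be tracked.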
