Orchestrate on the cloud

Orchestrate using cloud resources.

Until now, we've only run pipelines locally. The next step is to untether ourselves from our local machines and transition our pipelines to execute on the cloud. This will enable you to run your MLOps pipelines in a cloud environment, leveraging the scalability and robustness that cloud platforms offer.

In order to do this, we need to get familiar with two more stack components:

  • The orchestrator manages the workflow and execution of your pipelines.

  • The container registry is a storage and content delivery system that holds your Docker container images.

These, along with remote storage, complete a basic cloud stack in which our pipeline runs entirely on the cloud.

Starting with a basic cloud stack

The easiest cloud orchestrator to start with is the SkyPilot orchestrator running on a public cloud. The advantage of SkyPilot is that it simply provisions a VM on your cloud provider to execute the pipeline.

Alongside SkyPilot, we need a mechanism to package your code and ship it to the cloud for SkyPilot to run. ZenML uses Docker to achieve this. Every time you run a pipeline with a remote orchestrator, ZenML builds an image for the entire pipeline (and optionally for each step of the pipeline, depending on your configuration). This image contains the code, requirements, and everything else needed to run the steps of the pipeline in any environment. ZenML then pushes this image to the container registry configured in your stack, and the orchestrator pulls the image when it's ready to execute a step.
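
If your steps need extra dependencies baked into this image, you can attach Docker settings to your pipeline. Below is a minimal Python sketch using ZenML's DockerSettings; the scikit-learn requirement is only an illustration:

from zenml import pipeline
from zenml.config import DockerSettings

# Bundle extra Python requirements into the image that ZenML builds.
docker_settings = DockerSettings(requirements=["scikit-learn"])

@pipeline(settings={"docker": docker_settings})
def training_pipeline():
    ...  # step invocations go here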

To summarize, here is the broad sequence of events that happen when you run a pipeline with such a cloud stack:

  1. The user runs a pipeline on the client machine. This executes the run.py script, where ZenML reads the @pipeline function and understands which steps need to be executed (a minimal sketch of such a script follows this list).

  2. The client asks the server for the stack info; the server responds with the configuration of the cloud stack.

  3. Based on the stack info and pipeline specification, the client builds and pushes an image to the container registry. The image contains the environment needed to execute the pipeline and the code of the steps.

  4. The client creates a run in the orchestrator. For example, in the case of the SkyPilot orchestrator, it creates a virtual machine in the cloud with instructions to pull and run the Docker image from the specified container registry.

  5. The orchestrator pulls the appropriate image from the container registry as it executes the pipeline (each step can have its own image).

  6. As the pipeline runs, each step stores its output artifacts physically in the artifact store. Of course, this artifact store needs to be some form of cloud storage.

  7. As the pipeline runs, it reports its status back to the ZenML server and optionally queries the server for metadata.
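
To make step 1 concrete, here is a minimal sketch of what such a run.py script could look like. The step, pipeline, and flag names are hypothetical placeholders rather than the exact code from the previous chapter:

import argparse

from zenml import pipeline, step

@step
def train_model() -> float:
    # Placeholder training logic; a real step would fit and evaluate a model.
    return 0.92

@pipeline
def training_pipeline():
    train_model()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--training-pipeline", action="store_true")
    args = parser.parse_args()
    if args.training_pipeline:
        # Calling the pipeline on an active cloud stack triggers the
        # build, push, and orchestrate sequence described above.
        training_pipeline()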

Provisioning and registering a SkyPilot orchestrator alongside a container registry

While there are detailed docs on how to set up a SkyPilot orchestrator and a container registry on each public cloud, we have put the most relevant details here for convenience:

In order to launch a pipeline on AWS with the SkyPilot orchestrator, the first thing you need to do is install the AWS and SkyPilot integrations:

zenml integration install aws skypilot_aws -y

Before we start registering any components, there is another step we need to take. As we explained in the previous section, components such as orchestrators and container registries often require you to set up the right permissions. In ZenML, this process is simplified with the use of Service Connectors. For this example, we need to use the IAM role authentication method of our AWS service connector:

AWS_PROFILE=<AWS_PROFILE> zenml service-connector register cloud_connector --type aws --auto-configure
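
You can run zenml service-connector list at any time to confirm that the connector was registered correctly.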

Once the service connector is set up, we can register a SkyPilot orchestrator:

zenml orchestrator register skypilot_orchestrator -f vm_aws
zenml orchestrator connect skypilot_orchestrator --connector cloud_connector

The next step is to register an AWS container registry. As with the orchestrator, we will use our connector when setting up the container registry:

zenml container-registry register cloud_container_registry -f aws --uri=<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com
zenml container-registry connect cloud_container_registry --connector cloud_connector
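
Note that, by default, it is the client machine that builds and pushes the Docker image (step 3 above), so you'll need a working local Docker installation unless you add a remote image builder to your stack.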

With the components registered, everything is set up for the next steps.

For more information, you can always check the dedicated SkyPilot orchestrator guide.

Having trouble setting up infrastructure? Try reading the stack deployment section of the docs to gain more insight. If that still doesn't work, join the ZenML community and ask!

Running a pipeline on a cloud stack

Now that we have our orchestrator and container registry registered, we can register a new stack, just like we did in the previous chapter:

zenml stack register minimal_cloud_stack -o skypilot_orchestrator -a cloud_artifact_store -c cloud_container_registry
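
You can verify the resulting stack configuration with zenml stack describe minimal_cloud_stack.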

Now, using the code from the previous chapter, we can run a training pipeline. First, set the minimal cloud stack as active:

zenml stack set minimal_cloud_stack

and then run the training pipeline:

python run.py --training-pipeline

You will notice that this time your pipeline behaves differently. After building the Docker image with all your code, ZenML pushes that image and spins up a VM in the cloud, where your pipeline executes with its logs streamed back to you. With just a few commands, we were able to ship our entire code to the cloud!

Curious to see what other stacks you can create? The Component Guide has an exhaustive list of various artifact stores, container registries, and orchestrators that are integrated with ZenML. Try playing around with more stack components to see how easy it is to switch between MLOps stacks with ZenML.
