AWS Sagemaker Orchestrator
Orchestrating your pipelines to run on Amazon Sagemaker.
When to use it
You should use the Sagemaker orchestrator if:
you're already using AWS.
you're looking for a proven production-grade orchestrator.
you're looking for a UI in which you can track your pipeline runs.
you're looking for a managed solution for running your pipelines.
you're looking for a serverless solution for running your pipelines.
How it works
How to deploy it
The only other thing necessary to use the ZenML Sagemaker orchestrator is enabling the relevant permissions for your particular role.
Infrastructure Deployment
A Sagemaker orchestrator can be deployed directly from the ZenML CLI:
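As a sketch, the deployment command looks roughly like the following; the orchestrator name is a placeholder and the available flags vary across ZenML versions, so check `zenml orchestrator deploy --help` for the exact options:

```shell
zenml orchestrator deploy sagemaker_orchestrator --flavor=sagemaker
```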
You can pass other configurations specific to the stack components as key-value arguments. If you don't provide a name, a random one is generated for you. For more information about how to use the CLI for this, please refer to the dedicated documentation section.
How to use it
To use the Sagemaker orchestrator, we need:
The ZenML aws and s3 integrations installed. If you haven't done so, run
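The corresponding ZenML CLI command is:

```shell
zenml integration install aws s3
```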
The local client (whoever is running the pipeline) will also have to have the necessary permissions or roles to be able to launch Sagemaker jobs. (This would be covered by the AmazonSageMakerFullAccess policy suggested above.)
There are three ways you can authenticate your orchestrator and link it to the IAM role you have created:
You can now run any ZenML pipeline using the Sagemaker orchestrator:
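As a sketch (the stack name and pipeline script are placeholders), with a registered stack containing the Sagemaker orchestrator set as active, a run is launched like any other ZenML pipeline:

```shell
# activate a stack that contains the Sagemaker orchestrator (name is a placeholder)
zenml stack set sagemaker_stack

# execute the pipeline script as usual
python run.py
```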
If all went well, you should now see the following output:
Sagemaker UI
Sagemaker comes with its own UI that you can use to find further details about your pipeline runs, such as the logs of your steps.
To access the Sagemaker Pipelines UI, you will have to launch Sagemaker Studio via the AWS Sagemaker UI. Make sure that you are launching it from within your desired AWS region.
Once the Studio UI has launched, click on the 'Pipeline' button on the left side. From there you can view the pipelines that have been launched via ZenML:
Debugging SageMaker Pipelines
Open the execution,
Click on the failed step in the pipeline graph,
Go to the 'Output' tab to see the error message or to 'Logs' to see the logs.
Alternatively, you can inspect the logs in CloudWatch: Search for 'CloudWatch' in the AWS console search bar.
Navigate to 'Logs > Log groups.'
Open the '/aws/sagemaker/ProcessingJobs' log group.
Here, you can find log streams for each step of your SageMaker pipeline executions.
Run pipelines on a schedule
Configuration at pipeline or step level
When running your ZenML pipeline with the Sagemaker orchestrator, the configuration set when configuring the orchestrator as a ZenML component will be used by default. However, it is possible to provide additional configuration at the pipeline or step level. This allows you to run whole pipelines or individual steps with alternative configurations. For example, this allows you to run the training process with a heavier, GPU-enabled instance type, while running other steps with lighter instances.
Settings attributes include, among others: image_uri, instance_count, sagemaker_session, entrypoint, base_job_name, and env.
For example, settings can be provided in the following way:
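A minimal sketch, requesting 30GB of additional storage to match the example below (the import path may differ across ZenML versions):

```python
from zenml.integrations.aws.flavors.sagemaker_orchestrator_flavor import (
    SagemakerOrchestratorSettings,
)

# request 30GB of additional storage for the step's instance;
# the instance type is left unset, so Sagemaker defaults apply
sagemaker_orchestrator_settings = SagemakerOrchestratorSettings(
    volume_size_in_gb=30,
)
```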
They can then be applied to a step as follows:
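A sketch of attaching settings to a single step; the step name is illustrative, and the settings key is assumed to follow ZenML's `orchestrator.<flavor>` convention:

```python
from zenml import step
from zenml.integrations.aws.flavors.sagemaker_orchestrator_flavor import (
    SagemakerOrchestratorSettings,
)

sagemaker_orchestrator_settings = SagemakerOrchestratorSettings(
    volume_size_in_gb=30,
)

# only this step uses the alternative configuration;
# all other steps keep the orchestrator-level defaults
@step(settings={"orchestrator.sagemaker": sagemaker_orchestrator_settings})
def my_training_step() -> None:
    ...
```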
For example, if your ZenML component is configured to use ml.c5.xlarge with 400GB additional storage by default, all steps will use it except for the step above, which will use ml.t3.medium (for Processing Steps) or ml.m5.xlarge (for Training Steps) with 30GB additional storage. See the next section for details on how ZenML decides which Sagemaker Step type to use.
Using Warm Pools for your pipelines
To enable Warm Pools, use the SagemakerOrchestratorSettings class:
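A sketch, assuming the keep-alive period is expressed in seconds (the import path may differ across ZenML versions):

```python
from zenml.integrations.aws.flavors.sagemaker_orchestrator_flavor import (
    SagemakerOrchestratorSettings,
)

# keep the instance warm for 300 seconds (5 minutes) after each job completes
sagemaker_orchestrator_settings = SagemakerOrchestratorSettings(
    keep_alive_period_in_seconds=300,
)
```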
This configuration keeps instances warm for 5 minutes after each job completes, allowing subsequent jobs to start faster if initiated within this timeframe. The reduced startup time can be particularly beneficial for iterative development processes or frequently run pipelines.
If you prefer not to use Warm Pools, you can explicitly disable them:
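A sketch, assuming that passing None for the keep-alive period disables the Warm Pool:

```python
from zenml.integrations.aws.flavors.sagemaker_orchestrator_flavor import (
    SagemakerOrchestratorSettings,
)

# no Warm Pool is requested; instances are released after each job
sagemaker_orchestrator_settings = SagemakerOrchestratorSettings(
    keep_alive_period_in_seconds=None,
)
```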
By default, the SageMaker orchestrator uses Training Steps where possible, which can offer performance benefits and better integration with SageMaker's training capabilities. To disable this behavior:
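A sketch, assuming a use_training_step flag controls this behavior:

```python
from zenml.integrations.aws.flavors.sagemaker_orchestrator_flavor import (
    SagemakerOrchestratorSettings,
)

# fall back to Processing Steps for all steps instead of Training Steps
sagemaker_orchestrator_settings = SagemakerOrchestratorSettings(
    use_training_step=False,
)
```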
These settings allow you to fine-tune your SageMaker orchestrator configuration, balancing between faster startup times with Warm Pools and more control over resource usage. By optimizing these settings, you can potentially reduce overall pipeline runtime and improve your development workflow efficiency.
S3 data access in ZenML steps
Import: S3 -> job
Importing data can be useful when large datasets are available in S3 for training, for which manual copying can be cumbersome. Sagemaker supports File (default) and Pipe mode, with which data is either fully copied before the job starts or piped on the fly. See the Sagemaker documentation referenced above for more information about these modes.
Note that data import and export can be used jointly with processor_args for maximum flexibility.
A simple example of importing data from S3 to the Sagemaker job is as follows:
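A sketch, assuming input_data_s3_mode and input_data_s3_uri settings attributes; the bucket name and folder are placeholders:

```python
from zenml.integrations.aws.flavors.sagemaker_orchestrator_flavor import (
    SagemakerOrchestratorSettings,
)

# copy the S3 folder into the job before it starts (File mode)
sagemaker_orchestrator_settings = SagemakerOrchestratorSettings(
    input_data_s3_mode="File",
    input_data_s3_uri="s3://some-bucket-name/folder",
)
```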
In this case, data will be available at /opt/ml/processing/input/data within the job.
It is also possible to split your input over channels. This can be useful if the dataset is already split in S3, or maybe even located in different buckets.
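A sketch of channel-based input, assuming input_data_s3_uri accepts a dictionary mapping channel names to S3 URIs; the bucket names and prefixes are placeholders:

```python
from zenml.integrations.aws.flavors.sagemaker_orchestrator_flavor import (
    SagemakerOrchestratorSettings,
)

# each key becomes a channel; the channels may even live in different buckets
sagemaker_orchestrator_settings = SagemakerOrchestratorSettings(
    input_data_s3_mode="File",
    input_data_s3_uri={
        "train": "s3://some-bucket-name/training_data",
        "val": "s3://some-bucket-name/val_data",
        "test": "s3://some-other-bucket-name/test_data",
    },
)
```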
Here, the data will be available in /opt/ml/processing/input/data/train, /opt/ml/processing/input/data/val, and /opt/ml/processing/input/data/test.
Export: job -> S3
Data from within the job (e.g. produced by the training process, or when preprocessing large data) can be exported as well. The structure is highly similar to that of importing data. Copying data to S3 can be configured with output_data_s3_mode, which supports EndOfJob (default) and Continuous.
In the simple case, data in /opt/ml/processing/output/data will be copied to S3 at the end of a job:
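A sketch, assuming output_data_s3_mode and output_data_s3_uri settings attributes; the bucket name is a placeholder:

```python
from zenml.integrations.aws.flavors.sagemaker_orchestrator_flavor import (
    SagemakerOrchestratorSettings,
)

# copy the job's output directory to S3 once the job finishes
sagemaker_orchestrator_settings = SagemakerOrchestratorSettings(
    output_data_s3_mode="EndOfJob",
    output_data_s3_uri="s3://some-results-bucket-name/results",
)
```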
In a more complex case, data in /opt/ml/processing/output/data/metadata and /opt/ml/processing/output/data/checkpoints will be written to S3 continuously:
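A sketch of continuous, multichannel output, assuming output_data_s3_uri accepts a dictionary mapping channel names to S3 URIs; the bucket name and prefixes are placeholders:

```python
from zenml.integrations.aws.flavors.sagemaker_orchestrator_flavor import (
    SagemakerOrchestratorSettings,
)

# each key maps a subdirectory of the output path to its own S3 destination,
# and data is streamed to S3 while the job is still running
sagemaker_orchestrator_settings = SagemakerOrchestratorSettings(
    output_data_s3_mode="Continuous",
    output_data_s3_uri={
        "metadata": "s3://some-results-bucket-name/metadata",
        "checkpoints": "s3://some-results-bucket-name/checkpoints",
    },
)
```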
Using multichannel output or any output mode other than EndOfJob will make it impossible to use Training Steps and Warm Pools. See the corresponding sections of this document for details.
Enabling CUDA for GPU-backed hardware