
Databricks Orchestrator

Orchestrating your pipelines to run on Databricks.

Databricks is a unified data analytics platform that combines the best of data warehouses and data lakes to offer an integrated solution for big data processing and machine learning. It provides a collaborative environment for data scientists, data engineers, and business analysts to work together on data projects. Databricks offers optimized performance and scalability for big data workloads.

The Databricks orchestrator is an orchestrator flavor provided by the ZenML databricks integration that allows you to run your pipelines on Databricks. This integration enables you to leverage Databricks' powerful distributed computing capabilities and optimized environment for your ML pipelines within the ZenML framework.

If you only want to run selected steps on Databricks while keeping the overall pipeline on another orchestrator, use the Databricks step operator instead.

When to use it

You should use the Databricks orchestrator if:

  • you're already using Databricks for your data and ML workloads.

  • you want to leverage Databricks' powerful distributed computing capabilities for your ML pipelines.

  • you're looking for a managed solution that integrates well with other Databricks services.

  • you want to take advantage of Databricks' optimization for big data processing and machine learning.

Prerequisites

You will need the following to start using the Databricks orchestrator:

  • An active Databricks workspace. See the cloud-specific setup guide for your provider.

  • A Databricks account or service account with permission to create and run jobs.

How it works


When you run a pipeline with the Databricks orchestrator, ZenML builds a Python wheel from your project and uploads it to Databricks. ZenML then uses the Databricks SDK to create a job whose tasks mirror your pipeline steps and their upstream dependencies.

The job uses the cluster settings configured on the orchestrator, including Spark version, worker count or autoscaling, node type, and any Spark configuration. When Databricks starts the job, each task installs the uploaded wheel and executes the corresponding ZenML step entrypoint.
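The mapping from pipeline steps to job tasks can be pictured with a minimal sketch. It is not ZenML's internal code: the helper build_job_tasks and the example step names are hypothetical, and only the task_key and depends_on field names are taken from the Databricks Jobs API task format.

```python
from typing import Dict, List


def build_job_tasks(upstream: Dict[str, List[str]]) -> List[dict]:
    """Turn a step -> upstream-steps mapping into Jobs-API-style task entries.

    Hypothetical helper for illustration only.
    """
    # Every upstream step must itself be a step of the pipeline.
    unknown = {dep for deps in upstream.values() for dep in deps} - set(upstream)
    if unknown:
        raise ValueError(f"Unknown upstream steps: {sorted(unknown)}")

    tasks = []
    for step_name, deps in upstream.items():
        task = {"task_key": step_name}
        if deps:
            # By default, Databricks starts a task only after all the tasks
            # it depends on have succeeded.
            task["depends_on"] = [{"task_key": dep} for dep in deps]
        tasks.append(task)
    return tasks


# Hypothetical three-step pipeline: evaluate needs both earlier steps.
pipeline_dag = {
    "load_data": [],
    "train_model": ["load_data"],
    "evaluate": ["load_data", "train_model"],
}
tasks = build_job_tasks(pipeline_dag)
```

Each entry in tasks corresponds to one pipeline step. In the real job, every task would additionally carry the cluster settings and the uploaded wheel as a library.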

The orchestrator keeps uploaded wheel packages under /Workspace/Shared/.zenml after a successful job submission because Databricks job definitions, scheduled runs, and manual re-runs keep referencing those workspace files. Clean up old wheel directories from that workspace path according to your team's retention policy when you no longer need to re-run those jobs.

How to use it

To use the Databricks orchestrator, you first need to register it and add it to your stack. Before registering the orchestrator, install the Databricks integration by running zenml integration install databricks.

This installs the required dependencies, including databricks-sdk. Once the integration is installed, register an orchestrator of the databricks flavor and configure authentication with the credentials described below.

We recommend creating a Databricks service account (called a service principal in Databricks terminology) with the necessary permissions to create and run jobs. See the Databricks documentation for details on how to create one. You can then generate a client_id and client_secret for the service account and use them to authenticate with Databricks.


Once a stack containing this orchestrator is active, you can run any ZenML pipeline on Databricks by executing the Python file that runs it.

Databricks UI

Databricks comes with its own UI that you can use to find further details about your pipeline runs, such as the logs of your steps.


For any run executed on Databricks, you can retrieve the URL to the Databricks UI in Python from the pipeline run's metadata, where it is stored under the orchestrator_url key.


Run pipelines on a schedule

The Databricks orchestrator supports running pipelines on a schedule using its native scheduling capability.

How to schedule a pipeline
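As a minimal sketch of what a native Databricks schedule looks like, the following builds the schedule portion of a job definition from a cron expression and a timezone (the schedule timezone listed under Additional configuration). The field names quartz_cron_expression, timezone_id, and pause_status come from the Databricks Jobs API. The helper build_schedule and the example values are hypothetical, and this is not ZenML's internal code.

```python
def build_schedule(cron_expression: str, timezone_id: str) -> dict:
    """Build a Jobs-API-style schedule block (hypothetical helper)."""
    fields = cron_expression.split()
    # Databricks schedules use Quartz cron syntax: six fields starting with
    # seconds, plus an optional seventh field for the year.
    if len(fields) not in (6, 7):
        raise ValueError(
            f"Expected a Quartz cron expression with 6 or 7 fields, "
            f"got {len(fields)}: {cron_expression!r}"
        )
    if not timezone_id:
        raise ValueError("A timezone ID such as 'Europe/Berlin' is required")
    return {
        "quartz_cron_expression": cron_expression,
        "timezone_id": timezone_id,
        "pause_status": "UNPAUSED",
    }


# Every day at 06:00:00 in the given timezone.
schedule = build_schedule("0 0 6 * * ?", "Europe/Berlin")
```

A five-field Unix cron expression such as */5 * * * * is rejected by this check, which is a common source of confusion when moving schedules between systems.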

How to delete a scheduled pipeline

ZenML creates the Databricks schedule, but you manage its lifecycle in Databricks. To cancel a scheduled Databricks pipeline, delete the schedule in Databricks via the UI or CLI.

Additional configuration

For additional configuration of the Databricks orchestrator, you can pass DatabricksOrchestratorSettings, which lets you change the Spark version, number of workers, node type, autoscale settings, Spark configuration, Spark environment variables, schedule timezone, init scripts, and Docker image settings. Init scripts must use DBFS paths that start with dbfs:/. If you configure Docker registry authentication, provide both docker_image_username and docker_image_password.
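The two rules above (init script paths and Docker credentials) can be checked before submitting a run. The sketch below is a standalone illustration, not the validation that ZenML itself performs: the function check_settings and the parameter name init_scripts are assumptions, while docker_image_username and docker_image_password are the setting names given above.

```python
from typing import Optional, Sequence


def check_settings(
    init_scripts: Sequence[str] = (),
    docker_image_username: Optional[str] = None,
    docker_image_password: Optional[str] = None,
) -> None:
    """Raise ValueError if the settings break one of the documented rules."""
    # Rule 1: every init script must be a DBFS path starting with dbfs:/.
    bad_paths = [path for path in init_scripts if not path.startswith("dbfs:/")]
    if bad_paths:
        raise ValueError(f"Init scripts must start with 'dbfs:/': {bad_paths}")

    # Rule 2: Docker registry credentials must be given as a pair.
    if (docker_image_username is None) != (docker_image_password is None):
        raise ValueError(
            "Provide both docker_image_username and docker_image_password, "
            "or neither"
        )


# Passes silently: valid DBFS path and a complete credential pair.
check_settings(
    init_scripts=["dbfs:/init/install_deps.sh"],
    docker_image_username="registry-user",
    docker_image_password="registry-token",
)
```

Passing a path such as /tmp/install_deps.sh, or a username without a password, raises a ValueError in this sketch.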

Use num_workers for fixed-size clusters. For autoscaling clusters, omit num_workers and set autoscale, for example autoscale=(2, 3).
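To make the difference between the two sizing modes concrete, here is a small sketch that turns either num_workers or an autoscale tuple into the corresponding part of a Databricks cluster specification. The helper cluster_size is hypothetical. The output keys num_workers, autoscale, min_workers, and max_workers follow the Databricks cluster API, and the (0, 1) fallback mirrors the default bounds stated on this page.

```python
from typing import Optional, Tuple


def cluster_size(
    num_workers: Optional[int] = None,
    autoscale: Optional[Tuple[int, int]] = None,
) -> dict:
    """Return the sizing part of a cluster spec (hypothetical helper)."""
    if num_workers is not None and autoscale is not None:
        raise ValueError("Set either num_workers or autoscale, not both")

    if num_workers is not None:
        # Fixed-size cluster.
        if num_workers < 0:
            raise ValueError("num_workers must not be negative")
        return {"num_workers": num_workers}

    # Autoscaling cluster; fall back to the default bounds when unset.
    min_workers, max_workers = autoscale if autoscale is not None else (0, 1)
    if not 0 <= min_workers <= max_workers:
        raise ValueError("autoscale must satisfy 0 <= min <= max")
    return {"autoscale": {"min_workers": min_workers, "max_workers": max_workers}}


fixed = cluster_size(num_workers=3)
elastic = cluster_size(autoscale=(2, 3))
```

Here fixed describes a cluster with exactly three workers, while elastic lets Databricks scale between two and three workers.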

These settings can then be specified at either the pipeline level or the step level.

Tagging Databricks resources

You can apply tags to Databricks resources for cost allocation, governance, and project tracking using two settings:

  • custom_tags: Applied to the underlying cluster resources (e.g., AWS EC2 instances, EBS volumes). Maximum 45 tags.

  • job_tags: Applied to the Databricks job itself and forwarded as cluster tags. Maximum 25 tags.
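The tag limits above are easy to exceed when tags are assembled programmatically, so a pre-flight check can help. The following is a sketch under the assumption that both settings are plain string-to-string dictionaries. The function check_tags and the example tag names are hypothetical, and the limits are the ones listed above.

```python
from typing import Dict

MAX_CUSTOM_TAGS = 45  # limit for custom_tags, as listed above
MAX_JOB_TAGS = 25     # limit for job_tags, as listed above


def check_tags(custom_tags: Dict[str, str], job_tags: Dict[str, str]) -> None:
    """Raise ValueError if either tag dictionary exceeds its limit."""
    if len(custom_tags) > MAX_CUSTOM_TAGS:
        raise ValueError(
            f"custom_tags has {len(custom_tags)} entries, "
            f"maximum is {MAX_CUSTOM_TAGS}"
        )
    if len(job_tags) > MAX_JOB_TAGS:
        raise ValueError(
            f"job_tags has {len(job_tags)} entries, maximum is {MAX_JOB_TAGS}"
        )


# Hypothetical tags for cost allocation and project tracking.
custom_tags = {"cost-center": "ml-platform", "environment": "staging"}
job_tags = {"project": "churn-model", "owner": "data-science"}
check_tags(custom_tags, job_tags)
```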

By default, the autoscale setting uses (0, 1) as its minimum and maximum worker bounds. This intentionally permits driver-only clusters while still allowing one worker when needed.

To use GPU-backed clusters, set spark_version to a GPU-enabled Spark version and node_type_id to a GPU-enabled node type. The orchestrator then creates the job cluster with that Spark version and node type.

Enabling CUDA for GPU-backed hardware

If your steps need CUDA, follow the distributed training guide to configure the required dependencies and runtime settings.


See the SDK docs for all configurable attributes, and the ZenML documentation on settings for more information on how to specify them.
