Databricks Orchestrator
Orchestrating your pipelines to run on Databricks.
Databricks is a unified data analytics platform that combines the best of data warehouses and data lakes to offer an integrated solution for big data processing and machine learning. It provides a collaborative environment for data scientists, data engineers, and business analysts to work together on data projects. Databricks offers optimized performance and scalability for big data workloads.
The Databricks orchestrator is an orchestrator flavor provided by the ZenML databricks integration that allows you to run your pipelines on Databricks. This integration enables you to leverage Databricks' powerful distributed computing capabilities and optimized environment for your ML pipelines within the ZenML framework.
If you only want to run selected steps on Databricks while keeping the overall pipeline on another orchestrator, use the Databricks step operator instead.
The following features are currently in Alpha and may be subject to change. We recommend using them in a controlled environment and providing feedback to the ZenML team.
When to use it
You should use the Databricks orchestrator if:
you're already using Databricks for your data and ML workloads.
you want to leverage Databricks' powerful distributed computing capabilities for your ML pipelines.
you're looking for a managed solution that integrates well with other Databricks services.
you want to take advantage of Databricks' optimization for big data processing and machine learning.
Prerequisites
You will need the following to start using the Databricks orchestrator:
A Databricks account or service account with permission to create and run jobs.
How it works

When you run a pipeline with the Databricks orchestrator, ZenML builds a Python wheel from your project and uploads it to Databricks. ZenML then uses the Databricks SDK to create a job whose tasks mirror your pipeline steps and their upstream dependencies.
The job uses the cluster settings configured on the orchestrator, including Spark version, worker count or autoscaling, node type, and any Spark configuration. When Databricks starts the job, each task installs the uploaded wheel and executes the corresponding ZenML step entrypoint.
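To make the mapping from pipeline steps to job tasks more concrete, here is a minimal sketch in plain Python. It is an illustration, not ZenML's implementation: the step names and the build_tasks helper are hypothetical, and the dictionaries only mimic the task_key / depends_on shape that Databricks job task definitions use.

```python
from typing import Dict, List


def build_tasks(upstream: Dict[str, List[str]]) -> List[dict]:
    """Turn a step -> upstream-steps mapping into task-like dictionaries."""
    tasks = []
    for step_name, parents in upstream.items():
        tasks.append(
            {
                "task_key": step_name,
                # A task may start only after every listed task has finished.
                "depends_on": [{"task_key": parent} for parent in sorted(parents)],
            }
        )
    return tasks


# Hypothetical three-step pipeline: evaluate_model consumes both the
# loaded data and the trained model.
pipeline_dag = {
    "load_data": [],
    "train_model": ["load_data"],
    "evaluate_model": ["load_data", "train_model"],
}

tasks = build_tasks(pipeline_dag)
for task in tasks:
    print(task["task_key"], "<-", [dep["task_key"] for dep in task["depends_on"]])
```

Each step becomes one task, and the upstream steps become that task's dependencies, so Databricks can run independent tasks in parallel and hold back the rest until their inputs are ready.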
The orchestrator keeps uploaded wheel packages under /Workspace/Shared/.zenml after a successful job submission because Databricks job definitions, scheduled runs, and manual re-runs keep referencing those workspace files. Clean up old wheel directories from that workspace path according to your team's retention policy when you no longer need to re-run those jobs.
How to use it
To use the Databricks orchestrator, you first need to register it and add it to your stack. Before registering the orchestrator, install the Databricks integration by running zenml integration install databricks.
This installs the required dependencies, including databricks-sdk. Once the integration is installed, register the orchestrator and configure authentication:
We recommend creating a Databricks service account with the necessary permissions to create and run jobs; the Databricks documentation explains how to create one. You can then generate a client_id and client_secret for the service account and use them to authenticate with Databricks.

You can now run any ZenML pipeline using the Databricks orchestrator:
Databricks UI
Databricks comes with its own UI that you can use to find further details about your pipeline runs, such as the logs of your steps.

For any runs executed on Databricks, you can get the URL to the Databricks UI in Python using the following code snippet:
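A minimal sketch of such a lookup is shown below. It assumes that the run metadata stores the link under the key orchestrator_url and that, depending on your ZenML version, the entry is either the raw value or an object exposing a value attribute. The helper names are hypothetical, and the run name you pass in is a placeholder for one of your own runs.

```python
from typing import Any, Mapping


def orchestrator_url_from_metadata(run_metadata: Mapping[str, Any]) -> str:
    """Read the orchestrator URL from a run's metadata mapping."""
    # Assumption: the URL is stored under the "orchestrator_url" key.
    entry = run_metadata["orchestrator_url"]
    # Assumption: some ZenML versions wrap metadata in an object with a
    # `.value` attribute, while others store the raw value directly.
    return str(getattr(entry, "value", entry))


def get_databricks_ui_url(run_name: str) -> str:
    """Fetch a pipeline run by name and return its Databricks UI URL."""
    # Imported lazily so the helper above stays usable without ZenML.
    from zenml.client import Client

    pipeline_run = Client().get_pipeline_run(run_name)
    return orchestrator_url_from_metadata(pipeline_run.run_metadata)
```

Calling get_databricks_ui_url with the name of a finished run should return a link you can open in the browser, provided the run was executed with the Databricks orchestrator.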

Run pipelines on a schedule
The Databricks orchestrator supports running pipelines on a schedule using its native scheduling capability.
How to schedule a pipeline
The Databricks orchestrator supports only the cron_expression parameter of the Schedule object and ignores all other parameters supplied to define the schedule.
The Databricks orchestrator requires an IANA timezone ID to be configured through schedule_timezone in the orchestrator settings (see below for more information on how to set orchestrator settings).
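Because schedule_timezone must be an IANA timezone ID, it can help to check the value locally before submitting a scheduled pipeline. The sketch below uses Python's standard zoneinfo module (Python 3.9+). The is_iana_timezone helper is hypothetical, and the check depends on the time zone database available on your machine, so treat it as a convenience rather than a guarantee that Databricks accepts the value.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def is_iana_timezone(name: str) -> bool:
    """Return True if `name` resolves in the local time zone database."""
    try:
        ZoneInfo(name)
    except (ZoneInfoNotFoundError, ValueError, OSError):
        # Unknown keys, malformed keys, and keys that point at a directory
        # are all treated as invalid.
        return False
    return True


# Example values only: an IANA ID and a free-form label that is not one.
for candidate in ("Europe/Berlin", "Berlin time"):
    print(candidate, "->", is_iana_timezone(candidate))
```

A value that passes this check is the kind of string you would then set as schedule_timezone in the orchestrator settings, alongside the cron_expression on the Schedule object.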
How to delete a scheduled pipeline
ZenML creates the Databricks schedule, but you manage its lifecycle in Databricks. To cancel a scheduled Databricks pipeline, delete the schedule in Databricks via the UI or CLI.
Additional configuration
For additional configuration of the Databricks orchestrator, you can pass DatabricksOrchestratorSettings which allows you to change the Spark version, number of workers, node type, autoscale settings, Spark configuration, Spark environment variables, schedule timezone, init scripts, and Docker image settings. Init scripts must use DBFS paths that start with dbfs:/. If you configure Docker registry authentication, provide both docker_image_username and docker_image_password.
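The two constraints above can be checked before you build the settings object. The following sketch is a standalone helper, not part of ZenML: check_cluster_extras and its init_script_paths parameter are hypothetical names, while docker_image_username and docker_image_password are the setting names mentioned above.

```python
from typing import List, Optional


def check_cluster_extras(
    init_script_paths: List[str],
    docker_image_username: Optional[str] = None,
    docker_image_password: Optional[str] = None,
) -> List[str]:
    """Collect all configuration problems instead of stopping at the first."""
    problems = []
    for path in init_script_paths:
        if not path.startswith("dbfs:/"):
            problems.append(f"init script path must start with dbfs:/ (got {path})")
    # Registry credentials are only meaningful as a pair.
    if (docker_image_username is None) != (docker_image_password is None):
        problems.append(
            "set both docker_image_username and docker_image_password, or neither"
        )
    return problems


# Example values only: one valid path, one invalid path, and a username
# without a matching password.
for problem in check_cluster_extras(
    ["dbfs:/init/install_deps.sh", "/Workspace/init/setup.sh"],
    docker_image_username="ci-bot",
):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets you report every misconfiguration in a single pass.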
Use num_workers for fixed-size clusters. For autoscaling clusters, omit num_workers and set autoscale, for example autoscale=(2, 3).
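One way to keep the two sizing modes apart is to build the sizing arguments through a small helper. The sketch below is an assumption about how you might organize your own code: cluster_size_kwargs is a hypothetical function, and its rule of rejecting both values at once is this helper's policy rather than documented ZenML validation. The resulting dictionary is meant to be passed as keyword arguments when constructing DatabricksOrchestratorSettings.

```python
from typing import Optional, Tuple


def cluster_size_kwargs(
    num_workers: Optional[int] = None,
    autoscale: Optional[Tuple[int, int]] = None,
) -> dict:
    """Return sizing keyword arguments for exactly one sizing mode."""
    if num_workers is not None and autoscale is not None:
        raise ValueError("set either num_workers or autoscale, not both")
    if autoscale is not None:
        min_workers, max_workers = autoscale
        if min_workers < 0 or min_workers > max_workers:
            raise ValueError("autoscale must be (min, max) with 0 <= min <= max")
        return {"autoscale": (min_workers, max_workers)}
    if num_workers is not None:
        if num_workers < 0:
            raise ValueError("num_workers must not be negative")
        return {"num_workers": num_workers}
    # Nothing specified: leave the orchestrator defaults untouched.
    return {}


print(cluster_size_kwargs(num_workers=4))       # fixed-size cluster
print(cluster_size_kwargs(autoscale=(2, 3)))    # autoscaling cluster
```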
These settings can then be specified at either the pipeline level or the step level.
Tagging Databricks resources
You can apply tags to Databricks resources for cost allocation, governance, and project tracking using two settings:
custom_tags: Applied to the underlying cluster resources (e.g., AWS EC2 instances, EBS volumes). Maximum 45 tags.
job_tags: Applied to the Databricks job itself and forwarded as cluster tags. Maximum 25 tags.
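The tag limits above are easy to exceed once several teams add their own labels, so a quick local check can be useful. This sketch assumes both settings take string-to-string dictionaries. check_tag_limits and the example tag names are hypothetical, and only the two maximums come from the list above.

```python
from typing import Dict, List

MAX_CUSTOM_TAGS = 45  # limit for custom_tags
MAX_JOB_TAGS = 25     # limit for job_tags


def check_tag_limits(custom_tags: Dict[str, str], job_tags: Dict[str, str]) -> List[str]:
    """Return a message for every tag setting that exceeds its limit."""
    problems = []
    if len(custom_tags) > MAX_CUSTOM_TAGS:
        problems.append(f"custom_tags has {len(custom_tags)} entries, maximum is {MAX_CUSTOM_TAGS}")
    if len(job_tags) > MAX_JOB_TAGS:
        problems.append(f"job_tags has {len(job_tags)} entries, maximum is {MAX_JOB_TAGS}")
    return problems


# Example tags only.
custom_tags = {"cost_center": "ml-platform", "environment": "staging"}
job_tags = {"project": "churn-model"}
print(check_tag_limits(custom_tags, job_tags))
```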
By default, Databricks autoscaling uses (0, 1) worker bounds. This intentionally permits driver-only clusters while still allowing one worker when needed.
To use GPU-backed clusters, set spark_version and node_type_id to GPU-enabled values:
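A sketch of such a configuration is shown below. The runtime string and node type are example values that you need to replace with ones available in your workspace and cloud. The import path and the "orchestrator" settings key are assumptions about the current integration layout, and build_gpu_settings is a hypothetical helper.

```python
# Example values only: choose a GPU runtime and node type that your
# Databricks workspace actually offers.
GPU_CLUSTER_KWARGS = {
    "spark_version": "15.3.x-gpu-ml-scala2.12",
    "node_type_id": "Standard_NC24ads_A100_v4",
}


def build_gpu_settings():
    """Create orchestrator settings for a GPU-backed cluster."""
    # Imported lazily: this requires the ZenML Databricks integration.
    # Assumption: the settings class lives at this import path.
    from zenml.integrations.databricks.flavors.databricks_orchestrator_flavor import (
        DatabricksOrchestratorSettings,
    )

    return DatabricksOrchestratorSettings(**GPU_CLUSTER_KWARGS)


# Typical usage, assuming the "orchestrator" settings key:
#
#   @pipeline(settings={"orchestrator": build_gpu_settings()})
#   def my_pipeline():
#       ...
```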
With these settings, the orchestrator uses a GPU-enabled Spark version and node type.
Enabling CUDA for GPU-backed hardware
If your steps need CUDA, follow the distributed training guide to configure the required dependencies and runtime settings.
Check out the SDK docs for all configurable attributes and the ZenML settings documentation for more information on how to specify settings.