Airflow Orchestrator
Orchestrating your pipelines to run on Airflow.
ZenML pipelines can be executed natively as Airflow DAGs. This brings together the power of Airflow orchestration with the ML-specific benefits of ZenML pipelines. Each ZenML step runs in a separate Docker container which is scheduled and started using Airflow.
If you're going to use a remote deployment of Airflow, you'll also need a remote ZenML deployment.
You should use the Airflow orchestrator if:
you're looking for a proven production-grade orchestrator.
you're already using Airflow.
you want to run your pipelines locally.
you're willing to deploy and maintain Airflow.
The Airflow orchestrator can be used to run pipelines locally as well as remotely. In the local case, no additional setup is necessary.
There are many options to use a deployed Airflow server:
Use the ZenML GCP Terraform module, which includes an Airflow component.
Use a managed deployment of Airflow such as Google Cloud Composer, Amazon MWAA, or Astronomer.
Deploy Airflow manually. Check out the official Airflow documentation for more information.
If you're not using the ZenML GCP Terraform module to deploy Airflow, there are some additional Python packages that you'll need to install in the Python environment of your Airflow server:
pydantic~=2.7.1: The Airflow DAG files that ZenML creates for you require Pydantic to parse and validate configuration files.
apache-airflow-providers-docker or apache-airflow-providers-cncf-kubernetes, depending on which Airflow operator you'll be using to run your pipeline steps. Check out the section on Airflow operators below for more information on supported operators.
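If you manage the Airflow server environment yourself, these packages can be installed with a plain pip install. A minimal sketch; pick the provider package that matches the operator you plan to use:

```bash
# Install the extra packages ZenML needs inside the Airflow server's Python environment
pip install "pydantic~=2.7.1" apache-airflow-providers-docker
# or, if you run steps via the KubernetesPodOperator:
# pip install "pydantic~=2.7.1" apache-airflow-providers-cncf-kubernetes
```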
To use the Airflow orchestrator, we need:
The ZenML airflow integration installed. If you haven't done so, run zenml integration install airflow.
Docker installed and running.
The orchestrator registered and part of our active stack:
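A minimal sketch using standard ZenML CLI calls; the orchestrator and stack names are placeholders, and the stack shown reuses the default artifact store:

```bash
# Register the Airflow orchestrator
zenml orchestrator register <ORCHESTRATOR_NAME> --flavor=airflow
# Add it to a stack and activate that stack
zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> -a default --set
```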
Due to dependency conflicts, we need to install the Python packages to start a local Airflow server in a separate Python environment.
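One way to do this, sketched below, is a dedicated virtual environment for the Airflow server; the environment name is arbitrary, and you may want to pin a specific Airflow version:

```bash
# Create and activate a fresh environment for the local Airflow server
python -m venv airflow_server_environment
source airflow_server_environment/bin/activate
# Install Airflow plus the provider and Pydantic packages mentioned above
pip install apache-airflow "pydantic~=2.7.1" apache-airflow-providers-docker
```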
Before starting the local Airflow server, we can set a few environment variables to configure it:
AIRFLOW_HOME: This variable defines the location where the Airflow server stores its database and configuration files. The default value is ~/airflow.
AIRFLOW__CORE__DAGS_FOLDER: This variable defines the location where the Airflow server looks for DAG files. The default value is <AIRFLOW_HOME>/dags.
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: This variable controls how often the Airflow scheduler checks for new or updated DAGs. By default, the scheduler will check for new DAGs every 30 seconds. This variable can be used to increase or decrease the frequency of the checks, depending on the specific needs of your pipeline.
When running this on macOS, you might need to set the no_proxy environment variable to prevent crashes due to a bug in Airflow (see the related Airflow issue for more information).
We can now start the local Airflow server by running the following command:
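A sketch of the commands involved, assuming Airflow 2.x's standalone mode is used for the local server:

```bash
# macOS only: work around the Airflow bug mentioned above
export no_proxy=*

# Start a local Airflow server (webserver, scheduler, and triggerer in one process)
airflow standalone
```

This command will start up an Airflow server on your local machine. During the startup, it will print a username and password which you can use to log in to the Airflow UI.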
We can now switch back to the Python environment in which ZenML is installed and run a pipeline:
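For illustration, a minimal, hypothetical pipeline file (run.py); the step and pipeline names are made up, but the decorators are ZenML's standard public API:

```python
# run.py -- minimal example pipeline
from zenml import pipeline, step


@step
def say_hello() -> str:
    return "Hello from Airflow!"


@pipeline
def my_airflow_pipeline():
    say_hello()


if __name__ == "__main__":
    # With the Airflow orchestrator in the active stack, this produces the
    # Airflow-compatible .zip file described below instead of running in-process.
    my_airflow_pipeline()
```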
This call will produce a .zip file containing a representation of your ZenML pipeline for Airflow. The location of this .zip file will be shown in the logs of the command above. We now need to copy this file to the Airflow DAGs directory, from where the local Airflow server will load it and run your pipeline (it might take a few seconds until the pipeline shows up in the Airflow UI). To figure out the DAGs directory, we can run airflow config get-value core DAGS_FOLDER while having our Python environment with the Airflow installation active.
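A sketch of that copy step, with the path to the generated archive left as a placeholder:

```bash
# Run inside the Python environment that has Airflow installed
DAGS_FOLDER=$(airflow config get-value core DAGS_FOLDER)
cp <PATH_TO_GENERATED_ZIP> "$DAGS_FOLDER"
```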
To make this process easier, we can configure our ZenML Airflow orchestrator to automatically copy the .zip file to this directory for us. To do so, run the following command:
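A hedged sketch, assuming the orchestrator exposes a dag_output_dir attribute (check the orchestrator's configurable attributes for the exact option name):

```bash
# <ORCHESTRATOR_NAME> is the name used when registering the orchestrator
zenml orchestrator update <ORCHESTRATOR_NAME> --dag_output_dir=<PATH_TO_AIRFLOW_DAGS_FOLDER>
```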
Now that we've set this up, running a pipeline in Airflow is as simple as just running the Python file:
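Assuming the pipeline lives in a file like the run.py sketched earlier:

```bash
python run.py
```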
Airflow operators specify how a step in your pipeline gets executed. As ZenML relies on Docker images to run pipeline steps, only operators that support executing a Docker image work in combination with ZenML. Airflow comes with two operators that support this:
The DockerOperator runs the Docker images for executing your pipeline steps on the same machine that your Airflow server is running on. For this to work, the server environment needs to have the apache-airflow-providers-docker package installed.
The KubernetesPodOperator runs the Docker image on a pod in the Kubernetes cluster that the Airflow server is deployed to. For this to work, the server environment needs to have the apache-airflow-providers-cncf-kubernetes package installed.
You can specify which operator to use and additional arguments to it as follows:
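A sketch of what this configuration can look like; the import path and the shorthand operator values are assumptions based on ZenML's Airflow integration, so verify them against the settings reference:

```python
from zenml import pipeline
from zenml.integrations.airflow.flavors.airflow_orchestrator_flavor import (
    AirflowOrchestratorSettings,
)

airflow_settings = AirflowOrchestratorSettings(
    operator="docker",  # assumed shorthand for the DockerOperator; e.g. "kubernetes_pod" for the KubernetesPodOperator
    operator_args={},   # extra keyword arguments forwarded to the operator
)


@pipeline(settings={"orchestrator": airflow_settings})
def my_airflow_pipeline():
    ...
```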
Custom operators
If you want to use any other operator to run your steps, you can specify the operator in your AirflowOrchestratorSettings as a path to the Python operator class:
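A sketch of this; the operator path is a hypothetical example, and the class it names must be importable in the Airflow server environment:

```python
from zenml.integrations.airflow.flavors.airflow_orchestrator_flavor import (
    AirflowOrchestratorSettings,
)

airflow_settings = AirflowOrchestratorSettings(
    # Full import path of the operator class to use (hypothetical example)
    operator="my_package.operators.MyCustomOperator",
    operator_args={"some_operator_kwarg": "value"},  # forwarded to the operator
)
```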
When using the Airflow orchestrator with a remote Airflow deployment, you'll additionally need:
A remote ZenML server deployed to the cloud. See the deployment guide for more information.
A deployed Airflow server. See the deployment options above for more information.
A remote artifact store as part of your stack.
A remote container registry as part of your stack.
In the remote case, the Airflow orchestrator works differently than other ZenML orchestrators. Executing a Python file which runs a pipeline by calling pipeline.run() will not actually run the pipeline, but instead will create a .zip file containing an Airflow representation of your ZenML pipeline. In one additional step, you need to make sure this zip file ends up in the DAGs directory of your Airflow deployment.
ZenML will build a Docker image called <CONTAINER_REGISTRY_URI>/zenml:<PIPELINE_NAME> which includes your code and use it to run your pipeline steps in Airflow. Check out the containerization documentation if you want to learn more about how ZenML builds these images and how you can customize them.
You can schedule pipeline runs on Airflow similarly to other orchestrators. However, note that Airflow schedules always need to be set in the past, e.g.:
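A sketch of a schedule with a start time in the past, assuming ZenML's Schedule class and reusing the hypothetical pipeline from the earlier example:

```python
from datetime import datetime, timedelta

from zenml.config.schedule import Schedule

scheduled_pipeline = my_airflow_pipeline.with_options(
    schedule=Schedule(
        start_time=datetime.now() - timedelta(hours=1),  # must lie in the past for Airflow
        interval_second=timedelta(minutes=30),
    )
)
scheduled_pipeline()
```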
Airflow comes with its own UI that you can use to find further details about your pipeline runs, such as the logs of your steps. For local Airflow, you can find the Airflow UI at http://localhost:8080 by default.
For additional configuration of the Airflow orchestrator, you can pass AirflowOrchestratorSettings when defining or running your pipeline. Check out the SDK docs for a full list of available attributes, and the settings documentation for more information on how to specify settings.
Note that if you wish to use this orchestrator to run steps on a GPU, you will need to follow the instructions on GPU-enabled settings to ensure that it works. This requires some extra settings customization and is essential to enable CUDA for the GPU to give its full acceleration.
Custom DAG generator file
To run a pipeline in Airflow, ZenML creates a Zip archive that contains two files:
A JSON configuration file that the orchestrator creates. This file contains all the information required to create the Airflow DAG to run the pipeline.
A Python file that reads this configuration file and actually creates the Airflow DAG. We call this file the DAG generator; you can find its implementation in the ZenML source code.
If you need more control over how the Airflow DAG is generated, you can provide a custom DAG generator file using the setting custom_dag_generator. This setting will need to reference a Python module that can be imported into your active Python environment. It will additionally need to contain the same classes (DagConfiguration and TaskConfiguration) and constants (ENV_ZENML_AIRFLOW_RUN_ID, ENV_ZENML_LOCAL_STORES_PATH and CONFIG_FILENAME) as the original DAG generator module. For this reason, we suggest starting by copying the original module and modifying it according to your needs.
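A sketch of pointing the orchestrator at such a module; the module path is hypothetical and must be importable in your active Python environment:

```python
from zenml.integrations.airflow.flavors.airflow_orchestrator_flavor import (
    AirflowOrchestratorSettings,
)

airflow_settings = AirflowOrchestratorSettings(
    # Module providing the same classes and constants as the original DAG generator
    custom_dag_generator="my_package.my_dag_generator",
)
```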
Check out our documentation on how to apply settings to your pipelines for more details. For more information and a full list of configurable attributes of the Airflow orchestrator, check out the SDK docs.