Airflow Orchestrator
Orchestrating your pipelines to run on Airflow.
When to use it
You should use the Airflow orchestrator if:
you're looking for a proven production-grade orchestrator.
you're already using Airflow.
you want to run your pipelines locally.
you're willing to deploy and maintain Airflow.
How to deploy it
The Airflow orchestrator can be used to run pipelines locally as well as remotely. In the local case, no additional setup is necessary.
There are many options to use a deployed Airflow server:
If you're not using mlstacks to deploy Airflow, there are some additional Python packages that you'll need to install in the Python environment of your Airflow server:
pydantic~=1.9.2: The Airflow DAG files that ZenML creates for you require Pydantic to parse and validate configuration files.
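For example, in the Python environment of your Airflow server (not your local ZenML environment):

```shell
# Run this on the Airflow server environment, not in your local ZenML environment.
pip install "pydantic~=1.9.2"
```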
How to use it
To use the Airflow orchestrator, we need:
The ZenML airflow integration installed. If you haven't done so, run zenml integration install airflow.
The orchestrator registered and part of our active stack:
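A minimal sketch of registering the orchestrator and adding it to a stack; the names in angle brackets are placeholders for your own components:

```shell
# Register the Airflow orchestrator
zenml orchestrator register <ORCHESTRATOR_NAME> --flavor=airflow

# Add it to a stack together with the other required components and activate the stack
zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> -a <ARTIFACT_STORE_NAME> --set
```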
In the local case, we need to install one additional Python package for the local Airflow server:
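As a sketch only: assuming the default local setup runs steps via Airflow's DockerOperator (see the operators section below), the additional package would be the Docker provider. Please verify the exact package and version pin required by your ZenML release:

```shell
# Assumption: the local Airflow server runs steps with the DockerOperator,
# which requires the Docker provider package. Check the ZenML Airflow
# integration requirements for the exact package/version for your release.
pip install apache-airflow-providers-docker
```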
Once that is installed, we can start the local Airflow server by running:
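With older ZenML versions this is the (now deprecated) provisioning command mentioned in the note below:

```shell
# Starts the local Airflow server for the active stack.
# Deprecated: see the note below about setting up Airflow manually instead.
zenml stack up
```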
As long as you didn't configure any custom value for the dag_output_dir attribute of your orchestrator, running a pipeline locally is as simple as calling:
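For example (the script name is just a placeholder for whatever file runs your ZenML pipeline):

```shell
python file_that_runs_a_zenml_pipeline.py
```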
This call will write a .zip file containing a representation of your ZenML pipeline to the Airflow DAGs directory. From there, the local Airflow server will load it and run your pipeline (it might take a few seconds until the pipeline shows up in the Airflow UI).
The ability to provision resources using the zenml stack up command is deprecated and will be removed in a future release. While it is still available for the Airflow orchestrator, we recommend following the steps below to set up a local Airflow server manually.
Install the apache-airflow package in the Python environment where ZenML is installed.
Set the Airflow environment variables that configure the behavior of the Airflow server. The following variables are particularly important:
AIRFLOW_HOME: This variable defines the location where the Airflow server stores its database and configuration files. The default value is ~/airflow.
AIRFLOW__CORE__DAGS_FOLDER: This variable defines the location where the Airflow server looks for DAG files. The default value is <AIRFLOW_HOME>/dags.
AIRFLOW__CORE__LOAD_EXAMPLES: This variable controls whether the Airflow server loads the default set of example DAGs. Set it to false so that the example DAGs are not loaded.
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: This variable controls how often the Airflow scheduler checks for new or updated DAGs. By default, the scheduler checks for new DAGs every 30 seconds. This variable can be used to increase or decrease the frequency of the checks, depending on the specific needs of your pipeline.
Run airflow standalone to initialize the database, create a user, and start all components for you, as shown in the sketch below.
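Putting these steps together, a minimal sketch of the manual setup could look like this (paths and the interval value are examples; adjust them to your environment and make sure the DAGs folder matches your orchestrator's dag_output_dir):

```shell
# Install Airflow into the Python environment where ZenML is installed
pip install apache-airflow

# Configure the server (example values)
export AIRFLOW_HOME=~/airflow
export AIRFLOW__CORE__DAGS_FOLDER=$AIRFLOW_HOME/dags
export AIRFLOW__CORE__LOAD_EXAMPLES=false
# Check for new or updated DAGs more frequently than the default
export AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=10

# Initialize the database, create an admin user and start all components
airflow standalone
```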
Scheduling
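A minimal sketch of a scheduled run, assuming the Schedule class and the with_options API available in recent ZenML versions; with the Airflow orchestrator the schedule is translated into the DAG's schedule:

```python
from zenml import pipeline, step
from zenml.config.schedule import Schedule


@step
def say_hello() -> str:
    return "hello"


@pipeline
def my_pipeline():
    say_hello()


# Attach a cron schedule instead of triggering a single run immediately.
scheduled_pipeline = my_pipeline.with_options(
    schedule=Schedule(cron_expression="*/5 * * * *")
)
scheduled_pipeline()
```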
Airflow UI
If you cannot see the Airflow UI credentials in the console, you can find the password in <GLOBAL_CONFIG_DIR>/airflow/<ORCHESTRATOR_UUID>/standalone_admin_password.txt.
GLOBAL_CONFIG_DIR depends on your OS. Run python -c "from zenml.config.global_config import GlobalConfiguration; print(GlobalConfiguration().config_directory)" to get the path for your machine.
ORCHESTRATOR_UUID is the unique ID of the Airflow orchestrator, but there should only be one folder here, so you can just navigate into that one.
The username will always be admin.
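Putting this together, one way to look up the credentials from a terminal (the bracketed parts are placeholders you need to substitute):

```shell
# Print the global config directory for your machine
python -c "from zenml.config.global_config import GlobalConfiguration; print(GlobalConfiguration().config_directory)"

# Read the generated admin password (substitute the directory printed above
# and the single orchestrator UUID folder inside it)
cat <GLOBAL_CONFIG_DIR>/airflow/<ORCHESTRATOR_UUID>/standalone_admin_password.txt
```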
Additional configuration
Enabling CUDA for GPU-backed hardware
Using different Airflow operators
Airflow operators specify how a step in your pipeline gets executed. As ZenML relies on Docker images to run pipeline steps, only operators that support executing a Docker image work in combination with ZenML. Airflow comes with two operators that support this:
The DockerOperator runs the Docker images for executing your pipeline steps on the same machine that your Airflow server is running on. For this to work, the server environment needs to have the apache-airflow-providers-docker package installed.
The KubernetesPodOperator runs the Docker image on a pod in the Kubernetes cluster that the Airflow server is deployed to. For this to work, the server environment needs to have the apache-airflow-providers-cncf-kubernetes package installed.
You can specify which operator to use and additional arguments to it as follows:
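A sketch of how this might look; the settings key "orchestrator" and the shorthand operator values are assumptions based on the ZenML Airflow integration, so double-check them against your ZenML version:

```python
from zenml import pipeline
from zenml.integrations.airflow.flavors.airflow_orchestrator_flavor import (
    AirflowOrchestratorSettings,
)

airflow_settings = AirflowOrchestratorSettings(
    operator="docker",  # assumed shorthand for the DockerOperator; "kubernetes_pod" for the KubernetesPodOperator
    operator_args={},   # extra keyword arguments forwarded to the operator
)


# Apply the settings to all steps of a pipeline; they can also be attached
# to individual steps via @step(settings={"orchestrator": airflow_settings}).
@pipeline(settings={"orchestrator": airflow_settings})
def my_pipeline():
    ...
```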
Custom operators
If you want to use any other operator to run your steps, you can specify the operator in your Airflow orchestrator settings as a path to the Python operator class:
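For example (a sketch; the settings class name and the DockerOperator path below merely illustrate the format of such a dotted path):

```python
from zenml.integrations.airflow.flavors.airflow_orchestrator_flavor import (
    AirflowOrchestratorSettings,
)

airflow_settings = AirflowOrchestratorSettings(
    # Full dotted path to the operator class you want to use.
    operator="airflow.providers.docker.operators.docker.DockerOperator",
    operator_args={},
)
```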
Custom DAG generator file
To run a pipeline in Airflow, ZenML creates a .zip archive that contains two files:
A JSON configuration file that the orchestrator creates. This file contains all the information required to create the Airflow DAG to run the pipeline.
A Python file that reads this configuration and dynamically creates the Airflow DAG for the pipeline.