Apache Airflow is an open-source platform for orchestrating and scheduling workflows: you define data pipelines as code, then schedule and monitor them. In this technical guide, we will set up an Airflow server and use it to run Python Jupyter notebooks periodically. Let’s get started!

Prerequisites Link to heading

Before we begin, ensure you have the following prerequisites:

  • Python and pip installed on your system
  • Docker and Docker Compose (this guide runs Airflow in Docker containers)
  • Basic knowledge of Python, Jupyter notebooks, and Docker

Setting up Airflow Link to heading

There are multiple ways to set up an Airflow environment. In this guide, we’ll use Docker Compose to create a local Airflow server. Follow these steps:

Create a new directory for your Airflow project, along with the folders that will be mounted into the containers:

$ mkdir my-airflow-project
$ cd my-airflow-project
$ mkdir dags logs plugins

Create a docker-compose.yaml file with the following contents:

version: "3"
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
  webserver:
    image: apache/airflow:2.2.0
    command: webserver
    depends_on:
      - postgres
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__LOAD_EXAMPLES: "false"
    ports:
      - "8080:8080"
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
  scheduler:
    image: apache/airflow:2.2.0
    command: scheduler
    depends_on:
      - postgres
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__LOAD_EXAMPLES: "false"
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins

This configuration runs a Postgres metadata database, the Airflow webserver, and the Airflow scheduler using the LocalExecutor, and it maps the local directories ./dags, ./logs, and ./plugins to the respective directories inside the containers so your DAG files, logs, and plugins live on your machine. It is a deliberately minimal, single-machine setup; the official docker-compose.yaml from the Airflow documentation additionally includes Redis, Celery workers, health checks, and an init service. On Linux, make sure the mounted folders are writable by the container user (UID 50000 by default in the official image), or follow the AIRFLOW_UID approach used in the official compose file.
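
Because this minimal compose file has no init service, initialize the metadata database and create a web UI login before the first start. A sketch of the one-off commands, assuming the compose file above (the admin/admin credentials are placeholders you should change):

$ docker-compose run --rm webserver bash -c "airflow db init"
$ docker-compose run --rm webserver bash -c "airflow users create \
    --username admin --password admin \
    --firstname Admin --lastname User \
    --role Admin --email admin@example.com"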

Start the Airflow containers:

$ docker-compose up -d

This command pulls the required images and starts the containers in the background. To access the Airflow web interface, open your browser, navigate to http://localhost:8080, and log in with the user you created above. You should see the Airflow web UI, where you can configure and manage your workflows.
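
If the page does not come up, a quick way to check the containers and inspect their output (docker-compose ps and docker-compose logs are standard Compose commands, run here against the services defined above):

$ docker-compose ps
$ docker-compose logs -f webserver scheduler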

Running Python Jupyter Notebooks with Airflow Link to heading

To run Python Jupyter notebooks periodically using Airflow, we’ll create a simple DAG (Directed Acyclic Graph) that executes a notebook at specified intervals. Follow these steps:

  • Create a new Python file my_dag.py in the dags directory:
from datetime import datetime, timedelta

from airflow import DAG
# In Airflow 2.x the operator comes from the papermill provider package
# (apache-airflow-providers-papermill), not airflow.operators.papermill_operator.
from airflow.providers.papermill.operators.papermill import PapermillOperator

default_args = {
    'owner': 'my_name',
    'depends_on_past': False,
    'start_date': datetime(2022, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'my_dag',
    default_args=default_args,
    schedule_interval='0 0 * * *',  # every day at midnight
    catchup=False,                  # don't backfill runs since start_date
) as dag:
    task = PapermillOperator(
        task_id='execute_notebook',
        # These paths must exist inside the Airflow containers,
        # e.g. under the mounted /opt/airflow/dags directory.
        input_nb='/path/to/my_notebook.ipynb',
        output_nb='/path/to/output_notebook.ipynb',
        parameters={'param1': 'value1', 'param2': 'value2'}
    )

In this example, we define a DAG with a single task that executes a Jupyter notebook using the PapermillOperator. Adjust the paths to your specific notebook locations and provide any necessary parameters.
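
For the parameters dict to reach the notebook, Papermill expects the notebook to contain a cell tagged parameters that defines default values; at run time Papermill injects a new cell with the supplied values right after it. A minimal sketch of such a cell (param1 and param2 mirror the placeholder names used in the DAG above):

# First code cell of my_notebook.ipynb, tagged "parameters"
# (add the tag via the Jupyter/JupyterLab cell tag editor)
param1 = 'default1'
param2 = 'default2'

# Later cells simply use the (possibly overridden) values
print(f'param1={param1}, param2={param2}')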

  • Place your Jupyter notebook file (my_notebook.ipynb) in a directory accessible to the Airflow containers, for example inside ./dags, which is mounted at /opt/airflow/dags.
  • Ensure papermill, the Airflow papermill provider, and a notebook kernel (ipykernel) are installed inside the Airflow containers; running pip install on your host has no effect on the containers (see the snippet after this list for one way to install them into the official image):
$ pip install apache-airflow-providers-papermill papermill ipykernel
  • Refresh the Airflow web interface, and you should see my_dag listed. Unpause the DAG (the toggle next to its name) to enable scheduling.
  • Airflow will automatically run the notebook according to the specified schedule interval.
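
For quick experiments, the official Airflow image can install extra pip packages at container start-up via the _PIP_ADDITIONAL_REQUIREMENTS environment variable (for anything beyond experimentation, the Airflow docs recommend building a custom image instead). A sketch of the key to add to the existing environment block of both the webserver and scheduler services in docker-compose.yaml; you may need to pin a provider version compatible with Airflow 2.2:

      _PIP_ADDITIONAL_REQUIREMENTS: "apache-airflow-providers-papermill papermill ipykernel"

After editing the file, run docker-compose up -d again so the services are recreated with the new packages. To confirm the task works without waiting for the schedule, you can run it once by hand inside the scheduler container (my_dag and execute_notebook are the IDs defined above; the date is an arbitrary logical date):

$ docker-compose exec scheduler airflow tasks test my_dag execute_notebook 2022-01-01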

By leveraging Airflow’s powerful workflow management capabilities, you can automate and orchestrate various data processing tasks, including notebook execution.