Get a Fully Configured Apache Airflow Docker Dev Stack with Bitnami

apache airflow Aug 02, 2020

I've been using Apache Airflow for around 2 years now to build out custom workflow interfaces, like those used for Laboratory Information Management Systems (LIMS), computer vision pre- and postprocessing pipelines, and to set and forget other genomics pipelines.

My favorite feature of Airflow is how completely agnostic it is to the work you are doing or where that work is taking place. It could take place locally, on a Docker image, on Kubernetes, on any number of AWS services, on an HPC system, etc. Using Airflow allows me to concentrate on the business logic of what I'm trying to accomplish without getting too bogged down in implementation details.

During that time I've adopted a set of systems that I use to quickly build out the main development stack with Docker and Docker Compose, using the Bitnami Apache Airflow stack. I generally deploy to production either with the same Docker Compose stack, if it's a small, isolated instance, or with Kubernetes when I need to interact with other services or file systems.

Bitnami vs Roll Your Own

I used to roll my own Airflow containers using Conda. I still use this approach for most of my other containers, including microservices that interact with my Airflow system, but configuring Airflow is a lot more than just installing packages. Even installing those packages is a pain, and I could rarely count on a rebuild working cleanly. On top of the packages, you also need to configure database connections and a message queue.

In comes the Bitnami Apache Airflow docker compose stack for dev and Bitnami Apache Airflow Helm Chart for prod!

Bitnami, in their own words:

Bitnami makes it easy to get your favorite open source software up and running on any platform, including your laptop, Kubernetes and all the major clouds. In addition to popular community offerings, Bitnami, now part of VMware, provides IT organizations with an enterprise offering that is secure, compliant, continuously maintained and customizable to your organizational policies. https://bitnami.com/

Bitnami stacks (usually) work completely the same from their Docker Compose stacks to their Helm charts. This means I can test and develop locally using my compose stack, build out new images, versions, packages, etc, and then deploy to Kubernetes. The configuration, environmental variables, and everything else acts the same. It would be a fairly large undertaking to do all this from scratch, so I use Bitnami.

They have plenty of enterprise offerings, but everything included here is open source and there is no pay wall involved.

And no, I am not affiliated with Bitnami, although I have kids that eat a lot and don't have any particular ethical aversions to selling out. ;-) I've just found their offerings to be excellent.

Grab the Source Code

Everything you need to follow along is included in the post, or you can subscribe to the DevOps for Data Scientists Tutorials newsletter and get the source code delivered to you in a nice zip file.

Project Structure

I like to have my projects organized so that I can run tree and have a general idea of what's happening.

Apache Airflow has 3 main components: the application, the worker, and the scheduler. Each of these has its own Docker image to separate out the services. Additionally, there is a database and a message queue, but we won't be doing any customization to these.

.
└── docker
    └── bitnami-apache-airflow-1.10.10
        ├── airflow
        │   └── Dockerfile
        ├── airflow-scheduler
        │   └── Dockerfile
        ├── airflow-worker
        │   └── Dockerfile
        ├── dags
        │   └── tutorial.py
        └── docker-compose.yml

So what we have here is a directory called bitnami-apache-airflow-1.10.10, which brings us to a very important point: pin your versions! It will save you so, so much pain and frustration!

Then we have one Dockerfile per Airflow piece.

Create this directory structure with:

mkdir -p docker/bitnami-apache-airflow-1.10.10/{airflow,airflow-scheduler,airflow-worker,dags}

The Docker Compose File

This is my preferred docker-compose.yml file. I made a few changes to the original: I pin versions, build my own Docker images, and add volume mounts for the dags, plugins, and database backups. I also mount the Docker socket so I can run DockerOperators from within my stack.

You can always go and grab the original docker-compose here.

version: '2'

services:
  postgresql:
    image: 'docker.io/bitnami/postgresql:10-debian-10'
    volumes:
      - 'postgresql_data:/bitnami/postgresql'
    environment:
      - POSTGRESQL_DATABASE=bitnami_airflow
      - POSTGRESQL_USERNAME=bn_airflow
      - POSTGRESQL_PASSWORD=bitnami1
      - ALLOW_EMPTY_PASSWORD=yes
  redis:
    image: docker.io/bitnami/redis:5.0-debian-10
    volumes:
      - 'redis_data:/bitnami'
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
  airflow-scheduler:
#    image: docker.io/bitnami/airflow-scheduler:1-debian-10
    build:
      context: airflow-scheduler
    environment:
      - AIRFLOW_DATABASE_NAME=bitnami_airflow
      - AIRFLOW_DATABASE_USERNAME=bn_airflow
      - AIRFLOW_DATABASE_PASSWORD=bitnami1
      - AIRFLOW_EXECUTOR=CeleryExecutor
      # If you'd like to load the example DAGs change this to yes!
      - AIRFLOW_LOAD_EXAMPLES=no
      # only works with 1.10.11
      #- AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE=true
      #- AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
    volumes:
      - airflow_scheduler_data:/bitnami
      - ./plugins:/opt/bitnami/airflow/plugins
      - ./dags:/opt/bitnami/airflow/dags
      - ./db_backups:/opt/bitnami/airflow/db_backups
      - /var/run/docker.sock:/var/run/docker.sock
  airflow-worker:
#    image: docker.io/bitnami/airflow-worker:1-debian-10
    build:
      context: airflow-worker
    environment:
      - AIRFLOW_DATABASE_NAME=bitnami_airflow
      - AIRFLOW_DATABASE_USERNAME=bn_airflow
      - AIRFLOW_DATABASE_PASSWORD=bitnami1
      - AIRFLOW_EXECUTOR=CeleryExecutor
      - AIRFLOW_LOAD_EXAMPLES=no
      # only works with 1.10.11
      #- AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE=true
      #- AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
    volumes:
      - airflow_worker_data:/bitnami
      - ./plugins:/opt/bitnami/airflow/plugins
      - ./dags:/opt/bitnami/airflow/dags
      - ./db_backups:/opt/bitnami/airflow/db_backups
      - /var/run/docker.sock:/var/run/docker.sock
  airflow:
#    image: docker.io/bitnami/airflow:1-debian-10
    build:
      # You can also specify the build context
      # as cwd and point to a different Dockerfile
      context: .
      dockerfile: airflow/Dockerfile
    environment:
      - AIRFLOW_DATABASE_NAME=bitnami_airflow
      - AIRFLOW_DATABASE_USERNAME=bn_airflow
      - AIRFLOW_DATABASE_PASSWORD=bitnami1
      - AIRFLOW_EXECUTOR=CeleryExecutor
      - AIRFLOW_LOAD_EXAMPLES=no
      # only works with 1.10.11
      #- AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE=True
      #- AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
    ports:
      - '8080:8080'
    volumes:
      - airflow_data:/bitnami
      - ./dags:/opt/bitnami/airflow/dags
      - ./plugins:/opt/bitnami/airflow/plugins
      - ./db_backups:/opt/bitnami/airflow/db_backups
      - /var/run/docker.sock:/var/run/docker.sock
volumes:
  airflow_scheduler_data:
    driver: local
  airflow_worker_data:
    driver: local
  airflow_data:
    driver: local
  postgresql_data:
    driver: local
  redis_data:
    driver: local
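
The compose file bind-mounts ./plugins, ./dags, and ./db_backups from the host, but only dags is created by the mkdir command earlier. When a bind-mount source is missing, Docker usually creates it for you, often owned by root. Here is a minimal sketch that creates the missing directories up front. It assumes the short "./host:/container" mount syntax used above and finds the mounts with a simple regular expression rather than a real YAML parser; the function names are my own.

```python
import re
from pathlib import Path
from typing import List

# Matches list items like "- ./dags:/opt/bitnami/airflow/dags" and captures "./dags".
# Assumption: bind mounts are written in the short "./host:/container" form.
BIND_MOUNT = re.compile(r"""^\s*-\s*['"]?(\./[^:'"\s]+):""", re.MULTILINE)


def bind_mount_dirs(compose_text: str) -> List[str]:
    """Return the unique relative host paths used as bind mounts, sorted."""
    return sorted(set(BIND_MOUNT.findall(compose_text)))


def ensure_dirs(stack_dir: Path, compose_text: str) -> List[Path]:
    """Create any missing bind-mount directories under stack_dir."""
    created = []
    for rel in bind_mount_dirs(compose_text):
        path = stack_dir / rel
        if not path.exists():
            path.mkdir(parents=True)
            created.append(path)
    return created
```

To use it, read docker-compose.yml as text and call ensure_dirs with the stack directory (here docker/bitnami-apache-airflow-1.10.10) before the first docker-compose up. Named volumes and absolute paths such as the Docker socket are ignored on purpose.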

Pin your versions

The version of Apache Airflow used here is 1.10.10. Version 1.10.11 has some cool updates I would like to incorporate, so I will keep an eye on it!

You can always keep up with the latest Apache Airflow versions by checking out the changelog on the main site.

We are using Bitnami, which has bots that automatically build and update their images as new releases come along.

While this approach is great for bots, I strongly recommend against just hoping that the latest version will be backwards compatible and work with your setup.

Instead, pin a version, and when a new version comes along test it out in your dev stack. At the time of writing the most recent version is 1.10.11, but it doesn't quite work out of the box, so we are using 1.10.10.
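
If you want to enforce pinning rather than rely on memory, a small check can flag floating tags. This is a sketch under my own assumption of what "pinned" means (a full MAJOR.MINOR.PATCH tag such as 1.10.10, optionally with a suffix); it only looks at plain FROM lines and does not handle digests or FROM flags.

```python
import re
from typing import List

# Assumption: "pinned" means a full MAJOR.MINOR.PATCH tag such as 1.10.10,
# optionally followed by a suffix like -debian-10-r5. Adjust to your own policy.
PINNED_TAG = re.compile(r"^\d+\.\d+\.\d+([-.].+)?$")


def image_tag(image_ref: str) -> str:
    """Return the tag part of an image reference, or '' if there is none."""
    last_part = image_ref.rsplit("/", 1)[-1]  # skip any registry host:port
    _name, sep, tag = last_part.partition(":")
    return tag if sep else ""


def is_pinned(image_ref: str) -> bool:
    """True if the image reference carries a full x.y.z version tag."""
    return bool(PINNED_TAG.match(image_tag(image_ref)))


def unpinned_from_lines(dockerfile_text: str) -> List[str]:
    """Return the image references in FROM lines that are not pinned."""
    refs = re.findall(
        r"^\s*FROM\s+(\S+)", dockerfile_text, flags=re.MULTILINE | re.IGNORECASE
    )
    return [ref for ref in refs if not is_pinned(ref)]
```

With this policy docker.io/bitnami/airflow:1.10.10 passes, while bitnami/airflow:latest and the floating 1-debian-10 style tags are reported.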

Bitnami Apache Airflow Docker Tags

Generally speaking, a docker tag corresponds to the application version. Sometimes there are other variants as well, such as base OS. Here we can just go with the application version.

Bitnami Apache Airflow Scheduler Image Tags

Bitnami Apache Airflow Worker Image Tags

Bitnami Apache Airflow Web Image Tags

Build Custom Images

In our docker-compose.yml we have placeholders for building custom images.

We'll just create a minimal Dockerfile for each one for now. Later I'll show you how to customize your Docker image with extra system or Python packages.

Airflow Application

echo "FROM docker.io/bitnami/airflow:1.10.10" > docker/bitnami-apache-airflow-1.10.10/airflow/Dockerfile

This gives you the following Airflow application Dockerfile:

FROM docker.io/bitnami/airflow:1.10.10

Airflow Scheduler

echo "FROM docker.io/bitnami/airflow-scheduler:1.10.10" > docker/bitnami-apache-airflow-1.10.10/airflow-scheduler/Dockerfile

This gives you the following Airflow scheduler Dockerfile:

FROM docker.io/bitnami/airflow-scheduler:1.10.10

Airflow Worker

echo "FROM docker.io/bitnami/airflow-worker:1.10.10" > docker/bitnami-apache-airflow-1.10.10/airflow-worker/Dockerfile

This gives you the following Airflow worker Dockerfile:

FROM docker.io/bitnami/airflow-worker:1.10.10
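
The three echo commands above differ only in the component name, so they can be collapsed into a loop. A minimal sketch, assuming the same directory layout and image names as above; write_dockerfiles is my own helper, and keeping the version in one constant makes a later upgrade a one-line change.

```python
from pathlib import Path
from typing import List

AIRFLOW_VERSION = "1.10.10"
# Each directory name matches the Bitnami image name used in the echo commands.
COMPONENTS = ("airflow", "airflow-scheduler", "airflow-worker")


def write_dockerfiles(stack_dir: Path, version: str = AIRFLOW_VERSION) -> List[Path]:
    """Write one minimal, version-pinned Dockerfile per Airflow component."""
    written = []
    for component in COMPONENTS:
        target = stack_dir / component / "Dockerfile"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(f"FROM docker.io/bitnami/{component}:{version}\n")
        written.append(target)
    return written
```

Calling write_dockerfiles(Path("docker/bitnami-apache-airflow-1.10.10")) from the project root produces the same three files as the echo commands.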

Bring Up The Stack

Grab the docker-compose file above and let's get rolling!

cd code/docker/bitnami-apache-airflow-1.10.10
# Bring it up in foreground
docker-compose up
# Bring it up in the background
# docker-compose up -d

If this is your first time running the command this will take some time. Docker will fetch any images it doesn't already have, and build all the airflow-* images.

Navigate to the UI

Once everything is up and running navigate to the UI at http://localhost:8080.
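
When the stack runs in the background it can be hard to tell when the web UI is actually ready. Here is a minimal sketch that polls the URL until the server answers. It uses only the standard library and treats any HTTP response, even an error status, as "up"; the helper names and the retry numbers are my own choices, not anything Airflow or Bitnami prescribes.

```python
import time
import urllib.error
import urllib.request


def http_responds(url: str, timeout: float = 3.0) -> bool:
    """True if the server answers at all, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # got an HTTP response, so the web server is up
    except OSError:
        return False  # connection refused, timed out, DNS failure, ...


def wait_for(probe, attempts: int = 60, delay: float = 5.0, sleep=time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, wait_for(lambda: http_responds("http://localhost:8080")) returns True as soon as the web server starts answering, or False after roughly five minutes with the defaults shown.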

Unless you changed the configuration, your default username/password is user/bitnami.

Log in to check out your Airflow web UI!

Add in a Custom DAG

Here's a DAG that I grabbed from the Apache Airflow Tutorial. I've only included it here for the sake of completeness.

from datetime import timedelta

# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
# Operators; we need this to operate!
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'sla': timedelta(hours=2),
    # 'execution_timeout': timedelta(seconds=300),
    # 'on_failure_callback': some_function,
    # 'on_success_callback': some_other_function,
    # 'on_retry_callback': another_function,
    # 'sla_miss_callback': yet_another_function,
    # 'trigger_rule': 'all_success'
}
dag = DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)

t2 = BashOperator(
    task_id='sleep',
    depends_on_past=False,
    bash_command='sleep 5',
    retries=3,
    dag=dag,
)
dag.doc_md = __doc__

t1.doc_md = """\
#### Task Documentation
You can document your task using the attributes `doc_md` (markdown),
`doc` (plain text), `doc_rst`, `doc_json`, `doc_yaml` which gets
rendered in the UI's Task Instance Details page.
![img](http://montcs.bloomu.edu/~bobmon/Semesters/2012-01/491/import%20soul.png)
"""
templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7)}}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    depends_on_past=False,
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag,
)

t1 >> [t2, t3]

Anyway, grab this file and put it in your code/docker/bitnami-apache-airflow-1.10.10/dags folder. The name of the file itself doesn't matter; the DAG name is whatever you set in the file.

Airflow picks up new DAG files automatically, though it can take a little while, and if you refresh the UI you should see your new tutorial DAG listed.
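
To see that the DAG name comes from the file contents and not the file name, you can list the ids passed to DAG(...) without importing Airflow at all. This is a sketch of my own using the standard library ast module (Python 3.8+). It only finds ids written as string literals, and it is not how Airflow itself discovers DAGs, since Airflow imports the file and looks for DAG objects.

```python
import ast
from typing import List


def dag_ids_in_source(source: str) -> List[str]:
    """Return the DAG ids passed as string literals to DAG(...) calls."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handles both DAG(...) and something.DAG(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "DAG":
            continue
        # The id is either the first positional argument or the dag_id keyword.
        candidates = list(node.args[:1])
        candidates += [kw.value for kw in node.keywords if kw.arg == "dag_id"]
        for arg in candidates:
            if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                found.append(arg.value)
    return found
```

Running it over the text of the file above finds the single id tutorial, whatever the file happens to be called.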

Build Custom Airflow Docker Containers

If you'd like to add additional system or Python packages you can do so.

# code/docker/bitnami-apache-airflow-1.10.10/airflow/Dockerfile
FROM docker.io/bitnami/airflow:1.10.10
# From here - https://github.com/bitnami/bitnami-docker-airflow/blob/master/1/debian-10/Dockerfile

USER root

RUN apt-get update && apt-get upgrade -y && \
    apt-get install -y vim && \
    rm -r /var/lib/apt/lists /var/cache/apt/archives

RUN bash -c "source /opt/bitnami/airflow/venv/bin/activate && \
    pip install flask-restful && \
    deactivate"

To be clear, I don't especially endorse this approach anymore, except that I like to add flask-restful for creating custom REST API plugins.

I like to treat Apache Airflow the way I treat web applications. I've been burned too many times, so now my web apps take care of routing and rendering views, and absolutely nothing else.

Airflow is about the same, except it handles the business logic of my workflows and absolutely nothing else. If I have some crazy pandas/tensorflow/opencv/whatever stuff I need to do I'll build that into a separate microservice and not touch my main business logic. I like to think of Airflow as the spider that sits in the web.

Still, I'm paranoid enough that I like to build my own images so I can then push them to my own docker repo.

Wrap Up and Where to go from here

Now that you have your foundation it's time to build out your data science workflows! Add some custom DAGs, create some custom plugins, and generally build stuff.

If you'd like to get the full picture of all my Apache Airflow tips and tricks, including:

  • Best practices for separating out your business logic from your Airflow Application
  • Build out custom plugins with REST APIs and Flask Blueprints
  • Deploy to production with Helm
  • Patch your Airflow instance to use CORS to build out interfaces with React, Angular, or another system
  • CI/CD scripts to build and deploy your custom Docker images

Please check out the Apache Airflow Project Lab.

Cheat Sheet

Here are some hopefully helpful commands and resources.

Log into your Apache Airflow Instance

The default username and password are user and bitnami.

Docker Compose Commands

Build

cd code/docker/bitnami-apache-airflow-1.10.10/
docker-compose build

Bring up your stack! Running docker-compose up in the foreground sends all your logs to STDERR/STDOUT.

cd code/docker/bitnami-apache-airflow-1.10.10/
docker-compose build && docker-compose up

If you'd like to run it in the background instead, use -d.

cd code/docker/bitnami-apache-airflow-1.10.10/
docker-compose build && docker-compose up -d

Bitnami Apache Airflow Configuration

You can further customize your Airflow instance using environment variables that you pass into the docker-compose file. Check out the README for details.

Load DAG files

Custom DAG files can be mounted to /opt/bitnami/airflow/dags.

Specifying Environment variables using Docker Compose

version: '2'

services:
  airflow:
    image: bitnami/airflow:1.10.10
    environment:
      - AIRFLOW_FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
      - AIRFLOW_EXECUTOR=CeleryExecutor
      - AIRFLOW_DATABASE_NAME=bitnami_airflow
      - AIRFLOW_DATABASE_USERNAME=bn_airflow
      - AIRFLOW_DATABASE_PASSWORD=bitnami1
      - AIRFLOW_PASSWORD=bitnami123
      - AIRFLOW_USERNAME=user
      - AIRFLOW_EMAIL=user@example.com
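
The AIRFLOW_FERNET_KEY above is a published sample value, so generate your own for anything beyond a throwaway dev stack. A Fernet key is 32 random bytes, URL-safe base64 encoded, which is what the cryptography package's Fernet.generate_key() produces. Here is a minimal standard-library sketch of the same thing; the function name is my own.

```python
import base64
import os


def generate_fernet_key() -> str:
    """Return a fresh key: 32 random bytes, URL-safe base64 encoded."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")
```

Paste the result of generate_fernet_key() into AIRFLOW_FERNET_KEY, and use the same value for the web, scheduler, and worker services so they can all decrypt the same stored connections.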

Clean up after Docker

Docker can take up a lot of room on your filesystem.

If you'd like to clean up just the Airflow stack then:

cd code/docker/bitnami-apache-airflow-1.10.10
docker-compose stop
docker-compose rm -f -v

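Running docker-compose rm -f removes the stopped containers without asking for confirmation, and -v also removes any anonymous volumes attached to them. The named volumes declared in the compose file (postgresql_data, redis_data, and the airflow ones) are left in place; run docker-compose down -v if you want those gone as well.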

Remove all docker images everywhere

This stops every running container on the machine, then removes all stopped containers, unused networks, and unused images, not just the Airflow ones.

docker container stop $(docker container ls -aq)
docker system prune -f -a

This does the same and also removes data volumes:

docker system prune -f -a --volumes