Apache Airflow Tutorial – Part 2: Install with Docker


Install Apache Airflow With Docker - Overview

In this part of the series I will cover how to get a nice Apache Airflow instance up and running with docker. You won't need to have anything installed locally besides docker, which is fantastic, because configuring all these pieces individually would be kind of awful!

This is the exact same setup and configuration I use for my own Apache Airflow instances. When I run Apache Airflow in production I don't use Postgres in a docker container, as that is not recommended, but this setup is absolutely perfect for dev and will very closely match your production requirements!

Following along with a blog post is great, but the best way to learn is to just jump in and start building. Get the Apache Airflow Docker Dev Stack here.

Celery Job Queue

Getting an instance of Apache Airflow up and running looks very similar to setting up a Celery instance. That's because Airflow uses Celery behind the scenes to execute tasks. Read more about Celery and its architecture in my blog post here.

Airflow Instance Configuration with Docker Compose

If you have read absolutely anything else on this blog you will realize I am a bit nutty about docker. I love docker. I want everyone to use docker, because then I get to spend more time deploying more cool stuff instead of debugging what went wrong with (still) cool stuff.

An Airflow instance is fairly complex. It has a scheduler, one or more workers, a web UI, a message queueing system, and a database backend to store results.

I would suggest not trying to install all these things by hand on your local computer, because that would be painful. Ok? Just don't do it. Trust me here.

 

Services

Docker compose composes one or more docker containers into services. In this case we have a lot, so let's cover them one by one!
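To give you a map before we dig in, here is roughly the shape of the docker-compose.yml. Treat this as a sketch: apart from airflow_postgres_db and airflow_webserver, which show up in the commands later in this post, the service names and image tags are placeholders, and the real file in the repo has more detail.

version: "3"

services:
  rabbitmq:              # message broker Celery uses to hand work to the workers
    image: rabbitmq:3-management
  celery_results_db:     # Postgres backend for Celery task results
    image: postgres:11
  airflow_postgres_db:   # Postgres backend for Airflow's own metadata
    image: postgres:11
  adminer:               # web UI for poking at the databases
    image: adminer
  airflow_initdb:        # one-shot container that runs `airflow initdb`
    image: your-airflow-image
  airflow_scheduler:     # decides what should run, and when
    image: your-airflow-image
  airflow_worker:        # actually executes the tasks
    image: your-airflow-image
  airflow_webserver:     # the web UI, published on localhost:8089
    image: your-airflow-image
YAML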

 

Message Queue - RabbitMQ Service

Celery uses a message queueing service to communicate between its workers and the scheduler. You may be wondering why you can't just write a task that returns True/False and move on with your life. You can't because all your tasks run in isolation, using either threads or processes, and therefore can't talk to other tasks or the parent process without something in between.

If you find this confusing, don't worry about it; just know that there is something there that acts as a bridge between the scheduler and the workers that execute the tasks.
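Here is a minimal sketch of what the broker service can look like in the compose file (this lives under services:). The image tag, credentials, and published port are illustrative, not necessarily what the stack in the repo uses.

rabbitmq:
  image: rabbitmq:3-management      # the -management tag bundles a web admin UI
  environment:
    RABBITMQ_DEFAULT_USER: airflow
    RABBITMQ_DEFAULT_PASS: airflow
  ports:
    - "15672:15672"                 # RabbitMQ management UI, handy for watching queues
YAML

The management UI is worth exposing in dev; being able to watch messages pile up in a queue makes Celery a lot less mysterious.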

 

Database Backend - Celery Results Postgres DB

This is a PostgreSQL instance for the Celery result backend, which is where Celery keeps task states and return values. If I ever want to poke around in it, I also have Adminer installed!
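A sketch of what that service can look like (under services:), with placeholder database name and credentials:

celery_results_db:
  image: postgres:11
  environment:
    POSTGRES_DB: celery_results     # placeholder names -- match them to your
    POSTGRES_USER: celery           # Celery result backend setting
    POSTGRES_PASSWORD: celery
YAML

Airflow's [celery] result_backend setting would then point at it with a URL along the lines of db+postgresql://celery:celery@celery_results_db:5432/celery_results.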

 

Database Backend - Airflow Postgres DB

This is the actual airflow database. It has tables for DAGs, tasks, users, and roles.

I could have used MySQL for this, but timestamps are treated a bit differently between MySQL and PostgreSQL. Just using PostgreSQL was the path of least resistance, and since I don't ever directly interact with the DB I don't really care much.
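A sketch of the metadata database service (under services:); the credentials are placeholders, but the service name matches the -h flag you'll see in the wait-for-it.sh commands below.

airflow_postgres_db:
  image: postgres:11
  environment:
    POSTGRES_DB: airflow
    POSTGRES_USER: airflow
    POSTGRES_PASSWORD: airflow
YAML

Airflow finds this database through sql_alchemy_conn in airflow.cfg (or the AIRFLOW__CORE__SQL_ALCHEMY_CONN environment variable), with a connection string along the lines of postgresql+psycopg2://airflow:airflow@airflow_postgres_db:5432/airflow.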

 

Database UI - Adminer

Adminer is a web-based database UI. I kind of prefer phpMyAdmin, but that is only available for MySQL, which we aren't using here. I only use Adminer for curiosity's sake and the occasional bit of debugging.
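A sketch of the Adminer service (under services:); the host port here is arbitrary, so use whatever the repo's compose file publishes.

adminer:
  image: adminer
  ports:
    - "8090:8080"    # Adminer listens on 8080 inside the container
YAML

Once it's up, log in with the Postgres service name (airflow_postgres_db or the results DB) as the server, plus the user and password from the compose file.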

 

Airflow Database Initializer

This runs the airflow initdb command, which initializes the Airflow tables in the PostgreSQL database.

This could technically be combined with any of the other airflow_* services, but I kind of like keeping it separate. 
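As a sketch, the initializer is just another container built from the same Airflow image (the image name here is a placeholder) that waits for Postgres and then runs the one command:

airflow_initdb:
  image: your-airflow-image
  depends_on:
    - airflow_postgres_db
  command: bash -c "/home/airflow/scripts/wait-for-it.sh -p 5432 -h airflow_postgres_db -- airflow initdb"
YAML

Because it exits once initdb finishes, it doesn't hang around like the other services do.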

 

Airflow Scheduler

If you've ever worked on an HPC cluster, this is sort of analogous to SLURM / SGE / PBS. If you haven't, it's very much what it sounds like: it figures out which tasks should run and when, based on their execution dates and concurrency parameters.

 

Command

bash -c "/home/airflow/scripts/wait-for-it.sh -p 5432 -h airflow_postgres_db -- sleep 120; airflow scheduler"
Bash
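That command waits until Postgres is answering on airflow_postgres_db:5432, sleeps a couple of minutes so initdb has time to finish, and then starts the scheduler. Wrapped in a compose service it looks roughly like this (the image name is a placeholder, and the depends_on entries assume the service names from the sketch above):

airflow_scheduler:
  image: your-airflow-image
  depends_on:
    - airflow_postgres_db
    - rabbitmq
  command: bash -c "/home/airflow/scripts/wait-for-it.sh -p 5432 -h airflow_postgres_db -- sleep 120; airflow scheduler"
YAML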

Airflow Worker

The Airflow Worker is what actually runs your tasks. 

By putting this in its own container, we could potentially start playing around with the concurrency, create many instances of the worker, and carry on with world domination.

 

Command

bash -c "/home/airflow/scripts/wait-for-it.sh -p 5432 -h airflow_postgres_db -- sleep 120; airflow worker"
Bash
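The worker service looks almost identical to the scheduler; again, the image name and the concurrency value below are placeholders, not what the repo necessarily ships with.

airflow_worker:
  image: your-airflow-image
  depends_on:
    - airflow_postgres_db
    - rabbitmq
  environment:
    # how many tasks one worker runs at a time; the key is worker_concurrency
    # in Airflow 1.10+ (older releases call it celeryd_concurrency)
    AIRFLOW__CELERY__WORKER_CONCURRENCY: "4"
  command: bash -c "/home/airflow/scripts/wait-for-it.sh -p 5432 -h airflow_postgres_db -- sleep 120; airflow worker"
YAML

Since the worker doesn't publish any ports (and as long as the service doesn't pin a container_name), you can also just run more of them with docker-compose up -d --scale airflow_worker=3.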

Airflow Webserver

This brings up the super pretty web UI that you see all kinds of screenshots of.

 

Command

bash -c "/home/airflow/scripts/wait-for-it.sh -p 5432 -h airflow_postgres_db -- sleep 120; airflow webserver"
Bash
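Wrapped in a service, with the port mapping that makes the localhost:8089 URL below work (8080 is the webserver's default port inside the container; the image name is a placeholder):

airflow_webserver:
  image: your-airflow-image
  depends_on:
    - airflow_postgres_db
  ports:
    - "8089:8080"
  command: bash -c "/home/airflow/scripts/wait-for-it.sh -p 5432 -h airflow_postgres_db -- sleep 120; airflow webserver"
YAML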

Volumes

You will see the dags, scripts, pkgs, and ssh directories bound in the airflow instance. You will also see /var/run/docker.sock bound, which we will go over later when we cover the DockerOperator.
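As a sketch, the volumes block on the airflow_* services looks something like the following. The container-side paths other than /home/airflow/scripts (which the wait-for-it.sh commands rely on) are my guesses; check the repo's compose file for the real ones.

volumes:
  - ./dags:/home/airflow/airflow/dags            # your DAG definitions
  - ./scripts:/home/airflow/scripts              # helper scripts like wait-for-it.sh
  - ./pkgs:/home/airflow/pkgs                    # extra packages to install or import
  - ./ssh:/home/airflow/.ssh                     # keys for hosts your DAGs need to reach
  - /var/run/docker.sock:/var/run/docker.sock    # lets containers talk to the docker daemon (DockerOperator)
YAML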

 

Bring it Up

Once you have the code from GitHub, bring up the docker-compose instance with:

unzip airflow-template.zip
cd airflow-template
docker-compose up -d
docker-compose logs airflow_webserver
Bash

You will have to wait a minute or so for everything to initialize.

Open up http://localhost:8089 in your browser and see the airflow magic!

Troubleshoot - relation "log" does not exist

You may see an error along the lines of relation "log" does not exist. That most likely just means the webserver hit the database before airflow initdb had finished creating its tables. Either Ctrl+C and bring the instance back up, or docker-compose restart, and it will be fine.

Wrap Up

That's it for Part 2. Hopefully you now have a sense of how the different pieces of the Airflow architecture relate to one another, and how to deploy them with docker.
