Apache Airflow Tutorial – Part 2 Install with Docker
Mar 09, 2019
Install Apache Airflow With Docker Overview
In this part of the series I will cover how to get a nice Apache Airflow instance up and running with docker. You won't need to have anything installed locally besides docker, which is fantastic, because configuring all these pieces individually would be kind of awful!
This is the exact same setup and configuration I use for my own Apache Airflow instances. When I run Apache Airflow in production I don't use Postgres in a docker container, as that is not recommended, but this setup is absolutely perfect for dev and will very closely match your production requirements!
Following along with a blog post is great, but the best way to learn is to just jump in and start building. Get the Apache Airflow Docker Dev Stack here.
Celery Job Queue
Getting an Apache Airflow instance up and running looks very similar to bringing up a Celery instance. This is because Airflow, when it runs with the CeleryExecutor as it does in this setup, uses Celery behind the scenes to execute tasks. Read more about Celery and its Architecture with my blog post here.
Airflow Instance Configuration with Docker Compose
If you have read absolutely anything else on this blog you will realize I am a bit nutty about docker. I love docker. I want everyone to use docker, because then I get to spend more time deploying more cool stuff instead of debugging what went wrong with (still) cool stuff.
An Airflow instance is fairly complex. It has a scheduler, one or more workers, a web UI, a message queueing system, and a database backend to store results.
I would suggest not trying to install all these things by hand on your local computer, because that would be painful. Ok? Just don't do it. Trust me here.
Docker compose composes one or more docker containers into services. In this case we have a lot, so let's cover them one by one!
Message Queue - RabbitMQ Service
Celery uses a message queueing service to communicate between its workers and the scheduler. You may be wondering why you can't just write a task that returns True/False and move on with your life. You can't because all your tasks run in isolation, using either threads or processes, and therefore can't talk to other tasks or the parent process without something in between.
If you find this confusing don't worry about it, just know that there is something there that acts as a bridge between tasks and the thing that executes the tasks.
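If a picture in code helps, here is a minimal sketch of that bridge. It is not Celery or RabbitMQ, just an in-process stand-in using Python's standard library: the workers run in isolation and only ever talk to the parent through queues. The names worker and run_tasks, and the even/odd "task", are made up for illustration.

```python
import queue
import threading


def worker(task_queue, result_queue):
    """Pull tasks off the queue until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            break
        name, number = task
        # The "task" here just reports whether the number is even.
        result_queue.put((name, number % 2 == 0))


def run_tasks(tasks, n_workers=2):
    """Hand tasks to isolated workers and collect their True/False results."""
    task_queue = queue.Queue()
    result_queue = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(task_queue, result_queue))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for task in tasks:
        task_queue.put(task)
    # One sentinel per worker tells each of them to shut down.
    for _ in workers:
        task_queue.put(None)
    for w in workers:
        w.join()
    # All workers have exited, so the result queue is complete.
    results = {}
    while not result_queue.empty():
        name, ok = result_queue.get()
        results[name] = ok
    return results
```

Calling run_tasks([("a", 2), ("b", 3)]) hands both tasks to the workers and returns a dict of results keyed by task name. The parent never calls the task directly; everything goes through the queues, which is the role RabbitMQ plays for Celery.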
Database Backend - Celery Results Postgres DB
This is a PostgreSQL instance for the Celery results backend, which is where Celery records the state and return value of the tasks it runs. I haven't dug through the tables myself, but I could always go and investigate since I also have adminer installed!
Database Backend - Airflow Postgres DB
This is the actual airflow database. It has tables for DAGs, tasks, users, and roles.
I could have used MySQL for this, but timestamps are treated a bit differently between MySQL and PostgreSQL. Just using PostgreSQL was the path of least resistance, and since I don't ever directly interact with the DB I don't really care much.
Database UI - Adminer
Adminer is a web based database UI. I kind of prefer phpmyadmin, but that is only available for MySQL, which we aren't using here. I only use it for curiosity's sake and the occasional debugging.
Airflow Database Initializer
This runs the airflow initdb command, which initializes the PostgreSQL database.
This could technically be combined with any of the other airflow_* services, but I kind of like keeping it separate.
Airflow Scheduler
If you've ever worked on an HPC cluster, the scheduler is sort of analogous to SLURM / SGE / PBS. If you haven't, it's very much what it sounds like. It figures out which tasks should run when, based on their execution dates and concurrency parameters.
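To make that idea concrete, here is a toy sketch of picking tasks by execution date and concurrency. This is not how Airflow's scheduler is actually implemented; pick_runnable, the dict layout, and the parameter names are all hypothetical.

```python
from datetime import datetime


def pick_runnable(tasks, now, running, max_concurrency):
    """Return the ids of tasks that are due and fit in the free slots.

    tasks is a list of dicts with "task_id" and "execution_date" keys
    (a made-up layout for this sketch, not Airflow's data model).
    """
    free_slots = max(max_concurrency - running, 0)
    # A task is due once its execution date has passed; oldest first.
    due = sorted(
        (t for t in tasks if t["execution_date"] <= now),
        key=lambda t: t["execution_date"],
    )
    return [t["task_id"] for t in due[:free_slots]]


if __name__ == "__main__":
    example = [
        {"task_id": "extract", "execution_date": datetime(2019, 3, 8)},
        {"task_id": "transform", "execution_date": datetime(2019, 3, 9)},
        {"task_id": "load", "execution_date": datetime(2019, 3, 10)},
    ]
    # Two tasks are due, but only one slot is free.
    print(pick_runnable(example, datetime(2019, 3, 9, 12), running=1, max_concurrency=2))
```

The real scheduler also has to deal with dependencies between tasks, retries, and pools, but the core question is the same: what is due, and is there room to run it?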
bash -c "/home/airflow/scripts/wait-for-it.sh -p 5432 -h airflow_postgres_db -- sleep 120; airflow scheduler"
Airflow Worker
The Airflow Worker is what actually runs your tasks.
By putting this in its own container, we could potentially start playing around with the concurrency, create many instances of the worker, and carry on with world domination.
bash -c "/home/airflow/scripts/wait-for-it.sh -p 5432 -h airflow_postgres_db -- sleep 120; airflow worker"
Airflow Webserver
The webserver service brings up the super pretty web UI that you see all kinds of screenshots of.
bash -c "/home/airflow/scripts/wait-for-it.sh -p 5432 -h airflow_postgres_db -- sleep 120; airflow webserver"
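The scheduler, worker, and webserver commands all start the same way: wait-for-it.sh holds things up until port 5432 on airflow_postgres_db accepts connections, then the sleep runs, then the airflow command starts. Roughly speaking, the port check amounts to something like the Python sketch below. This is only an illustration of the idea, not the script's actual implementation, and wait_for_port and its parameters are hypothetical names.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Inside the compose network you would call it as wait_for_port("airflow_postgres_db", 5432). Note that an open port only tells you Postgres is listening, not that the airflow tables exist yet, which is presumably why the commands also sleep before starting.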
You will see the dags, scripts, pkgs, and ssh directories bound in the airflow instance. You will also see /var/run/docker.sock bound, which we will go over later when we cover the DockerOperator.
Bring it Up
Once you have the code from GitHub, bring up the docker compose instance with -
unzip airflow-template.zip
cd airflow-template
docker-compose up -d
docker-compose logs airflow_webserver
You will have to wait a minute or so for everything to initialize.
Open up http://localhost:8089 in your browser and see the airflow magic!
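If you would rather not sit there refreshing the browser, a small polling loop can tell you when the UI starts answering. This is a sketch using only the Python standard library; wait_for_http and its defaults are hypothetical, and the only thing taken from this stack is the URL.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=120.0, interval=2.0):
    """Return the HTTP status once url responds, or None on timeout."""
    # Skip any configured proxy, since the target is a local service.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener.open(url, timeout=interval) as resp:
                return resp.status
        except urllib.error.HTTPError as exc:
            # The server answered, just not with a success status.
            return exc.code
        except (urllib.error.URLError, OSError):
            # Nothing is answering yet: retry until the deadline.
            if time.monotonic() >= deadline:
                return None
            time.sleep(interval)
```

For this stack that would be wait_for_http("http://localhost:8089"). Any status back means the webserver is at least up; None means it never answered within the timeout, and the webserver logs are the next place to look.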
Troubleshoot - Relation Log
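You may see an error saying something about 'relation log' not existing. I'm not totally sure why that happens, but my best guess is that one of the services queried the database before the initializer finished creating the tables. If you either Ctrl+C and restart the instance or run docker-compose restart, it will be fine.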
That's it for Part 2. Hopefully you now understand how the different pieces of the Airflow architecture relate to one another, and how to deploy them using docker.