Your company has data. Possibly lots of it. Probably stored in many places.
You likely also have at least one scheduled data transformation process (probably more) where you load data into a data warehouse, aggregate log data, and/or extract-transform-load (ETL) data for integration into third-party data systems.
These are commonly called "data pipelines".
People typically start out managing these pipelines with cron jobs, but with the volumes of data that modern apps can churn through, cron won't cut it for long.
Data Pipelines And Directed Acyclic Graphs
A pipeline manager executes its tasks on a recurring schedule, e.g., daily or hourly. But that's about the only similarity with cron.
Data pipelines usually follow a directed acyclic graph (DAG) pattern:
a series of tasks executed in a specific order. Each task requires that its upstream tasks succeed before it executes.
DAGs are a very powerful concept; for instance, the abstract syntax tree at the heart of any computer program is a type of directed acyclic graph with specific limitations.
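The execution model is easy to sketch in plain Python: given tasks and their upstream dependencies, run each task only after everything it depends on has succeeded. This is a toy illustration of the idea, not Airflow code, and the task names are hypothetical:

```python
# A toy DAG: each task maps to the upstream tasks that must succeed first.
# Task names are hypothetical placeholders for real pipeline steps.
deps = {
    "extract_logs": [],
    "load_warehouse": ["extract_logs"],
    "aggregate_logs": ["extract_logs"],
    "report": ["load_warehouse", "aggregate_logs"],
}


def execution_order(deps):
    """Return a valid run order: every task appears after all its upstream tasks."""
    order, done = [], set()
    while len(done) < len(deps):
        # A task is ready when all of its upstream tasks have completed.
        ready = [t for t in deps
                 if t not in done and all(u in done for u in deps[t])]
        if not ready:
            raise ValueError("cycle detected - not a DAG")
        for t in sorted(ready):  # sort for a deterministic order among peers
            order.append(t)
            done.add(t)
    return order


print(execution_order(deps))
# -> ['extract_logs', 'aggregate_logs', 'load_warehouse', 'report']
```

A real pipeline manager layers scheduling, state tracking, retries, and notifications on top of exactly this ordering constraint.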
You might be using cron for your data pipelines now. Or perhaps you have something more sophisticated, like Amazon Web Services' eponymous Data Pipeline service, or even a home-grown solution.
Across the board, these types of systems act to:
- schedule execution,
- manage workflow,
- track the states of tasks,
- and notify on task events like failure, success, etc.
These solutions also might support:
- branched execution,
- parallel task execution,
- and reprocessing.
A solution might also have a CLI (command-line interface) and/or a web application that lets you visually inspect and interact with pipelines, track task execution status, and run ad-hoc reports.
In addition, you may have disparate data systems, like Hadoop/HDFS and Hive, raw files stored in Amazon S3, and relational database management systems (e.g., MySQL, Postgres, Microsoft SQL Server). Your existing solutions are probably custom to your business needs and may require a great deal of engineering resources to extend.
Airflow, Airbnb's open-source pipeline manager, gets it right:
- It has a rich CLI and a Web UI
- It supports reprocessing old data via its CLI's backfill command
- Easy to extend with Python code or Bash commands
- Scale-out: distributed task processing with Celery
- Multi-tenancy: a single Airflow instance can orchestrate disparate pipelines
- Task-level callbacks: you can alert on failure, e.g., post a message to engineering team IM
- Fine-grained intervals: e.g., configure a DAG to execute every two hours instead of on a daily basis
- Branching task execution
- Wide and growing support for different database systems
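To make the list above concrete, here is a minimal sketch of what an Airflow DAG definition looks like, assuming the 1.x-era API (import paths and operator locations have moved between Airflow versions). The DAG ID, owner, task IDs, commands, and schedule are hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators import BashOperator  # 1.x-era import path

default_args = {
    "owner": "data-eng",                    # hypothetical owner
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "start_date": datetime(2015, 8, 1),
}

# Fine-grained intervals: run every two hours instead of daily.
dag = DAG("example_etl", default_args=default_args,
          schedule_interval="0 */2 * * *")

extract = BashOperator(task_id="extract",
                       bash_command="echo extract",  # placeholder command
                       dag=dag)
load = BashOperator(task_id="load",
                    bash_command="echo load",        # placeholder command
                    dag=dag)

# `load` runs only after `extract` succeeds.
load.set_upstream(extract)
```

Drop a file like this into Airflow's DAGs folder and the scheduler picks it up; the CLI and Web UI then let you test, trigger, and monitor the tasks.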
Airbnb announced Airflow in summer 2015.
I'll let the announcement speak for Airflow's capabilities and why Airbnb thought it necessary for their needs.
Airflow is written in Python, is installable via pip, and supports a variety of data systems: Hive, MySQL, et cetera.
Airflow has an extensible object model, so you can readily extend it via Python code and connect it to anything that has a Python library.
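For example, extending Airflow can be as simple as subclassing BaseOperator and implementing execute. This is a sketch with hypothetical names, using 1.x-era import paths (apply_defaults later moved within airflow.utils):

```python
import logging

from airflow.models import BaseOperator
from airflow.utils import apply_defaults  # 1.x-era import path


class ExampleServiceOperator(BaseOperator):
    """Hypothetical operator wrapping some service's Python client."""

    @apply_defaults
    def __init__(self, endpoint, *args, **kwargs):
        super(ExampleServiceOperator, self).__init__(*args, **kwargs)
        self.endpoint = endpoint

    def execute(self, context):
        # Anything with a Python library fits here: run a query,
        # call an HTTP API, move files between systems, et cetera.
        logging.info("calling %s", self.endpoint)
```

Once defined, the operator is instantiated in a DAG file just like the built-in operators.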
The Airflow documentation is good, so I won't retread its content. However, Airflow is undergoing rapid development. There may come a time when you need to run a more recent version than the version published to PyPI.
To run Airflow from the HEAD of the master branch:
<pre><code class="nohighlight hljs">pip install --upgrade git+https://github.com/airbnb/airflow.git@master#egg=airflow</code></pre>
Also, Airflow uses setuptools extras (its docs call them subpackages) to pull in dependencies for specific data systems. You can install the HEAD of a subpackage as follows:
<pre><code class="nohighlight hljs">pip install --upgrade git+https://github.com/airbnb/airflow.git@master#egg=airflow[hive]</code></pre>
For more context, check out the Airflow installation docs.
In our careers, we've authored multiple custom data pipeline systems and used other published systems. We've worked with numerous disparate database systems, including MySQL, Postgres, Hive, MongoDB, Microsoft SQL Server, and, of course, structured and unstructured log files. The most important part of any data pipeline system is error handling and the activities an operator must perform in response to errors, e.g., backfilling data. Airflow gets this right. We've set it up in production and been quite happy with it.
Your company can (probably) put Airflow to good use too. Airflow allows for rapid iteration and prototyping, and Python is a great glue language: it has broad database library support and is trivial to integrate with AWS via Boto.
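As a taste of that glue, here is a sketch of pushing a pipeline artifact to S3 with boto (the AWS SDK for Python of the era). The bucket, key, and path are hypothetical, and credentials are assumed to come from your environment or boto config:

```python
import boto


def upload_report(bucket_name, key_name, path):
    """Push a locally generated file to S3."""
    conn = boto.connect_s3()            # credentials from env / boto config
    bucket = conn.get_bucket(bucket_name)
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(path)


# Hypothetical usage from a pipeline task:
# upload_report("my-data-bucket", "reports/daily.csv", "/tmp/daily.csv")
```

A function like this drops straight into a PythonOperator task, which is exactly the kind of integration that makes Airflow pleasant to extend.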
You can even use Ansible, Panda Strike's favorite configuration management system, within a DAG, via its Python API, to do more automation within your data pipelines:
- deploy and configure Airflow Celery workers
- provision ephemeral database systems - e.g. Hadoop/EMR cluster, MySQL/RDS, Redis/Elasticache, S3 buckets, et cetera
- create CloudWatch alarms
- perform maintenance tasks - e.g. cleanup ephemeral files from file systems, HDFS, etc.
- provision and control robot armies
Thanks to Airbnb for open sourcing such a well-conceived and well-executed project!