Apache Airflow helped us scale from 10 to 100+ users across 20+ teams with a variety of use cases. By writing our own plugins and creating custom user roles, we took load off our infrastructure team and gave power back to the Airflow users.
What was the problem?
Many years ago we started out with our own orchestration framework. Given all the custom functionality we required, that made sense at the time. However, we quickly realized that building an orchestration tool is not to be underestimated. As the number of users and teams grew, the time spent fixing issues increased, severely limiting development speed. Furthermore, because the framework was not open source, we constantly had to make the effort ourselves to stay up to date with industry standards and tools. We needed a tool for our Big Data Platform to schedule and execute many ETL jobs while, at the same time, giving our users the possibility to redo or undo their tasks.
How did Apache Airflow help to solve this problem?
Apache Airflow enabled us to extend the already existing operators and sensors to make writing ETL DAGs as easy as possible. Within a couple of minutes of training, data scientists are able to write their own DAGs containing an Apache Spark job and its corresponding dependencies. The Web UI allows our data scientists to closely monitor the status and logs of their jobs so that they can quickly intervene if something is not going as planned. We created our own access groups so that teams have full privileges on their own DAGs but only read privileges on other teams' DAGs.
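For illustration, a minimal DAG of this kind might look like the sketch below. It uses the SparkSubmitOperator from the apache-spark provider as a stand-in for our in-house operators, and DAG-level access_control as one way to give a team edit rights on its own DAG; the connection id, role name, schedule, and file paths are placeholders, not our production values.

```python
# Minimal sketch of a team-owned ETL DAG: one Spark job plus a downstream step.
# "spark_default", "team_a", and the application path are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="team_a_daily_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # DAG-level access control: the owning team can edit, other teams only read.
    access_control={"team_a": {"can_read", "can_edit"}},
) as dag:
    spark_job = SparkSubmitOperator(
        task_id="run_spark_etl",
        conn_id="spark_default",
        application="/jobs/etl_job.py",
    )
    publish = DummyOperator(task_id="publish_results")

    spark_job >> publish
```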
One powerful feature of Apache Airflow is the ability to backfill. This is helpful when new tasks are introduced or old jobs need to be rerun. By creating our own plugin for Apache Airflow, we built a simple tool to streamline backfilling. Besides clearing the runs, it also clears the underlying data that was generated by the Spark job. Coming from Apache Airflow 1.10, this plugin only required minor changes to support Apache Airflow 2.0.
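The sketch below illustrates the idea behind such a tool; it is not our actual plugin code. It assumes the clearTaskInstances endpoint of the Airflow 2.0 stable REST API and a date-partitioned output layout; the base URL, credentials, DAG id, and output path are hypothetical placeholders.

```python
# Illustrative sketch: clear a DAG's task instances for a date range via the
# Airflow REST API, then remove the Spark output partitions for those dates so
# the backfilled runs start from a clean slate.
import shutil
from pathlib import Path

import requests

AIRFLOW_API = "http://localhost:8080/api/v1"  # placeholder base URL
AUTH = ("admin", "admin")  # placeholder credentials


def clear_and_remove_output(dag_id: str, start: str, end: str, output_root: str) -> None:
    # 1. Ask Airflow to clear the task instances so the scheduler re-runs them.
    response = requests.post(
        f"{AIRFLOW_API}/dags/{dag_id}/clearTaskInstances",
        json={
            "dry_run": False,
            "start_date": start,  # e.g. "2021-01-01T00:00:00Z"
            "end_date": end,
            "only_failed": False,
            "reset_dag_runs": True,
        },
        auth=AUTH,
    )
    response.raise_for_status()

    # 2. Remove the data the Spark job wrote for those dates, assuming
    #    partition folders named ds=YYYY-MM-DD under the output root.
    start_ds, end_ds = start[:10], end[:10]
    for partition in Path(output_root).glob("ds=*"):
        ds = partition.name.split("=", 1)[1]
        if start_ds <= ds <= end_ds:
            shutil.rmtree(partition)


if __name__ == "__main__":
    clear_and_remove_output(
        dag_id="team_a_daily_etl",
        start="2021-01-01T00:00:00Z",
        end="2021-01-07T00:00:00Z",
        output_root="/data/team_a_daily_etl",
    )
```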
What are the results?
We started out with a full team working solely on our orchestration tool. With the help of Apache Airflow, we managed to give the responsibility of maintaining DAGs back to the data scientist teams. This allowed us to grow quicker than ever to 20 teams that together own approximately 200 DAGs and over 5000 tasks. In the meantime, our team has been able to extend Apache Airflow further while also focusing on getting other exciting new technologies on board. With Airflow, we now spend our time making progress instead of getting stuck fixing all sorts of issues.