Installation
Getting Airflow
Airflow is published as the apache-airflow package on PyPI. Installing it can sometimes be tricky, however,
because Airflow is a bit of both a library and an application. Libraries usually keep their dependencies open and
applications usually pin them, but Airflow needs to do neither and both at the same time. We decided to keep
our dependencies as open as possible (in setup.py) so that users can install different versions of libraries
if needed. This means that from time to time a plain pip install apache-airflow will not work or will
produce an unusable Airflow installation.
To make installation repeatable, starting from Airflow 1.10.10 (and updated in
Airflow 1.10.12) we also keep a set of “known-to-be-working” constraint files in the
constraints-master and constraints-1-10 orphan branches.
These “known-to-be-working” constraints are maintained per major/minor Python version. You can use them as
constraint files when installing Airflow from PyPI. Note that you have to specify the correct Airflow version
and Python version in the URL.
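The URL pattern described above can be sketched as a small shell helper. This is only an illustration: constraint_url is a hypothetical function name, not something shipped with Airflow, and the URL layout is copied from the install commands on this page. The helper rejects anything that is not a MAJOR.MINOR Python version, because the constraint files are kept per major/minor version.

```shell
# Hypothetical helper: build the constraints URL for one Airflow/Python pair.
constraint_url() {
    local airflow_version="$1"
    local python_version="$2"
    case "${python_version}" in
        *.*.*)
            # A full version such as 3.6.12 does not match any constraint file name.
            echo "expected MAJOR.MINOR, got '${python_version}'" >&2
            return 1
            ;;
        [0-9]*.[0-9]*)
            ;;  # looks like MAJOR.MINOR, for example 3.6
        *)
            echo "expected MAJOR.MINOR, got '${python_version}'" >&2
            return 1
            ;;
    esac
    echo "https://raw.githubusercontent.com/apache/airflow/constraints-${airflow_version}/constraints-${python_version}.txt"
}

constraint_url 1.10.15 3.6
# prints: https://raw.githubusercontent.com/apache/airflow/constraints-1.10.15/constraints-3.6.txt
```

The printed value can then be passed to pip with the --constraint option.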
Prerequisites
On Debian-based Linux distributions:
sudo apt-get update
sudo apt-get install build-essential
Installing just Airflow
AIRFLOW_VERSION=1.10.15
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
# For example: 3.6
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example: https://raw.githubusercontent.com/apache/airflow/constraints-1.10.15/constraints-3.6.txt
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
Note
In November 2020, a new version of pip (20.3) was released with a new resolver (the 2020 resolver). This resolver
does not yet work with Apache Airflow and might lead to installation errors, depending on your choice
of extras. To install Airflow, either downgrade pip to version 20.2.4:
pip install --upgrade pip==20.2.4
or, if you use pip 20.3, add the option
--use-deprecated legacy-resolver
to your pip install command.
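As a rough sketch, the choice described in the note can be automated by inspecting the pip version. pip_resolver_flag is a hypothetical helper name, and the sketch assumes that every pip release from 20.3 onwards needs the flag; the note itself only mentions 20.3.

```shell
# Hypothetical helper: print the legacy-resolver option when the given pip
# version is 20.3 or newer (assumption: later releases behave like 20.3).
pip_resolver_flag() {
    local version="$1"
    # Treat a bare major version such as "21" as "21.0".
    case "${version}" in
        *.*) ;;
        *) version="${version}.0" ;;
    esac
    local major="${version%%.*}"
    local rest="${version#*.}"
    local minor="${rest%%.*}"
    # Drop pre-release suffixes such as the "b1" in "20.3b1".
    minor="${minor%%[!0-9]*}"
    if [ "${major}" -gt 20 ] || { [ "${major}" -eq 20 ] && [ "${minor}" -ge 3 ]; }; then
        echo "--use-deprecated legacy-resolver"
    fi
}

pip_resolver_flag 20.2.4   # prints nothing
pip_resolver_flag 20.3     # prints: --use-deprecated legacy-resolver

# Possible use (commented out so that nothing gets installed by accident):
# PIP_VERSION="$(pip --version | cut -d " " -f 2)"
# pip install "apache-airflow==${AIRFLOW_VERSION}" $(pip_resolver_flag "${PIP_VERSION}") --constraint "${CONSTRAINT_URL}"
```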
Installing with extras (for example postgres, google)
AIRFLOW_VERSION=1.10.15
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow[postgres,google]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
You need certain system-level packages in order to install Airflow. The following are known to be needed on Linux (tested on Debian Buster):
sudo apt-get install -y --no-install-recommends \
freetds-bin \
krb5-user \
ldap-utils \
libffi6 \
libsasl2-2 \
libsasl2-modules \
libssl1.1 \
locales \
lsb-release \
sasl2-bin \
sqlite3 \
unixodbc
You also need database client packages (Postgres or MySQL) if you want to use those databases.
If the airflow command is not recognized (this can happen on Windows when using WSL),
ensure that ~/.local/bin is in your PATH environment variable, and add it if necessary:
PATH=$PATH:~/.local/bin
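If the line above ends up in a shell startup file, it appends the directory again for every nested shell. A minimal sketch of an idempotent variant follows; path_append is a hypothetical helper name, not a standard shell command.

```shell
# Hypothetical helper: append a directory to PATH only if it is not listed yet.
path_append() {
    case ":${PATH}:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="${PATH}:$1" ;;      # not present, append it
    esac
}

path_append "${HOME}/.local/bin"
export PATH
```

Calling path_append a second time with the same directory leaves PATH unchanged.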
Extra Packages
The basic apache-airflow PyPI package only installs what’s needed to get started.
Subpackages can be installed depending on what will be useful in your
environment. For instance, if you don’t need connectivity with Postgres,
you won’t have to go through the trouble of installing the postgres-devel
yum package, or whatever equivalent applies on the distribution you are using.
Behind the scenes, Airflow does conditional imports of operators that require these extra dependencies.
Here’s the list of the subpackages and what they enable:
Fundamentals:
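subpackage | install command | enables |
---|---|---|
all | pip install 'apache-airflow[all]' | All Airflow features known to man |
all_dbs | pip install 'apache-airflow[all_dbs]' | All databases integrations |
devel | pip install 'apache-airflow[devel]' | Minimum dev tools requirements |
devel_all | pip install 'apache-airflow[devel_all]' | All dev tools requirements |
devel_azure | pip install 'apache-airflow[devel_azure]' | Azure development requirements |
devel_ci | pip install 'apache-airflow[devel_ci]' | Development requirements used in CI |
devel_hadoop | pip install 'apache-airflow[devel_hadoop]' | Airflow + dependencies on the Hadoop stack |
doc | pip install 'apache-airflow[doc]' | Packages needed to build docs |
password | pip install 'apache-airflow[password]' | Password authentication for users |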
Apache Software:
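subpackage | install command | enables |
---|---|---|
atlas | pip install 'apache-airflow[atlas]' | Apache Atlas to use Data Lineage feature |
cassandra | pip install 'apache-airflow[cassandra]' | Cassandra related operators & hooks |
druid | pip install 'apache-airflow[druid]' | Druid related operators & hooks |
hdfs | pip install 'apache-airflow[hdfs]' | HDFS hooks and operators |
hive | pip install 'apache-airflow[hive]' | All Hive related operators |
presto | pip install 'apache-airflow[presto]' | All Presto related operators & hooks |
webhdfs | pip install 'apache-airflow[webhdfs]' | HDFS hooks and operators |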
Services:
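subpackage | install command | enables |
---|---|---|
aws | pip install 'apache-airflow[aws]' | Amazon Web Services |
azure | pip install 'apache-airflow[azure]' | Microsoft Azure |
azure_blob_storage | pip install 'apache-airflow[azure_blob_storage]' | Microsoft Azure (blob storage) |
azure_cosmos | pip install 'apache-airflow[azure_cosmos]' | Microsoft Azure (cosmos) |
azure_container_instances | pip install 'apache-airflow[azure_container_instances]' | Microsoft Azure (container instances) |
azure_data_lake | pip install 'apache-airflow[azure_data_lake]' | Microsoft Azure (data lake) |
azure_secrets | pip install 'apache-airflow[azure_secrets]' | Microsoft Azure (secrets) |
cloudant | pip install 'apache-airflow[cloudant]' | Cloudant hook |
databricks | pip install 'apache-airflow[databricks]' | Databricks hooks and operators |
datadog | pip install 'apache-airflow[datadog]' | Datadog hooks and sensors |
gcp | pip install 'apache-airflow[gcp]' | Google Cloud |
github_enterprise | pip install 'apache-airflow[github_enterprise]' | GitHub Enterprise auth backend |
google | pip install 'apache-airflow[google]' | Google Cloud (same as gcp) |
google_auth | pip install 'apache-airflow[google_auth]' | Google auth backend |
hashicorp | pip install 'apache-airflow[hashicorp]' | Hashicorp Services (Vault) |
jira | pip install 'apache-airflow[jira]' | Jira hooks and operators |
qds | pip install 'apache-airflow[qds]' | Enable QDS (Qubole Data Service) support |
salesforce | pip install 'apache-airflow[salesforce]' | Salesforce hook |
sendgrid | pip install 'apache-airflow[sendgrid]' | Send email using sendgrid |
segment | pip install 'apache-airflow[segment]' | Segment hooks and sensors |
sentry | pip install 'apache-airflow[sentry]' | |
slack | pip install 'apache-airflow[slack]' | |
snowflake | pip install 'apache-airflow[snowflake]' | Snowflake hooks and operators |
vertica | pip install 'apache-airflow[vertica]' | Vertica hook support as an Airflow backend |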
Software:
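subpackage | install command | enables |
---|---|---|
async | pip install 'apache-airflow[async]' | Async worker classes for Gunicorn |
celery | pip install 'apache-airflow[celery]' | CeleryExecutor |
dask | pip install 'apache-airflow[dask]' | DaskExecutor |
docker | pip install 'apache-airflow[docker]' | Docker hooks and operators |
elasticsearch | pip install 'apache-airflow[elasticsearch]' | Elasticsearch hooks and Log Handler |
kubernetes | pip install 'apache-airflow[kubernetes]' | Kubernetes Executor and operator |
mongo | pip install 'apache-airflow[mongo]' | Mongo hooks and operators |
mssql (deprecated) | pip install 'apache-airflow[mssql]' | Microsoft SQL Server operators and hook, support as an Airflow backend. Uses pymssql. Will be replaced by subpackage |
mysql | pip install 'apache-airflow[mysql]' | MySQL operators and hook, support as an Airflow backend. The version of MySQL server has to be 5.6.4+. The exact version upper bound depends on version of |
oracle | pip install 'apache-airflow[oracle]' | Oracle hooks and operators |
pinot | pip install 'apache-airflow[pinot]' | Pinot DB hook |
postgres | pip install 'apache-airflow[postgres]' | PostgreSQL operators and hook, support as an Airflow backend |
rabbitmq | pip install 'apache-airflow[rabbitmq]' | RabbitMQ support as a Celery backend |
redis | pip install 'apache-airflow[redis]' | Redis hooks and sensors |
samba | pip install 'apache-airflow[samba]' | |
statsd | pip install 'apache-airflow[statsd]' | Needed by StatsD metrics |
virtualenv | pip install 'apache-airflow[virtualenv]' | |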
Other:
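subpackage | install command | enables |
---|---|---|
cgroups | pip install 'apache-airflow[cgroups]' | Needed to use CgroupTaskRunner |
crypto | pip install 'apache-airflow[crypto]' | Cryptography libraries |
grpc | pip install 'apache-airflow[grpc]' | gRPC hooks and operators |
jdbc | pip install 'apache-airflow[jdbc]' | JDBC hooks and operators |
kerberos | pip install 'apache-airflow[kerberos]' | Kerberos integration for Kerberized Hadoop |
ldap | pip install 'apache-airflow[ldap]' | LDAP authentication for users |
papermill | pip install 'apache-airflow[papermill]' | Papermill hooks and operators |
ssh | pip install 'apache-airflow[ssh]' | SSH hooks and Operator |
winrm | pip install 'apache-airflow[winrm]' | WinRM hooks and operators |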
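Several of the subpackages listed above can be requested in a single pip requirement by separating them with commas inside the square brackets. The sketch below shows how such a requirement string is composed; airflow_requirement is a hypothetical helper name used only for illustration, not an Airflow or pip command.

```shell
# Hypothetical helper: compose a pip requirement from a version and a list of extras.
airflow_requirement() {
    local version="$1"
    shift
    local extras=""
    local extra
    for extra in "$@"; do
        # Join the extras with commas, without a leading comma.
        extras="${extras:+${extras},}${extra}"
    done
    # The bracket part is omitted entirely when no extras were given.
    echo "apache-airflow${extras:+[${extras}]}==${version}"
}

airflow_requirement 1.10.15 postgres google
# prints: apache-airflow[postgres,google]==1.10.15
```

The resulting string is what gets passed, quoted, to pip install together with the --constraint option.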
Initializing Airflow Database
Airflow requires a database to be initialized before you can run tasks. If you’re just experimenting and learning Airflow, you can stick with the default SQLite option. If you don’t want to use SQLite, take a look at Initializing a Database Backend to set up a different database.
After configuration, you’ll need to initialize the database before you can run tasks:
airflow db init