I’m happy to announce that Apache Airflow 2.9.0 has been released! This time around we have new features for data-aware scheduling and a bunch of UI-related improvements.

Apache Airflow 2.9.0 contains over 550 commits, which include 38 new features, 70 improvements, 31 bug fixes, and 18 documentation changes.

Details:

📦 PyPI: https://pypi.org/project/apache-airflow/2.9.0/
📚 Docs: https://airflow.apache.org/docs/apache-airflow/2.9.0/
🛠 Release Notes: https://airflow.apache.org/docs/apache-airflow/2.9.0/release_notes.html
🐳 Docker Image: “docker pull apache/airflow:2.9.0”
🚏 Constraints: https://github.com/apache/airflow/tree/constraints-2.9.0

Airflow 2.9.0 is also the first release that supports Python 3.12. Note, however, that Pendulum 2 does not support Python 3.12, so you’ll need to use Pendulum 3 if you upgrade.
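
For example, a typical constrained install on Python 3.12 looks like the following; adjust the Python version at the end of the constraints URL to match your interpreter:

pip install "apache-airflow==2.9.0" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.9.0/constraints-3.12.txt"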

New data-aware scheduling options

Logical operators and conditional expressions for DAG scheduling

When Datasets were added in Airflow 2.4, DAGs only supported scheduling on logical AND combinations of Datasets. Put simply, you could schedule against more than one Dataset, but a DAG run was only created once all of the Datasets had been updated since the last run. Now, in Airflow 2.9, we support logical OR and even arbitrary combinations of AND and OR.

As an example, you can schedule a DAG to run whenever dataset_1 or dataset_2 is updated:

with DAG(schedule=(dataset_1 | dataset_2), ...):
    ...

You can have arbitrary combinations:

with DAG(schedule=((dataset_1 | dataset_2) & dataset_3), ...):
    ...
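
For reference, here is a minimal, self-contained sketch of that second example; the dataset URIs and the dag_id are placeholders:

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

# Placeholder URIs; use whatever dataset URIs your producer DAGs update
dataset_1 = Dataset("s3://example-bucket/dataset_1.parquet")
dataset_2 = Dataset("s3://example-bucket/dataset_2.parquet")
dataset_3 = Dataset("s3://example-bucket/dataset_3.parquet")

# A run is created once the condition is satisfied, i.e. dataset_3 plus
# at least one of dataset_1 or dataset_2 have been updated
with DAG(
    dag_id="conditional_dataset_consumer",
    schedule=((dataset_1 | dataset_2) & dataset_3),
):
    EmptyOperator(task_id="process_datasets")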

You can read more about this new functionality in the data-aware scheduling docs.

Combining Dataset and Time-Based Schedules

Airflow 2.9 comes with a new timetable, DatasetOrTimeSchedule, that allows you to schedule DAGs based on both dataset events and a timetable. Now you have the best of both worlds.

For example, to run whenever dataset_1 is updated, and also every day at midnight UTC:

from airflow.timetables.datasets import DatasetOrTimeSchedule
from airflow.timetables.trigger import CronTriggerTimetable

with DAG(
    schedule=DatasetOrTimeSchedule(
        timetable=CronTriggerTimetable("0 0 * * *", timezone="UTC"),
        datasets=[dataset_1],
    ),
    ...
):
    ...

Dataset Event REST API endpoints

New REST API endpoints have been introduced for creating, listing, and deleting dataset events. This makes it possible for external systems to notify Airflow about dataset updates and unlocks management of event queues for more sophisticated use cases.
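
As a rough sketch of what an external notification can look like, assuming your deployment exposes the stable REST API with basic auth enabled (the host, credentials, and dataset URI below are placeholders):

import requests

# Placeholder host, credentials, and dataset URI
resp = requests.post(
    "http://localhost:8080/api/v1/datasets/events",
    auth=("username", "password"),
    json={
        "dataset_uri": "s3://example-bucket/dataset_1.parquet",
        "extra": {"source": "external-system"},
    },
)
resp.raise_for_status()
print(resp.json())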

See the Dataset API docs for more details.

Dataset UI Enhancements

The DAG’s graph view has been enhanced to display both the datasets the DAG is scheduled on and those referenced in task outlets, providing a comprehensive overview of the datasets consumed and produced by the DAG.

Datasets in the graph view

The main datasets view now allows you to filter for both DAGs and datasets:

Dataset view filtering

When viewing a Dataset, you can now create a manual dataset event through the UI by clicking the play button shown in the top right here:

Creating manual Dataset event

Custom names for Dynamic Task Mapping

Gone are the days of clicking into index numbers and hunting for the dynamically mapped task you wanted to see! This has been a requested feature ever since task mapping was added in Airflow 2.3, and we are happy it’s finally here.

You can provide a map_index_template to mapped operators:

from airflow.operators.bash import BashOperator

BashOperator.partial(
    task_id="hello",
    bash_command="echo Hello $NAME",
    map_index_template="{{ task.env['NAME'] }}",
).expand(
    env=[{"NAME": "John"}, {"NAME": "Bob"}, {"NAME": "Fred"}],
)

That template will be rendered after each task finishes running and will populate the name in the UI:

Dynamic Task Mapping custom names
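
The same works for TaskFlow tasks. One pattern is to render a value you push into the task’s context at runtime; the sketch below assumes that pattern and uses illustrative names:

from airflow.decorators import task
from airflow.operators.python import get_current_context

@task(map_index_template="{{ greeting }}")
def hello(name: str):
    # Push the value we want as the map index name into the context;
    # the template is rendered after the task finishes running
    context = get_current_context()
    context["greeting"] = f"Hello {name}"
    print(context["greeting"])

# Inside a DAG, expand as before:
hello.expand(name=["John", "Bob", "Fred"])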

More details on this, including a TaskFlow example, are available in the dynamic task mapping docs.

Object Storage as XCom Backend

You can now configure Object Storage to be used as an XCom backend, making it much easier to get XCom results into an object store. Deployment managers can configure the object store of their choice, a size threshold that routes small results to the Airflow metadata database and larger ones to the object store, and even a compression method to apply before the data is stored.

The following configuration will store anything above 1MB in S3 and will compress it using gzip:

[core]
xcom_backend = airflow.providers.common.io.xcom.backend.XComObjectStoreBackend

[common.io]
xcom_objectstorage_path = s3://conn_id@mybucket/key
xcom_objectstorage_threshold = 1048576
xcom_objectstorage_compression = gzip
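
Nothing changes for DAG authors; XComs are written and read as usual, and the backend decides where each value lives. A hypothetical task illustrating the routing:

from airflow.decorators import task

@task
def produce_report():
    # The return value becomes an XCom as usual. With the backend configured
    # above, payloads over the 1 MiB threshold are written to the object store
    # (gzip-compressed); smaller ones stay in the metadata database.
    return {"rows": [{"id": i, "value": i * 2} for i in range(100_000)]}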

See the docs on the object storage xcom backend for more details.

Display names for DAGs and Tasks

Get your emojis ready! You can now set a display name for DAGs and tasks, separate from the dag_id and task_id. This allows you to have localized display names in the UI, or just use a bunch of emojis.

Using dag_display_name and task_display_name, you can break away from the ASCII handcuffs:

with DAG("not_a_fun_dag_id", dag_display_name="📣 Best DAG ever 🎉", ...):
    BashOperator(task_id="some_task", task_display_name="🥳 Fun task!", ...)

Display names for DAGs and tasks

Task log grouping

Airflow now has support for arbitrary grouping of task logs.

By default, pre-execute and post-execute logs are grouped and collapsed, making it easier to see your task logs:

Pre and post execute logs are grouped

You can also use this feature in your task code to make your logs easier to follow:

from airflow.decorators import task

@task
def big_hello():
    print("::group::Setup our big Hello")
    greeting = ""
    for c in "Hello Airflow 2.9":
        greeting += c
        print(f"Adding {c} to our greeting. Current greeting: {greeting}")
    print("::endgroup::")
    print(greeting)

That custom group is collapsed by default:

Custom log grouping collapsed by default

And it can be expanded if you want to dig into the details:

Custom log grouping expanded

UI Modernization

In addition to all the UI improvements mentioned above, we have a bunch more improvements in Airflow 2.9!

The rest of the DAG-level views have been moved into React and the grid view interface, allowing for a more cohesive experience. This includes the calendar, task duration, run duration (which replaces landing times), and the audit log. These weren’t just “moved”; each was improved upon as well.

Here is the new run duration view, which replaces landing times. Users can toggle between landing times and simple run duration:

Run duration

And here is the new task duration view. Users can toggle queued time on and off and see the median value across the displayed runs:

Task duration

Additional new features

There are far too many other new features to list in full here; check out the release notes for the complete list.

Contributors

Thanks to everyone who contributed to this release, including Amogh Desai, Andrey Anshin, Brent Bovenzi, Daniel Standish, Ephraim Anierobi, Hussein Awala, Jarek Potiuk, Jed Cunningham, Jens Scheffler, Tzu-ping Chung, Vincent Beck, Wei Lee, and over 120 others!

I’d especially like to thank our release manager, Ephraim, for getting this release out the door.

I hope you enjoy using Apache Airflow 2.9.0!
