Troubleshooting¶
Obscure task failures¶
Task state changed externally¶
There are many potential causes for a task’s state to be changed by a component other than the executor, which might cause some confusion when reviewing task instance or scheduler logs.
Below are some example scenarios that could cause a task’s state to be changed by a component other than the executor:
- If a task’s DAG failed to parse on the worker, the scheduler may mark the task as failed. If so, consider increasing core.dagbag_import_timeout and core.dag_file_processor_timeout (see the configuration sketch after this list).
- The scheduler will mark a task as failed if the task has been queued for longer than scheduler.task_queued_timeout.
- If a task becomes a zombie, the scheduler will mark it as failed.
- A user marked the task as successful or failed in the Airflow UI.
- An external script or process used the Airflow REST API to change the state of a task (see the example after this list).
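If parsing or queueing timeouts are the suspected cause, the relevant settings can be raised in airflow.cfg. A minimal sketch follows; the values are illustrative, not recommendations, and should be tuned for your deployment:

```ini
[core]
# How long (seconds) a DagBag import may take before the import times out.
dagbag_import_timeout = 120.0
# How long (seconds) the DAG file processor may run before timing out.
dag_file_processor_timeout = 180

[scheduler]
# Tasks queued for longer than this (seconds) are failed by the scheduler.
task_queued_timeout = 600.0
```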
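For the last scenario, an external process can change a task instance’s state through the stable REST API. A minimal sketch, assuming Airflow 2.5+ (where the single-task-instance PATCH endpoint is available), basic auth enabled, and hypothetical DAG, run, and task IDs:

```python
import requests

# Hypothetical IDs, host, and credentials; the call uses the stable REST
# API's "update task instance" PATCH endpoint (Airflow 2.5+).
resp = requests.patch(
    "http://localhost:8080/api/v1/dags/my_dag/dagRuns/my_run/taskInstances/my_task",
    json={"dry_run": False, "new_state": "failed"},
    auth=("admin", "admin"),
)
resp.raise_for_status()
# The task instance's state now reflects a change made outside the executor.
print(resp.json())
```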
LocalTaskJob killed¶
Sometimes, Airflow or some adjacent system will kill a task instance’s LocalTaskJob, causing the task instance to fail.
Here are some examples of what could cause such an event:
- A DAG run timeout, specified by dagrun_timeout in the DAG’s definition (see the sketch after this list).
- An Airflow worker running out of memory. Usually, Airflow workers that run out of memory receive a SIGKILL and are marked as zombies and failed by the scheduler. However, in some scenarios, Airflow kills the task before that happens.
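For the first case, dagrun_timeout is set in the DAG definition; a run that exceeds it is failed, and its still-running task instances are killed. A minimal sketch with a hypothetical DAG:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical DAG: if a run takes longer than 30 minutes, Airflow fails
# the run and kills its still-running task instances.
with DAG(
    dag_id="example_with_dagrun_timeout",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    dagrun_timeout=timedelta(minutes=30),  # illustrative value
    catchup=False,
):
    EmptyOperator(task_id="placeholder")
```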