Apache Druid Operators¶
Prerequisite¶
To use DruidOperator, you must configure a Druid Connection first.
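A Druid Connection can be defined like any other Airflow connection, for example through an environment variable. The following is a minimal sketch, not the only way to do it: the host and port are hypothetical, it assumes the connection ID druid_ingest_default (the operator's default), and it sets the overlord's task endpoint via the endpoint extra:

import os

# Airflow reads connections from environment variables named AIRFLOW_CONN_<CONN_ID>.
# The overlord host and port below are hypothetical placeholders; the "endpoint"
# extra (URL-encoded query parameter) is the path used when posting index jobs.
os.environ["AIRFLOW_CONN_DRUID_INGEST_DEFAULT"] = (
    "http://druid-overlord.example.com:8081?endpoint=druid%2Findexer%2Fv1%2Ftask"
)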
DruidOperator¶
To submit a task directly to Druid, you need to provide the filepath to the Druid index specification json_index_file, and the connection ID of the Druid overlord druid_ingest_conn_id, which accepts index jobs, in Airflow Connections. In addition, you can provide the ingestion type ingestion_type to determine whether the job is batch ingestion or SQL-based ingestion; a sketch follows this paragraph. Example content of a Druid ingestion specification is shown below. For parameter definitions, take a look at DruidOperator.
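For instance, SQL-based ingestion can be selected with the IngestionType enum. This is a minimal sketch, assuming a provider version that ships IngestionType in the Druid hook module; the task_id and the sql_based_ingestion.json spec file are hypothetical:

from airflow.providers.apache.druid.hooks.druid import IngestionType
from airflow.providers.apache.druid.operators.druid import DruidOperator

# Submit an MSQ (SQL-based) ingestion job instead of the default batch ingestion.
# "sql_based_ingestion.json" is a hypothetical spec file name.
submit_sql_job = DruidOperator(
    task_id="druid_sql_submit_job",
    json_index_file="sql_based_ingestion.json",
    ingestion_type=IngestionType.MSQ,
)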
Using the operator¶
from airflow.providers.apache.druid.operators.druid import DruidOperator

submit_job = DruidOperator(task_id="druid_submit_job", json_index_file="json_index.json")

# Example content of json_index.json:
JSON_INDEX_STR = """
{
    "type": "index_hadoop",
    "datasource": "datasource_prd",
    "spec": {
        "dataSchema": {
            "granularitySpec": {
                "intervals": ["2021-09-01/2021-09-02"]
            }
        }
    }
}
"""
Reference¶
For more information, please refer to the Apache Druid Ingestion spec reference.