Apache Livy Operators

Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.

LivyOperator

This operator wraps the Apache Livy batch REST API, allowing to submit a Spark application to the underlying cluster.

tests/system/apache/livy/example_livy.py[source]

    livy_java_task = LivyOperator(
        task_id="pi_java_task",
        file="/spark-examples.jar",
        num_executors=1,
        conf={
            "spark.shuffle.compress": "false",
        },
        class_name="org.apache.spark.examples.SparkPi",
    )

    livy_python_task = LivyOperator(task_id="pi_python_task", file="/pi.py", polling_interval=60)

    livy_java_task >> livy_python_task

You can also run this operator in deferrable mode by setting the parameter deferrable to True. This will lead to efficient utilization of Airflow workers as polling for job status happens on the triggerer asynchronously. Note that this will need triggerer to be available on your Airflow deployment.

tests/system/apache/livy/example_livy.py[source]

    livy_java_task_deferrable = LivyOperator(
        task_id="livy_java_task_deferrable",
        file="/spark-examples.jar",
        num_executors=1,
        conf={
            "spark.shuffle.compress": "false",
        },
        class_name="org.apache.spark.examples.SparkPi",
        deferrable=True,
    )

    livy_python_task_deferrable = LivyOperator(
        task_id="livy_python_task_deferrable", file="/pi.py", polling_interval=60, deferrable=True
    )

    livy_java_task_deferrable >> livy_python_task_deferrable

Reference

For further information, look at Apache Livy.

Was this entry helpful?