airflow.providers.apache.spark.operators.spark_submit

Module Contents

Classes

SparkSubmitOperator

Wrap the spark-submit binary to kick off a spark-submit job; requires "spark-submit" binary in the PATH.

class airflow.providers.apache.spark.operators.spark_submit.SparkSubmitOperator(*, application='', conf=None, conn_id='spark_default', files=None, py_files=None, archives=None, driver_class_path=None, jars=None, java_class=None, packages=None, exclude_packages=None, repositories=None, total_executor_cores=None, executor_cores=None, executor_memory=None, driver_memory=None, keytab=None, principal=None, proxy_user=None, name='arrow-spark', num_executors=None, status_poll_interval=1, application_args=None, env_vars=None, verbose=False, spark_binary=None, properties_file=None, queue=None, deploy_mode=None, use_krb5ccache=False, **kwargs)

Bases: airflow.models.BaseOperator

Wrap the spark-submit binary to kick off a spark-submit job; requires “spark-submit” binary in the PATH.

See also

For more information on how to use this operator, take a look at the guide: SparkSubmitOperator

Parameters
  • application (str) – The application submitted as a job; either a jar or a py file. (templated)

  • conf (dict[str, Any] | None) – Arbitrary Spark configuration properties (templated)

  • conn_id (str) – The Spark connection id as configured in Airflow administration. When an invalid conn_id is supplied, it will default to yarn.

  • files (str | None) – Comma-separated list of additional files to upload to the executors running the job; the files are placed in the working directory of each executor (for example, serialized objects). (templated)

  • py_files (str | None) – Additional Python files used by the job; can be .zip, .egg or .py. (templated)

  • jars (str | None) – Additional jars to upload and place on the executor classpath. (templated)

  • driver_class_path (str | None) – Additional, driver-specific, classpath settings. (templated)

  • java_class (str | None) – The main class of the Java application.

  • packages (str | None) – Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. (templated)

  • exclude_packages (str | None) – Comma-separated list of Maven coordinates of jars to exclude while resolving the dependencies provided in ‘packages’. (templated)

  • repositories (str | None) – Comma-separated list of additional remote repositories to search for the Maven coordinates given with ‘packages’.

  • total_executor_cores (int | None) – (Standalone & Mesos only) Total cores for all executors (Default: all the available cores on the worker)

  • executor_cores (int | None) – (Standalone & YARN only) Number of cores per executor (Default: 2)

  • executor_memory (str | None) – Memory per executor (e.g. 1000M, 2G) (Default: 1G)

  • driver_memory (str | None) – Memory allocated to the driver (e.g. 1000M, 2G) (Default: 1G)

  • keytab (str | None) – Full path to the file that contains the keytab (templated)

  • principal (str | None) – The name of the kerberos principal used for keytab (templated)

  • proxy_user (str | None) – User to impersonate when submitting the application (templated)

  • name (str) – Name of the job (default: arrow-spark). (templated)

  • num_executors (int | None) – Number of executors to launch

  • status_poll_interval (int) – Seconds to wait between polls of driver status in cluster mode (Default: 1)

  • application_args (list[Any] | None) – Arguments for the application being submitted (templated)

  • env_vars (dict[str, Any] | None) – Environment variables for spark-submit; also supported in yarn and k8s modes. (templated)

  • verbose (bool) – Whether to pass the verbose flag to the spark-submit process for debugging.

  • spark_binary (str | None) – The command to use for spark-submit. Some distros may use spark2-submit or spark3-submit. (will overwrite any spark_binary defined in the connection’s extra JSON)

  • properties_file (str | None) – Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.

  • queue (str | None) – The name of the YARN queue to which the application is submitted. (will overwrite any yarn queue defined in the connection’s extra JSON)

  • deploy_mode (str | None) – Whether to deploy your driver on the worker nodes (cluster) or locally as a client. (will overwrite any deployment mode defined in the connection’s extra JSON)

  • use_krb5ccache (bool) – If True, configure Spark to use the Kerberos ticket cache instead of relying on a keytab for login.
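
A minimal usage sketch is shown below. The DAG id, application path, configuration values, and application arguments are illustrative assumptions rather than values defined by this operator (it also assumes Airflow 2.4+ for the schedule argument):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="example_spark_submit",  # hypothetical DAG id
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
    ):
        submit_job = SparkSubmitOperator(
            task_id="submit_job",
            application="/opt/spark/apps/etl_job.py",  # assumed path to the job file
            conn_id="spark_default",
            conf={"spark.executor.memoryOverhead": "512m"},  # example Spark property
            executor_cores=2,
            executor_memory="2G",
            num_executors=4,
            name="etl-job",
            application_args=["--run-date", "{{ ds }}"],  # rendered by Jinja at runtime
            verbose=True,
        )

Because application, conf, application_args and several other parameters are templated (see template_fields below), Jinja expressions such as {{ ds }} are rendered just before the job is submitted.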

template_fields: Sequence[str] = ('application', 'conf', 'files', 'py_files', 'jars', 'driver_class_path', 'packages',...
ui_color
execute(context)

Call the SparkSubmitHook to run the provided spark job.
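
The operator delegates the actual submission to SparkSubmitHook. The sketch below shows roughly equivalent direct hook usage; it is an illustration of typical usage, not a copy of the operator’s implementation, and the path and configuration values are hypothetical:

    from airflow.providers.apache.spark.hooks.spark_submit import SparkSubmitHook

    # Build a hook from the same kind of arguments the operator accepts and
    # submit the application; this mirrors what execute() does for a task run.
    hook = SparkSubmitHook(
        conn_id="spark_default",
        conf={"spark.executor.memoryOverhead": "512m"},  # example Spark property
        name="etl-job",
    )
    hook.submit(application="/opt/spark/apps/etl_job.py")  # assumed path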

on_kill()

Override this method to clean up subprocesses when a task instance gets killed.

Any use of the threading, subprocess or multiprocessing module within an operator needs to be cleaned up, or it will leave ghost processes behind.
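
For operators you implement yourself, the following is a minimal sketch of the kind of cleanup on_kill() is meant to perform; the subprocess handling here is illustrative and not taken from this provider:

    import signal
    import subprocess

    from airflow.models import BaseOperator


    class MySubprocessOperator(BaseOperator):
        """Illustrative operator that launches a child process it must clean up."""

        def execute(self, context):
            # Hypothetical long-running child process started by the task.
            self._proc = subprocess.Popen(["sleep", "3600"])
            self._proc.wait()

        def on_kill(self):
            # Terminate the child so no ghost process survives the task being killed.
            proc = getattr(self, "_proc", None)
            if proc is not None and proc.poll() is None:
                proc.send_signal(signal.SIGTERM)
                proc.wait(timeout=30)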
