airflow.providers.cncf.kubernetes.operators.spark_kubernetes

Module Contents

Classes

SparkKubernetesOperator

Creates sparkApplication object in kubernetes cluster.

class airflow.providers.cncf.kubernetes.operators.spark_kubernetes.SparkKubernetesOperator(*, image=None, code_path=None, namespace='default', name=None, application_file=None, template_spec=None, get_logs=True, do_xcom_push=False, success_run_history_limit=1, startup_timeout_seconds=600, log_events_on_failure=False, reattach_on_restart=True, delete_on_termination=True, kubernetes_conn_id='kubernetes_default', **kwargs)[source]

Bases: airflow.providers.cncf.kubernetes.operators.pod.KubernetesPodOperator

Creates sparkApplication object in kubernetes cluster.

See also

For more detail about Spark Application Object have a look at the reference: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/v1beta2-1.3.3-3.1.1/docs/api-docs.md#sparkapplication

Parameters
  • image (str | None) – Docker image you wish to launch. Defaults to hub.docker.com,

  • code_path (str | None) – path to the spark code in image,

  • namespace (str) – kubernetes namespace to put sparkApplication

  • name (str | None) – name of the pod in which the task will run, will be used (plus a random suffix if random_name_suffix is True) to generate a pod id (DNS-1123 subdomain, containing only [a-z0-9.-]).

  • application_file (str | None) – filepath to kubernetes custom_resource_definition of sparkApplication

  • template_spec – kubernetes sparkApplication specification

  • get_logs (bool) – get the stdout of the container as logs of the tasks.

  • do_xcom_push (bool) – If True, the content of the file /airflow/xcom/return.json in the container will also be pushed to an XCom when the container completes.

  • success_run_history_limit (int) – Number of past successful runs of the application to keep.

  • startup_timeout_seconds – timeout in seconds to startup the pod.

  • log_events_on_failure (bool) – Log the pod’s events if a failure occurs

  • reattach_on_restart (bool) – if the scheduler dies while the pod is running, reattach and monitor

  • delete_on_termination (bool) – What to do when the pod reaches its final state, or the execution is interrupted. If True (default), delete the pod; if False, leave the pod.

  • kubernetes_conn_id (str) – the connection to Kubernetes cluster

property template_body[source]

Templated body for CustomObjectLauncher.

template_fields = ['application_file', 'namespace', 'template_spec'][source]
template_fields_renderers[source]
template_ext = ('yaml', 'yml', 'json')[source]
ui_color = '#f4a460'[source]
BASE_CONTAINER_NAME = 'spark-kubernetes-driver'[source]
manage_template_specs()[source]
create_job_name()[source]
static create_labels_for_pod(context=None, include_try_number=True)[source]

Generate labels for the pod to track the pod in case of Operator crash.

Parameters
  • include_try_number (bool) – add try number to labels

  • context (dict | None) – task context provided by airflow DAG

Returns

dict.

Return type

dict

pod_manager()[source]
find_spark_job(context)[source]
get_or_create_spark_crd(launcher, context)[source]
process_pod_deletion(pod, *, reraise=True)[source]
hook()[source]
client()[source]
custom_obj_api()[source]
execute(context)[source]

Based on the deferrable parameter runs the pod asynchronously or synchronously.

on_kill()[source]

Override this method to clean up subprocesses when a task instance gets killed.

Any use of the threading, subprocess or multiprocessing module within an operator needs to be cleaned up, or it will leave ghost processes behind.

patch_already_checked(pod, *, reraise=True)[source]

Add an “already checked” annotation to ensure we don’t reattach on retries.

dry_run()[source]

Print out the spark job that would be created by this operator.

Was this entry helpful?