airflow.providers.amazon.aws.hooks.emr

Module Contents

Classes

EmrHook

Interact with Amazon Elastic MapReduce Service.

EmrServerlessHook

Interact with EMR Serverless API.

EmrContainerHook

Interact with AWS EMR Virtual Cluster to run, poll jobs and return job status

class airflow.providers.amazon.aws.hooks.emr.EmrHook(emr_conn_id=default_conn_name, *args, **kwargs)[source]

Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with Amazon Elastic MapReduce Service.

Parameters

emr_conn_id (str | None) – Amazon Elastic MapReduce Connection. This attribute is only necessary when using the create_job_flow() method.

Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.

See also

AwsBaseHook

conn_name_attr = emr_conn_id[source]
default_conn_name = emr_default[source]
conn_type = emr[source]
hook_name = Amazon Elastic MapReduce[source]
get_cluster_id_by_name(emr_cluster_name, cluster_states)[source]

Fetch id of EMR cluster with given name and (optional) states. Will return only if single id is found.

Parameters
  • emr_cluster_name (str) – Name of a cluster to find

  • cluster_states (list[str]) – State(s) of cluster to find

Returns

id of the EMR cluster

Return type

str | None

create_job_flow(job_flow_overrides)[source]

Create and start running a new cluster (job flow).

This method uses EmrHook.emr_conn_id to receive the initial Amazon EMR cluster configuration. If EmrHook.emr_conn_id is empty or the connection does not exist, then an empty initial configuration is used.

Parameters

job_flow_overrides (dict[str, Any]) – Is used to overwrite the parameters in the initial Amazon EMR configuration cluster. The resulting configuration will be used in the boto3 emr client run_job_flow method.

add_job_flow_steps(job_flow_id, steps=None, wait_for_completion=False, waiter_delay=None, waiter_max_attempts=None, execution_role_arn=None)[source]

Add new steps to a running cluster.

Parameters
  • job_flow_id (str) – The id of the job flow to which the steps are being added

  • steps (list[dict] | str | None) – A list of the steps to be executed by the job flow

  • wait_for_completion (bool) – If True, wait for the steps to be completed. Default is False

  • waiter_delay (int | None) – The amount of time in seconds to wait between attempts. Default is 5

  • waiter_max_attempts (int | None) – The maximum number of attempts to be made. Default is 100

  • execution_role_arn (str | None) – The ARN of the runtime role for a step on the cluster.

test_connection()[source]

Return failed state for test Amazon Elastic MapReduce Connection (untestable).

We need to overwrite this method because this hook is based on AwsGenericHook, otherwise it will try to test connection to AWS STS by using the default boto3 credential strategy.

static get_ui_field_behaviour()[source]

Returns custom UI field behaviour for Amazon Elastic MapReduce Connection.

class airflow.providers.amazon.aws.hooks.emr.EmrServerlessHook(*args, **kwargs)[source]

Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with EMR Serverless API.

Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.

See also

AwsBaseHook

JOB_INTERMEDIATE_STATES[source]
JOB_FAILURE_STATES[source]
JOB_SUCCESS_STATES[source]
JOB_TERMINAL_STATES[source]
APPLICATION_INTERMEDIATE_STATES[source]
APPLICATION_FAILURE_STATES[source]
APPLICATION_SUCCESS_STATES[source]
conn()[source]

Get the underlying boto3 EmrServerlessAPIService client (cached)

waiter(get_state_callable, get_state_args, parse_response, desired_state, failure_states, object_type, action, countdown=25 * 60, check_interval_seconds=60)[source]

Will run the sensor until it turns True.

Parameters
  • get_state_callable (Callable) – A callable to run until it returns True

  • get_state_args (dict) – Arguments to pass to get_state_callable

  • parse_response (list) – Dictionary keys to extract state from response of get_state_callable

  • desired_state (set) – Wait until the getter returns this value

  • failure_states (set) – A set of states which indicate failure and should throw an exception if any are reached before the desired_state

  • object_type (str) – Used for the reporting string. What are you waiting for? (application, job, etc)

  • action (str) – Used for the reporting string. What action are you waiting for? (created, deleted, etc)

  • countdown (int) – Total amount of time the waiter should wait for the desired state before timing out (in seconds). Defaults to 25 * 60 seconds.

  • check_interval_seconds (int) – Number of seconds waiter should wait before attempting to retry get_state_callable. Defaults to 60 seconds.

get_state(response, keys)[source]
class airflow.providers.amazon.aws.hooks.emr.EmrContainerHook(*args, virtual_cluster_id=None, **kwargs)[source]

Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with AWS EMR Virtual Cluster to run, poll jobs and return job status Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.

See also

AwsBaseHook

Parameters

virtual_cluster_id (str | None) – Cluster ID of the EMR on EKS virtual cluster

INTERMEDIATE_STATES = ['PENDING', 'SUBMITTED', 'RUNNING'][source]
FAILURE_STATES = ['FAILED', 'CANCELLED', 'CANCEL_PENDING'][source]
SUCCESS_STATES = ['COMPLETED'][source]
TERMINAL_STATES = ['COMPLETED', 'FAILED', 'CANCELLED', 'CANCEL_PENDING'][source]
create_emr_on_eks_cluster(virtual_cluster_name, eks_cluster_name, eks_namespace, tags=None)[source]
submit_job(name, execution_role_arn, release_label, job_driver, configuration_overrides=None, client_request_token=None, tags=None)[source]

Submit a job to the EMR Containers API and return the job ID. A job run is a unit of work, such as a Spark jar, PySpark script, or SparkSQL query, that you submit to Amazon EMR on EKS. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr-containers.html#EMRContainers.Client.start_job_run

Parameters
  • name (str) – The name of the job run.

  • execution_role_arn (str) – The IAM role ARN associated with the job run.

  • release_label (str) – The Amazon EMR release version to use for the job run.

  • job_driver (dict) – Job configuration details, e.g. the Spark job parameters.

  • configuration_overrides (dict | None) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.

  • client_request_token (str | None) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started.

  • tags (dict | None) – The tags assigned to job runs.

Returns

Job ID

Return type

str

get_job_failure_reason(job_id)[source]

Fetch the reason for a job failure (e.g. error message). Returns None or reason string.

Parameters

job_id (str) – Id of submitted job run

Returns

str

Return type

str | None

check_query_status(job_id)[source]

Fetch the status of submitted job run. Returns None or one of valid query states.

See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr-containers.html#EMRContainers.Client.describe_job_run

Parameters

job_id (str) – Id of submitted job run

Returns

str

Return type

str | None

poll_query_status(job_id, max_tries=None, poll_interval=30, max_polling_attempts=None)[source]

Poll the status of submitted job run until query state reaches final state. Returns one of the final states.

Parameters
  • job_id (str) – Id of submitted job run

  • max_tries (int | None) – Deprecated - Use max_polling_attempts instead

  • poll_interval (int) – Time (in seconds) to wait between calls to check query status on EMR

  • max_polling_attempts (int | None) – Number of times to poll for query state before function exits

Returns

str

Return type

str | None

stop_query(job_id)[source]

Cancel the submitted job_run

Parameters

job_id (str) – Id of submitted job_run

Returns

dict

Return type

dict

Was this entry helpful?