airflow.providers.amazon.aws.hooks.emr
¶
Module Contents¶
Classes¶
Interact with Amazon Elastic MapReduce Service (EMR). |
|
Interact with Amazon EMR Serverless. |
|
Interact with Amazon EMR Containers (Amazon EMR on EKS). |
- class airflow.providers.amazon.aws.hooks.emr.EmrHook(emr_conn_id=default_conn_name, *args, **kwargs)[source]¶
Bases:
airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook
Interact with Amazon Elastic MapReduce Service (EMR).
Provide thick wrapper around
boto3.client("emr")
.- Parameters
emr_conn_id (str | None) – Amazon Elastic MapReduce Connection. This attribute is only necessary when using the
airflow.providers.amazon.aws.hooks.emr.EmrHook.create_job_flow()
.
Additional arguments (such as
aws_conn_id
) may be specified and are passed down to the underlying AwsBaseHook.See also
- get_cluster_id_by_name(emr_cluster_name, cluster_states)[source]¶
Fetch id of EMR cluster with given name and (optional) states; returns only if single id is found.
See also
- create_job_flow(job_flow_overrides)[source]¶
Create and start running a new cluster (job flow).
See also
This method uses
EmrHook.emr_conn_id
to receive the initial Amazon EMR cluster configuration. IfEmrHook.emr_conn_id
is empty or the connection does not exist, then an empty initial configuration is used.- Parameters
job_flow_overrides (dict[str, Any]) – Is used to overwrite the parameters in the initial Amazon EMR configuration cluster. The resulting configuration will be used in the
EMR.Client.run_job_flow()
.
- add_job_flow_steps(job_flow_id, steps=None, wait_for_completion=False, waiter_delay=None, waiter_max_attempts=None, execution_role_arn=None)[source]¶
Add new steps to a running cluster.
See also
- Parameters
job_flow_id (str) – The id of the job flow to which the steps are being added
steps (list[dict] | str | None) – A list of the steps to be executed by the job flow
wait_for_completion (bool) – If True, wait for the steps to be completed. Default is False
waiter_delay (int | None) – The amount of time in seconds to wait between attempts. Default is 5
waiter_max_attempts (int | None) – The maximum number of attempts to be made. Default is 100
execution_role_arn (str | None) – The ARN of the runtime role for a step on the cluster.
- test_connection()[source]¶
Return failed state for test Amazon Elastic MapReduce Connection (untestable).
We need to overwrite this method because this hook is based on
AwsGenericHook
, otherwise it will try to test connection to AWS STS by using the default boto3 credential strategy.
- class airflow.providers.amazon.aws.hooks.emr.EmrServerlessHook(*args, **kwargs)[source]¶
Bases:
airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook
Interact with Amazon EMR Serverless.
Provide thin wrapper around
boto3.client("emr-serverless")
.Additional arguments (such as
aws_conn_id
) may be specified and are passed down to the underlying AwsBaseHook.- cancel_running_jobs(application_id, waiter_config=None, wait_for_completion=True)[source]¶
Cancel jobs in an intermediate state, and return the number of cancelled jobs.
If wait_for_completion is True, then the method will wait until all jobs are cancelled before returning.
Note: if new jobs are triggered while this operation is ongoing, it’s going to time out and return an error.
- class airflow.providers.amazon.aws.hooks.emr.EmrContainerHook(*args, virtual_cluster_id=None, **kwargs)[source]¶
Bases:
airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook
Interact with Amazon EMR Containers (Amazon EMR on EKS).
Provide thick wrapper around
boto3.client("emr-containers")
.- Parameters
virtual_cluster_id (str | None) – Cluster ID of the EMR on EKS virtual cluster
Additional arguments (such as
aws_conn_id
) may be specified and are passed down to the underlying AwsBaseHook.- create_emr_on_eks_cluster(virtual_cluster_name, eks_cluster_name, eks_namespace, tags=None)[source]¶
- submit_job(name, execution_role_arn, release_label, job_driver, configuration_overrides=None, client_request_token=None, tags=None, retry_max_attempts=None)[source]¶
Submit a job to the EMR Containers API and return the job ID.
A job run is a unit of work, such as a Spark jar, PySpark script, or SparkSQL query, that you submit to Amazon EMR on EKS.
See also
- Parameters
name (str) – The name of the job run.
execution_role_arn (str) – The IAM role ARN associated with the job run.
release_label (str) – The Amazon EMR release version to use for the job run.
job_driver (dict) – Job configuration details, e.g. the Spark job parameters.
configuration_overrides (dict | None) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.
client_request_token (str | None) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started.
tags (dict | None) – The tags assigned to job runs.
retry_max_attempts (int | None) – The maximum number of attempts on the job’s driver.
- Returns
The ID of the job run request.
- Return type
- get_job_failure_reason(job_id)[source]¶
Fetch the reason for a job failure (e.g. error message). Returns None or reason string.
- Parameters
job_id (str) – The ID of the job run request.
- check_query_status(job_id)[source]¶
Fetch the status of submitted job run. Returns None or one of valid query states.
- Parameters
job_id (str) – The ID of the job run request.