airflow.providers.google.cloud.hooks.dataproc

This module contains a Google Cloud Dataproc hook.

Module Contents

Classes

DataProcJobBuilder

A helper class for building Dataproc job.

DataprocHook

Google Cloud Dataproc APIs.

DataprocAsyncHook

Asynchronous interaction with Google Cloud Dataproc APIs.

class airflow.providers.google.cloud.hooks.dataproc.DataProcJobBuilder(project_id, task_id, cluster_name, job_type, properties=None)[source]

A helper class for building Dataproc job.

add_labels(labels=None)[source]

Set labels for Dataproc job.

Parameters

labels (dict | None) – Labels for the job query.

add_variables(variables=None)[source]

Set variables for Dataproc job.

Parameters

variables (dict | None) – Variables for the job query.

add_args(args=None)[source]

Set args for Dataproc job.

Parameters

args (list[str] | None) – Args for the job query.

add_query(query)[source]

Set query for Dataproc job.

Parameters

query (str) – query for the job.

add_query_uri(query_uri)[source]

Set query uri for Dataproc job.

Parameters

query_uri (str) – URI for the job query.

add_jar_file_uris(jars=None)[source]

Set jars uris for Dataproc job.

Parameters

jars (list[str] | None) – List of jars URIs

add_archive_uris(archives=None)[source]

Set archives uris for Dataproc job.

Parameters

archives (list[str] | None) – List of archives URIs

add_file_uris(files=None)[source]

Set file uris for Dataproc job.

Parameters

files (list[str] | None) – List of files URIs

add_python_file_uris(pyfiles=None)[source]

Set python file uris for Dataproc job.

Parameters

pyfiles (list[str] | None) – List of python files URIs

set_main(main_jar=None, main_class=None)[source]

Set Dataproc main class.

Parameters
  • main_jar (str | None) – URI for the main file.

  • main_class (str | None) – Name of the main class.

Raises

Exception

set_python_main(main)[source]

Set Dataproc main python file uri.

Parameters

main (str) – URI for the python main file.

set_job_name(name)[source]

Set Dataproc job name.

Job name is sanitized, replacing dots by underscores.

Parameters

name (str) – Job name.

build()[source]

Return Dataproc job.

Returns

Dataproc job

Return type

dict

class airflow.providers.google.cloud.hooks.dataproc.DataprocHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook

Google Cloud Dataproc APIs.

All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.

get_cluster_client(region=None)[source]

Create a ClusterControllerClient.

get_template_client(region=None)[source]

Create a WorkflowTemplateServiceClient.

get_job_client(region=None)[source]

Create a JobControllerClient.

get_batch_client(region=None)[source]

Create a BatchControllerClient.

get_operations_client(region)[source]

Create a OperationsClient.

wait_for_operation(operation, timeout=None, result_retry=DEFAULT)[source]

Wait for a long-lasting operation to complete.

create_cluster(region, project_id, cluster_name, cluster_config=None, virtual_cluster_config=None, labels=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Create a cluster in a specified project.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Name of the cluster to create.

  • labels (dict[str, str] | None) – Labels that will be assigned to created cluster.

  • cluster_config (dict | google.cloud.dataproc_v1.Cluster | None) – The cluster config to create. If a dict is provided, it must be of the same form as the protobuf message ClusterConfig.

  • virtual_cluster_config (dict | None) – The virtual cluster config, used when creating a Dataproc cluster that does not directly control the underlying compute resources, for example, when creating a Dataproc-on-GKE cluster with VirtualClusterConfig.

  • request_id (str | None) – A unique id used to identify the request. If the server receives two CreateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

delete_cluster(region, cluster_name, project_id, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Delete a cluster in a project.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Name of the cluster to delete.

  • cluster_uuid (str | None) – If specified, the RPC should fail if cluster with the UUID does not exist.

  • request_id (str | None) – A unique id used to identify the request. If the server receives two DeleteClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

diagnose_cluster(region, cluster_name, project_id, tarball_gcs_dir=None, diagnosis_interval=None, jobs=None, yarn_application_ids=None, retry=DEFAULT, timeout=None, metadata=())[source]

Get cluster diagnostic information.

After the operation completes, the response contains the Cloud Storage URI of the diagnostic output report containing a summary of collected diagnostics.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Name of the cluster.

  • tarball_gcs_dir (str | None) – The output Cloud Storage directory for the diagnostic tarball. If not specified, a task-specific directory in the cluster’s staging bucket will be used.

  • diagnosis_interval (dict | google.type.interval_pb2.Interval | None) – Time interval in which diagnosis should be carried out on the cluster.

  • jobs (collections.abc.MutableSequence[str] | None) – Specifies a list of jobs on which diagnosis is to be performed. Format: projects/{project}/regions/{region}/jobs/{job}

  • yarn_application_ids (collections.abc.MutableSequence[str] | None) – Specifies a list of yarn applications on which diagnosis is to be performed.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

get_cluster(region, cluster_name, project_id, retry=DEFAULT, timeout=None, metadata=())[source]

Get the resource representation for a cluster in a project.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • cluster_name (str) – The cluster name.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

list_clusters(region, filter_, project_id, page_size=None, retry=DEFAULT, timeout=None, metadata=())[source]

List all regions/{region}/clusters in a project.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • filter – To constrain the clusters to. Case-sensitive.

  • page_size (int | None) – The maximum number of resources contained in the underlying API response. If page streaming is performed per-resource, this parameter does not affect the return value. If page streaming is performed per-page, this determines the maximum number of resources in a page.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

update_cluster(cluster_name, cluster, update_mask, project_id, region, graceful_decommission_timeout=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Update a cluster in a project.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • cluster_name (str) – The cluster name.

  • cluster (dict | google.cloud.dataproc_v1.Cluster) – Changes to the cluster. If a dict is provided, it must be of the same form as the protobuf message Cluster.

  • update_mask (dict | google.protobuf.field_mask_pb2.FieldMask) –

    Specifies the path, relative to Cluster, of the field to update. For example, to change the number of workers in a cluster to 5, this would be specified as config.worker_config.num_instances, and the PATCH request body would specify the new value:

    {"config": {"workerConfig": {"numInstances": "5"}}}
    

    Similarly, to change the number of preemptible workers in a cluster to 5, this would be config.secondary_worker_config.num_instances and the PATCH request body would be:

    {"config": {"secondaryWorkerConfig": {"numInstances": "5"}}}
    

    If a dict is provided, it must be of the same form as the protobuf message FieldMask.

  • graceful_decommission_timeout (dict | google.protobuf.duration_pb2.Duration | None) –

    Timeout for graceful YARN decommissioning. Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress. Timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs). Default timeout is 0 (for forceful decommission), and the maximum allowed timeout is one day.

    Only supported on Dataproc image versions 1.2 and higher.

    If a dict is provided, it must be of the same form as the protobuf message Duration.

  • request_id (str | None) – A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

start_cluster(region, project_id, cluster_name, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Start a cluster in a project.

Parameters
  • region (str) – Cloud Dataproc region to handle the request.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • cluster_name (str) – The cluster name.

  • cluster_uuid (str | None) – The cluster UUID

  • request_id (str | None) – A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

Returns

An instance of google.api_core.operation.Operation

Return type

google.api_core.operation.Operation

stop_cluster(region, project_id, cluster_name, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Start a cluster in a project.

Parameters
  • region (str) – Cloud Dataproc region to handle the request.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • cluster_name (str) – The cluster name.

  • cluster_uuid (str | None) – The cluster UUID

  • request_id (str | None) – A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

Returns

An instance of google.api_core.operation.Operation

Return type

google.api_core.operation.Operation

create_workflow_template(template, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]

Create a new workflow template.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The Dataproc workflow template to create. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

instantiate_workflow_template(template_name, project_id, region, version=None, request_id=None, parameters=None, retry=DEFAULT, timeout=None, metadata=())[source]

Instantiate a template and begins execution.

Parameters
  • template_name (str) – Name of template to instantiate.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • version (int | None) – Version of workflow template to instantiate. If specified, the workflow will be instantiated only if the current version of the workflow template has the supplied version. This option cannot be used to instantiate a previous version of workflow template.

  • request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.

  • parameters (dict[str, str] | None) – Map from parameter names to values that should be used for those parameters. Values may not exceed 100 characters.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

instantiate_inline_workflow_template(template, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Instantiate a template and begin execution.

Parameters
  • template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The workflow template to instantiate. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

wait_for_job(job_id, project_id, region, wait_time=10, timeout=None)[source]

Poll a job to check if it has finished.

Parameters
  • job_id (str) – Dataproc job ID.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • wait_time (int) – Number of seconds between checks.

  • timeout (int | None) – How many seconds wait for job to be ready.

get_job(job_id, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]

Get the resource representation for a job in a project.

Parameters
  • job_id (str) – Dataproc job ID.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

submit_job(job, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Submit a job to a cluster.

Parameters
  • job (dict | google.cloud.dataproc_v1.Job) – The job resource. If a dict is provided, it must be of the same form as the protobuf message Job.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

cancel_job(job_id, project_id, region=None, retry=DEFAULT, timeout=None, metadata=())[source]

Start a job cancellation request.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str | None) – Cloud Dataproc region to handle the request.

  • job_id (str) – The job ID.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

create_batch(region, project_id, batch, batch_id=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Create a batch workload.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • batch (dict | google.cloud.dataproc_v1.Batch) – The batch to create.

  • batch_id (str | None) – The ID to use for the batch, which will become the final component of the batch’s resource name. This value must be of 4-63 characters. Valid characters are [a-z][0-9]-.

  • request_id (str | None) – A unique id used to identify the request. If the server receives two CreateBatchRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

delete_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]

Delete the batch workload resource.

Parameters
  • batch_id (str) – The batch ID.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

get_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]

Get the batch workload resource representation.

Parameters
  • batch_id (str) – The batch ID.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

list_batches(region, project_id, page_size=None, page_token=None, retry=DEFAULT, timeout=None, metadata=(), filter=None, order_by=None)[source]

List batch workloads.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • page_size (int | None) – The maximum number of batches to return in each response. The service may return fewer than this value. The default page size is 20; the maximum page size is 1000.

  • page_token (str | None) – A page token received from a previous ListBatches call. Provide this token to retrieve the subsequent page.

  • retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

  • filter (str | None) – Result filters as specified in ListBatchesRequest

  • order_by (str | None) – How to order results as specified in ListBatchesRequest

wait_for_batch(batch_id, region, project_id, wait_check_interval=10, retry=DEFAULT, timeout=None, metadata=())[source]

Wait for a batch job to complete.

After submission of a batch job, the operator waits for the job to complete. This hook is, however, useful in the case when Airflow is restarted or the task pid is killed for any reason. In this case, the creation would happen again, catching the raised AlreadyExists, and fail to this function for waiting on completion.

Parameters
  • batch_id (str) – The batch ID.

  • region (str) – Cloud Dataproc region to handle the request.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • wait_check_interval (int) – The amount of time to pause between checks for job completion.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

class airflow.providers.google.cloud.hooks.dataproc.DataprocAsyncHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook

Asynchronous interaction with Google Cloud Dataproc APIs.

All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.

get_cluster_client(region=None)[source]

Create a ClusterControllerAsyncClient.

get_template_client(region=None)[source]

Create a WorkflowTemplateServiceAsyncClient.

get_job_client(region=None)[source]

Create a JobControllerAsyncClient.

get_batch_client(region=None)[source]

Create a BatchControllerAsyncClient.

get_operations_client(region)[source]

Create a OperationsClient.

async create_cluster(region, project_id, cluster_name, cluster_config=None, virtual_cluster_config=None, labels=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Create a cluster in a project.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Name of the cluster to create.

  • labels (dict[str, str] | None) – Labels that will be assigned to created cluster.

  • cluster_config (dict | google.cloud.dataproc_v1.Cluster | None) – The cluster config to create. If a dict is provided, it must be of the same form as the protobuf message ClusterConfig.

  • virtual_cluster_config (dict | None) – The virtual cluster config, used when creating a Dataproc cluster that does not directly control the underlying compute resources, for example, when creating a Dataproc-on-GKE cluster with VirtualClusterConfig.

  • request_id (str | None) – A unique id used to identify the request. If the server receives two CreateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async delete_cluster(region, cluster_name, project_id, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Delete a cluster in a project.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Name of the cluster to delete.

  • cluster_uuid (str | None) – If specified, the RPC should fail if cluster with the UUID does not exist.

  • request_id (str | None) – A unique id used to identify the request. If the server receives two DeleteClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async diagnose_cluster(region, cluster_name, project_id, tarball_gcs_dir=None, diagnosis_interval=None, jobs=None, yarn_application_ids=None, retry=DEFAULT, timeout=None, metadata=())[source]

Get cluster diagnostic information.

After the operation completes, the response contains the Cloud Storage URI of the diagnostic output report containing a summary of collected diagnostics.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Name of the cluster.

  • tarball_gcs_dir (str | None) – The output Cloud Storage directory for the diagnostic tarball. If not specified, a task-specific directory in the cluster’s staging bucket will be used.

  • diagnosis_interval (dict | google.type.interval_pb2.Interval | None) – Time interval in which diagnosis should be carried out on the cluster.

  • jobs (collections.abc.MutableSequence[str] | None) – Specifies a list of jobs on which diagnosis is to be performed. Format: projects/{project}/regions/{region}/jobs/{job}

  • yarn_application_ids (collections.abc.MutableSequence[str] | None) – Specifies a list of yarn applications on which diagnosis is to be performed.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async get_cluster(region, cluster_name, project_id, retry=DEFAULT, timeout=None, metadata=())[source]

Get the resource representation for a cluster in a project.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • cluster_name (str) – The cluster name.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async list_clusters(region, filter_, project_id, page_size=None, retry=DEFAULT, timeout=None, metadata=())[source]

List all regions/{region}/clusters in a project.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • filter – To constrain the clusters to. Case-sensitive.

  • page_size (int | None) – The maximum number of resources contained in the underlying API response. If page streaming is performed per-resource, this parameter does not affect the return value. If page streaming is performed per-page, this determines the maximum number of resources in a page.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async update_cluster(cluster_name, cluster, update_mask, project_id, region, graceful_decommission_timeout=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Update a cluster in a project.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • cluster_name (str) – The cluster name.

  • cluster (dict | google.cloud.dataproc_v1.Cluster) – Changes to the cluster. If a dict is provided, it must be of the same form as the protobuf message Cluster.

  • update_mask (dict | google.protobuf.field_mask_pb2.FieldMask) –

    Specifies the path, relative to Cluster, of the field to update. For example, to change the number of workers in a cluster to 5, this would be specified as config.worker_config.num_instances, and the PATCH request body would specify the new value:

    {"config": {"workerConfig": {"numInstances": "5"}}}
    

    Similarly, to change the number of preemptible workers in a cluster to 5, this would be config.secondary_worker_config.num_instances and the PATCH request body would be:

    {"config": {"secondaryWorkerConfig": {"numInstances": "5"}}}
    

    If a dict is provided, it must be of the same form as the protobuf message FieldMask.

  • graceful_decommission_timeout (dict | google.protobuf.duration_pb2.Duration | None) –

    Timeout for graceful YARN decommissioning. Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress. Timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs). Default timeout is 0 (for forceful decommission), and the maximum allowed timeout is one day.

    Only supported on Dataproc image versions 1.2 and higher.

    If a dict is provided, it must be of the same form as the protobuf message Duration.

  • request_id (str | None) – A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async create_workflow_template(template, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]

Create a new workflow template.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The Dataproc workflow template to create. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async instantiate_workflow_template(template_name, project_id, region, version=None, request_id=None, parameters=None, retry=DEFAULT, timeout=None, metadata=())[source]

Instantiate a template and begins execution.

Parameters
  • template_name (str) – Name of template to instantiate.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • version (int | None) – Version of workflow template to instantiate. If specified, the workflow will be instantiated only if the current version of the workflow template has the supplied version. This option cannot be used to instantiate a previous version of workflow template.

  • request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.

  • parameters (dict[str, str] | None) – Map from parameter names to values that should be used for those parameters. Values may not exceed 100 characters.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async instantiate_inline_workflow_template(template, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Instantiate a template and begin execution.

Parameters
  • template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The workflow template to instantiate. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async get_operation(region, operation_name)[source]
async get_job(job_id, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]

Get the resource representation for a job in a project.

Parameters
  • job_id (str) – Dataproc job ID.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async submit_job(job, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Submit a job to a cluster.

Parameters
  • job (dict | google.cloud.dataproc_v1.Job) – The job resource. If a dict is provided, it must be of the same form as the protobuf message Job.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async cancel_job(job_id, project_id, region=None, retry=DEFAULT, timeout=None, metadata=())[source]

Start a job cancellation request.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str | None) – Cloud Dataproc region to handle the request.

  • job_id (str) – The job ID.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async create_batch(region, project_id, batch, batch_id=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]

Create a batch workload.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • batch (dict | google.cloud.dataproc_v1.Batch) – The batch to create.

  • batch_id (str | None) – The ID to use for the batch, which will become the final component of the batch’s resource name. This value must be of 4-63 characters. Valid characters are [a-z][0-9]-.

  • request_id (str | None) – A unique id used to identify the request. If the server receives two CreateBatchRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async delete_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]

Delete the batch workload resource.

Parameters
  • batch_id (str) – The batch ID.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async get_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]

Get the batch workload resource representation.

Parameters
  • batch_id (str) – The batch ID.

  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

async list_batches(region, project_id, page_size=None, page_token=None, retry=DEFAULT, timeout=None, metadata=(), filter=None, order_by=None)[source]

List batch workloads.

Parameters
  • project_id (str) – Google Cloud project ID that the cluster belongs to.

  • region (str) – Cloud Dataproc region to handle the request.

  • page_size (int | None) – The maximum number of batches to return in each response. The service may return fewer than this value. The default page size is 20; the maximum page size is 1000.

  • page_token (str | None) – A page token received from a previous ListBatches call. Provide this token to retrieve the subsequent page.

  • retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.

  • timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.

  • filter (str | None) – Result filters as specified in ListBatchesRequest

  • order_by (str | None) – How to order results as specified in ListBatchesRequest

Was this entry helpful?