Google Cloud Dataflow Operators

Dataflow is a managed service for executing a wide variety of data processing patterns. These pipelines are created using the Apache Beam programming model which allows for both batch and streaming processing.

Prerequisite Tasks

Ways to run a data pipeline

There are several ways to run a Dataflow pipeline depending on your environment, source files:

  • Non-templated pipeline: Developer can run the pipeline as a local process on the Airflow worker if you have a ‘.jar’ file for Java or a ‘ .py` file for Python. This also means that the necessary system dependencies must be installed on the worker. For Java, worker must have the JRE Runtime installed. For Python, the Python interpreter. The runtime versions must be compatible with the pipeline versions. This is the fastest way to start a pipeline, but because of its frequent problems with system dependencies, it may cause problems. See: Java SDK pipelines, Python SDK pipelines for more detailed information.

  • Templated pipeline: The programmer can make the pipeline independent of the environment by preparing a template that will then be run on a machine managed by Google. This way, changes to the environment won’t affect your pipeline. There are two types of the templates:

    • Classic templates. Developers run the pipeline and create a template. The Apache Beam SDK stages files in Cloud Storage, creates a template file (similar to job request), and saves the template file in Cloud Storage. See: Templated jobs

    • Flex Templates. Developers package the pipeline into a Docker image and then use the gcloud command-line tool to build and save the Flex Template spec file in Cloud Storage. See: Templated jobs

  • SQL pipeline: Developer can write pipeline as SQL statement and then execute it in Dataflow. See: Dataflow SQL

It is a good idea to test your pipeline using the non-templated pipeline, and then run the pipeline in production using the templates.

For details on the differences between the pipeline types, see Dataflow templates in the Google Cloud documentation.

Starting Non-templated pipeline

To create a new pipeline using the source file (JAR in Java or Python file) use the create job operators. The source file can be located on GCS or on the local filesystem. DataflowCreateJavaJobOperator or DataflowCreatePythonJobOperator

Java SDK pipelines

For Java pipeline the jar argument must be specified for DataflowCreateJavaJobOperator as it contains the pipeline to be executed on Dataflow. The JAR can be available on GCS that Airflow has the ability to download or available on the local filesystem (provide the absolute path to it).

Here is an example of creating and running a pipeline in Java with jar stored on GCS:

airflow/providers/google/cloud/example_dags/example_dataflow.py[source]

start_java_job = BeamRunJavaPipelineOperator(
    task_id="start-java-job",
    jar=GCS_JAR,
    pipeline_options={
        'output': GCS_OUTPUT,
    },
    job_class='org.apache.beam.examples.WordCount',
    dataflow_config={
        "check_if_running": CheckJobRunning.IgnoreJob,
        "location": 'europe-west3',
        "poll_sleep": 10,
    },
)

Here is an example of creating and running a pipeline in Java with jar stored on local file system:

airflow/providers/google/cloud/example_dags/example_dataflow.py[source]

jar_to_local = GCSToLocalFilesystemOperator(
    task_id="jar-to-local",
    bucket=GCS_JAR_BUCKET_NAME,
    object_name=GCS_JAR_OBJECT_NAME,
    filename="/tmp/dataflow-{{ ds_nodash }}.jar",
)

start_java_job_local = BeamRunJavaPipelineOperator(
    task_id="start-java-job-local",
    jar="/tmp/dataflow-{{ ds_nodash }}.jar",
    pipeline_options={
        'output': GCS_OUTPUT,
    },
    job_class='org.apache.beam.examples.WordCount',
    dataflow_config={
        "check_if_running": CheckJobRunning.WaitForRun,
        "location": 'europe-west3',
        "poll_sleep": 10,
    },
)
jar_to_local >> start_java_job_local

Python SDK pipelines

The py_file argument must be specified for DataflowCreatePythonJobOperator as it contains the pipeline to be executed on Dataflow. The Python file can be available on GCS that Airflow has the ability to download or available on the local filesystem (provide the absolute path to it).

The py_interpreter argument specifies the Python version to be used when executing the pipeline, the default is python3`. If your Airflow instance is running on Python 2 - specify ``python2 and ensure your py_file is in Python 2. For best results, use Python 3.

If py_requirements argument is specified a temporary Python virtual environment with specified requirements will be created and within it pipeline will run.

The py_system_site_packages argument specifies whether or not all the Python packages from your Airflow instance, will be accessible within virtual environment (if py_requirements argument is specified), recommend avoiding unless the Dataflow job requires it.

airflow/providers/google/cloud/example_dags/example_dataflow.py[source]

start_python_job = BeamRunPythonPipelineOperator(
    task_id="start-python-job",
    py_file=GCS_PYTHON,
    py_options=[],
    pipeline_options={
        'output': GCS_OUTPUT,
    },
    py_requirements=['apache-beam[gcp]==2.21.0'],
    py_interpreter='python3',
    py_system_site_packages=False,
    dataflow_config={'location': 'europe-west3'},
)

Execution models

Dataflow has multiple options of executing pipelines. It can be done in the following modes: batch asynchronously (fire and forget), batch blocking (wait until completion), or streaming (run indefinitely). In Airflow it is best practice to use asynchronous batch pipelines or streams and use sensors to listen for expected job state.

By default DataflowCreateJavaJobOperator, DataflowCreatePythonJobOperator, DataflowTemplatedJobStartOperator and DataflowStartFlexTemplateOperator have argument wait_until_finished set to None which cause different behaviour depends on the type of pipeline:

  • for the streaming pipeline, wait for jobs to start,

  • for the batch pipeline, wait for the jobs to complete.

If wait_until_finished is set to True operator will always wait for end of pipeline execution. If set to False only submits the jobs.

See: Configuring PipelineOptions for execution on the Cloud Dataflow service

Asynchronous execution

Dataflow batch jobs are by default asynchronous - however this is dependent on the application code (contained in the JAR or Python file) and how it is written. In order for the Dataflow job to execute asynchronously, ensure the pipeline objects are not being waited upon (not calling waitUntilFinish or wait_until_finish on the PipelineResult in your application code).

airflow/providers/google/cloud/example_dags/example_dataflow.py[source]

start_python_job_async = BeamRunPythonPipelineOperator(
    task_id="start-python-job-async",
    runner="DataflowRunner",
    py_file=GCS_PYTHON,
    py_options=[],
    pipeline_options={
        'output': GCS_OUTPUT,
    },
    py_requirements=['apache-beam[gcp]==2.25.0'],
    py_interpreter='python3',
    py_system_site_packages=False,
    dataflow_config={
        "job_name": "start-python-job-async",
        "location": 'europe-west3',
        "wait_until_finished": False,
    },
)

Blocking execution

In order for a Dataflow job to execute and wait until completion, ensure the pipeline objects are waited upon in the application code. This can be done for the Java SDK by calling waitUntilFinish on the PipelineResult returned from pipeline.run() or for the Python SDK by calling wait_until_finish on the PipelineResult returned from pipeline.run().

Blocking jobs should be avoided as there is a background process that occurs when run on Airflow. This process is continuously being run to wait for the Dataflow job to be completed and increases the consumption of resources by Airflow in doing so.

Streaming execution

To execute a streaming Dataflow job, ensure the streaming option is set (for Python) or read from an unbounded data source, such as Pub/Sub, in your pipeline (for Java).

Setting argument drain_pipeline to True allows to stop streaming job by draining it instead of canceling during killing task instance.

See the Stopping a running pipeline.

Templated jobs

Templates give the ability to stage a pipeline on Cloud Storage and run it from there. This provides flexibility in the development workflow as it separates the development of a pipeline from the staging and execution steps. There are two types of templates for Dataflow: Classic and Flex. See the official documentation for Dataflow templates for more information.

Here is an example of running Classic template with DataflowTemplatedJobStartOperator:

airflow/providers/google/cloud/example_dags/example_dataflow.py[source]

start_template_job = DataflowTemplatedJobStartOperator(
    task_id="start-template-job",
    template='gs://dataflow-templates/latest/Word_Count',
    parameters={'inputFile': "gs://dataflow-samples/shakespeare/kinglear.txt", 'output': GCS_OUTPUT},
    location='europe-west3',
)

See the list of Google-provided templates that can be used with this operator.

Here is an example of running Flex template with DataflowStartFlexTemplateOperator:

airflow/providers/google/cloud/example_dags/example_dataflow_flex_template.py[source]

start_flex_template = DataflowStartFlexTemplateOperator(
    task_id="start_flex_template_streaming_beam_sql",
    body={
        "launchParameter": {
            "containerSpecGcsPath": GCS_FLEX_TEMPLATE_TEMPLATE_PATH,
            "jobName": DATAFLOW_FLEX_TEMPLATE_JOB_NAME,
            "parameters": {
                "inputSubscription": PUBSUB_FLEX_TEMPLATE_SUBSCRIPTION,
                "outputTable": f"{GCP_PROJECT_ID}:{BQ_FLEX_TEMPLATE_DATASET}.streaming_beam_sql",
            },
        }
    },
    do_xcom_push=True,
    location=BQ_FLEX_TEMPLATE_LOCATION,
)

Dataflow SQL

Dataflow SQL supports a variant of the ZetaSQL query syntax and includes additional streaming extensions for running Dataflow streaming jobs.

Here is an example of running Dataflow SQL job with DataflowStartSqlJobOperator:

airflow/providers/google/cloud/example_dags/example_dataflow_sql.py[source]

start_sql = DataflowStartSqlJobOperator(
    task_id="start_sql_query",
    job_name=DATAFLOW_SQL_JOB_NAME,
    query=f"""
        SELECT
            sales_region as sales_region,
            count(state_id) as count_state
        FROM
            bigquery.table.`{GCP_PROJECT_ID}`.`{BQ_SQL_DATASET}`.`{BQ_SQL_TABLE_INPUT}`
        WHERE state_id >= @state_id_min
        GROUP BY sales_region;
    """,
    options={
        "bigquery-project": GCP_PROJECT_ID,
        "bigquery-dataset": BQ_SQL_DATASET,
        "bigquery-table": BQ_SQL_TABLE_OUTPUT,
        "bigquery-write-disposition": "write-truncate",
        "parameter": "state_id_min:INT64:2",
    },
    location=DATAFLOW_SQL_LOCATION,
    do_xcom_push=True,
)

Warning

This operator requires gcloud command (Google Cloud SDK) must be installed on the Airflow worker <https://cloud.google.com/sdk/docs/install>`__

See the Dataflow SQL reference.

Sensors

When job is triggered asynchronously sensors may be used to run checks for specific job properties.

DataflowJobStatusSensor.

airflow/providers/google/cloud/example_dags/example_dataflow.py[source]

wait_for_python_job_async_done = DataflowJobStatusSensor(
    task_id="wait-for-python-job-async-done",
    job_id="{{task_instance.xcom_pull('start-python-job-async')['dataflow_job_id']}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    location='europe-west3',
)

DataflowJobMetricsSensor.

airflow/providers/google/cloud/example_dags/example_dataflow.py[source]

def check_metric_scalar_gte(metric_name: str, value: int) -> Callable:
    """Check is metric greater than equals to given value."""

    def callback(metrics: list[dict]) -> bool:
        dag_native_python_async.log.info("Looking for '%s' >= %d", metric_name, value)
        for metric in metrics:
            context = metric.get("name", {}).get("context", {})
            original_name = context.get("original_name", "")
            tentative = context.get("tentative", "")
            if original_name == "Service-cpu_num_seconds" and not tentative:
                return metric["scalar"] >= value
        raise AirflowException(f"Metric '{metric_name}' not found in metrics")

    return callback

wait_for_python_job_async_metric = DataflowJobMetricsSensor(
    task_id="wait-for-python-job-async-metric",
    job_id="{{task_instance.xcom_pull('start-python-job-async')['dataflow_job_id']}}",
    location='europe-west3',
    callback=check_metric_scalar_gte(metric_name="Service-cpu_num_seconds", value=100),
    fail_on_terminal_state=False,
)

DataflowJobMessagesSensor.

airflow/providers/google/cloud/example_dags/example_dataflow.py[source]

def check_message(messages: list[dict]) -> bool:
    """Check message"""
    for message in messages:
        if "Adding workflow start and stop steps." in message.get("messageText", ""):
            return True
    return False

wait_for_python_job_async_message = DataflowJobMessagesSensor(
    task_id="wait-for-python-job-async-message",
    job_id="{{task_instance.xcom_pull('start-python-job-async')['dataflow_job_id']}}",
    location='europe-west3',
    callback=check_message,
    fail_on_terminal_state=False,
)

DataflowJobAutoScalingEventsSensor.

airflow/providers/google/cloud/example_dags/example_dataflow.py[source]

def check_autoscaling_event(autoscaling_events: list[dict]) -> bool:
    """Check autoscaling event"""
    for autoscaling_event in autoscaling_events:
        if "Worker pool started." in autoscaling_event.get("description", {}).get("messageText", ""):
            return True
    return False

wait_for_python_job_async_autoscaling_event = DataflowJobAutoScalingEventsSensor(
    task_id="wait-for-python-job-async-autoscaling-event",
    job_id="{{task_instance.xcom_pull('start-python-job-async')['dataflow_job_id']}}",
    location='europe-west3',
    callback=check_autoscaling_event,
    fail_on_terminal_state=False,
)

Was this entry helpful?