airflow.providers.databricks.operators.databricks_sql

This module contains Databricks operators.

Module Contents

Classes

DatabricksSqlOperator

Executes SQL code in a Databricks SQL endpoint or a Databricks cluster.

DatabricksCopyIntoOperator

Executes COPY INTO command in a Databricks SQL endpoint or a Databricks cluster.

Attributes

COPY_INTO_APPROVED_FORMATS

class airflow.providers.databricks.operators.databricks_sql.DatabricksSqlOperator(*, databricks_conn_id=DatabricksSqlHook.default_conn_name, http_path=None, sql_endpoint_name=None, session_configuration=None, http_headers=None, catalog=None, schema=None, output_path=None, output_format='csv', csv_params=None, client_parameters=None, **kwargs)[source]

Bases: airflow.providers.common.sql.operators.sql.SQLExecuteQueryOperator

Executes SQL code in a Databricks SQL endpoint or a Databricks cluster.

See also

For more information on how to use this operator, take a look at the guide: DatabricksSqlOperator

Parameters
  • databricks_conn_id (str) – Reference to Databricks connection id (templated)

  • http_path (str | None) – Optional string specifying the HTTP path of the Databricks SQL endpoint or cluster. If not set here, it must either be provided in the Databricks connection’s extra parameters, or sql_endpoint_name must be specified.

  • sql_endpoint_name (str | None) – Optional name of Databricks SQL Endpoint. If not specified, http_path must be provided as described above.

  • sql – the SQL code to be executed, as a single string, a list of strings (SQL statements), or a reference to a template file. (templated) Template references are recognized by a string ending in ‘.sql’.

  • parameters – (optional) the parameters to render the SQL query with.

  • session_configuration – An optional dictionary of Spark session parameters. Defaults to None. If not set here, it can be provided in the Databricks connection’s extra parameters.

  • client_parameters (dict[str, Any] | None) – Additional parameters passed internally to the Databricks SQL Connector.

  • http_headers (list[tuple[str, str]] | None) – An optional list of (k, v) pairs that will be set as HTTP headers on every request. (templated)

  • catalog (str | None) – An optional initial catalog to use. Requires DBR version 9.0+ (templated)

  • schema (str | None) – An optional initial schema to use. Requires DBR version 9.0+ (templated)

  • output_path (str | None) – optional string specifying the file to which the selected data will be written. (templated)

  • output_format (str) – format of the output data if output_path is specified. Possible values are csv, json, jsonl. Default is csv. (See the usage sketch after this parameter list.)

  • csv_params (dict[str, Any] | None) – parameters that will be passed to the csv.DictWriter class used to write CSV data.
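A minimal usage sketch (not part of the original documentation), assuming an Airflow connection named databricks_default and a Databricks SQL endpoint named my-endpoint; the table name and output path are illustrative placeholders:

import pendulum

from airflow import DAG
from airflow.providers.databricks.operators.databricks_sql import DatabricksSqlOperator

with DAG(
    dag_id="example_databricks_sql",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
) as dag:
    # Run a query on the SQL endpoint and write the result to a local CSV file.
    select_to_csv = DatabricksSqlOperator(
        task_id="select_to_csv",
        databricks_conn_id="databricks_default",  # assumed connection id
        sql_endpoint_name="my-endpoint",          # assumed endpoint name
        sql="SELECT * FROM my_catalog.my_schema.my_table LIMIT 100",
        output_path="/tmp/my_table.csv",
        output_format="csv",
    )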

template_fields: Sequence[str][source]
template_ext: Sequence[str] = ('.sql',)[source]
template_fields_renderers: ClassVar[dict][source]
conn_id_field = 'databricks_conn_id'[source]
get_db_hook()[source]

Get the database hook for the connection.

Returns

the database hook object.

Return type

airflow.providers.databricks.hooks.databricks_sql.DatabricksSqlHook

airflow.providers.databricks.operators.databricks_sql.COPY_INTO_APPROVED_FORMATS = ['CSV', 'JSON', 'AVRO', 'ORC', 'PARQUET', 'TEXT', 'BINARYFILE'][source]
class airflow.providers.databricks.operators.databricks_sql.DatabricksCopyIntoOperator(*, table_name, file_location, file_format, databricks_conn_id=DatabricksSqlHook.default_conn_name, http_path=None, sql_endpoint_name=None, session_configuration=None, http_headers=None, client_parameters=None, catalog=None, schema=None, files=None, pattern=None, expression_list=None, credential=None, storage_credential=None, encryption=None, format_options=None, force_copy=None, copy_options=None, validate=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Executes COPY INTO command in a Databricks SQL endpoint or a Databricks cluster.

The COPY INTO command is constructed from individual pieces that are described in the documentation.

See also

For more information on how to use this operator, take a look at the guide: DatabricksCopyIntoOperator

Parameters
  • table_name (str) – Required name of the table. (templated)

  • file_location (str) – Required location of files to import. (templated)

  • file_format (str) – Required file format. Supported formats are CSV, JSON, AVRO, ORC, PARQUET, TEXT, BINARYFILE.

  • databricks_conn_id (str) – Reference to Databricks connection id (templated)

  • http_path (str | None) – Optional string specifying the HTTP path of the Databricks SQL endpoint or cluster. If not set here, it must either be provided in the Databricks connection’s extra parameters, or sql_endpoint_name must be specified.

  • sql_endpoint_name (str | None) – Optional name of Databricks SQL Endpoint. If not specified, http_path must be provided as described above.

  • session_configuration – An optional dictionary of Spark session parameters. Defaults to None. If not set here, it can be provided in the Databricks connection’s extra parameters.

  • http_headers (list[tuple[str, str]] | None) – An optional list of (k, v) pairs that will be set as HTTP headers on every request

  • catalog (str | None) – An optional initial catalog to use. Requires DBR version 9.0+

  • schema (str | None) – An optional initial schema to use. Requires DBR version 9.0+

  • client_parameters (dict[str, Any] | None) – Additional parameters passed internally to the Databricks SQL Connector.

  • files (list[str] | None) – optional list of files to import. Can’t be specified together with pattern. (templated)

  • pattern (str | None) – optional regex string to match file names to import. Can’t be specified together with files.

  • expression_list (str | None) – optional string that will be used in the SELECT expression.

  • credential (dict[str, str] | None) – optional credential configuration for authentication against a source location.

  • storage_credential (str | None) – optional Unity Catalog storage credential for destination.

  • encryption (dict[str, str] | None) – optional encryption configuration for a specified location.

  • format_options (dict[str, str] | None) – optional dictionary with options specific for a given file format.

  • force_copy (bool | None) – optional bool to control forcing of data import (can also be specified in copy_options).

  • validate (bool | int | None) – optional configuration for schema & data validation. True forces validation of all rows; an integer N validates only the first N rows.

  • copy_options (dict[str, str] | None) – optional dictionary of copy options. Currently only the force option is supported. (See the usage sketch after this parameter list.)
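A minimal usage sketch (illustrative values only): load CSV files from a cloud storage location into a table through a Databricks SQL endpoint. The connection id, endpoint name, table name, and bucket path are assumptions, not values from the documentation:

from airflow.providers.databricks.operators.databricks_sql import DatabricksCopyIntoOperator

# Import CSV files from object storage into a table; force_copy=True reloads
# files even if they were already imported previously.
copy_csv_into_table = DatabricksCopyIntoOperator(
    task_id="copy_csv_into_table",
    databricks_conn_id="databricks_default",     # assumed connection id
    sql_endpoint_name="my-endpoint",             # assumed endpoint name
    table_name="my_catalog.my_schema.my_table",  # placeholder destination table
    file_location="s3://my-bucket/incoming/",    # placeholder source location
    file_format="CSV",
    format_options={"header": "true", "inferSchema": "true"},
    force_copy=True,
)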

template_fields: Sequence[str] = ('file_location', 'files', 'table_name', 'databricks_conn_id')[source]
execute(context)[source]

This is the main method to derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

on_kill()[source]

Override this method to clean up subprocesses when a task instance gets killed.

Any use of the threading, subprocess or multiprocessing module within an operator needs to be cleaned up, or it will leave ghost processes behind.
