airflow.providers.microsoft.azure.hooks.data_lake¶
This module contains integration with Azure Data Lake.
AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (= Client ID), a password (= Client Secret), and the extra fields tenant (Tenant) and account_name (Account Name); see the connection azure_data_lake_default for an example, and the sketch below for one way to create such a connection programmatically.
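For illustration only, here is a minimal sketch of creating such a connection programmatically; connections are normally defined in the Airflow UI or via environment variables, and every credential value below is a placeholder:

```python
# Sketch: register an azure_data_lake connection for the hook to use.
# All values below are placeholders, not real credentials.
import json

from airflow.models import Connection
from airflow.settings import Session

conn = Connection(
    conn_id="azure_data_lake_default",
    conn_type="azure_data_lake",
    login="<CLIENT_ID>",         # login = Client ID
    password="<CLIENT_SECRET>",  # password = Client Secret
    extra=json.dumps({"tenant": "<TENANT>", "account_name": "<ACCOUNT_NAME>"}),
)

session = Session()
session.add(conn)
session.commit()
```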
Module Contents¶
Classes¶
AzureDataLakeHook | Interacts with Azure Data Lake.
- class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeHook(azure_data_lake_conn_id: str = default_conn_name)[source]¶
Bases:
airflow.hooks.base.BaseHook
Interacts with Azure Data Lake.
Client ID and client secret should be in the user and password parameters. Tenant and account name should be in the extra field as {"tenant": "<TENANT>", "account_name": "<ACCOUNT_NAME>"}.
- Parameters
azure_data_lake_conn_id (str) -- Reference to the Azure Data Lake connection.
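A brief usage sketch (the connection ID shown is the provider's default, so passing it explicitly is optional):

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

# azure_data_lake_conn_id defaults to "azure_data_lake_default".
hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
```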
- static get_connection_form_widgets() Dict[str, Any] [source]¶
Returns connection widgets to add to the connection form.
- get_conn(self) azure.datalake.store.core.AzureDLFileSystem [source]¶
Return an AzureDLFileSystem object.
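A hedged sketch of working with the returned client directly; exists and ls are methods of the azure-datalake-store AzureDLFileSystem client, not of the hook, and the remote path is a placeholder:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
adl = hook.get_conn()  # azure.datalake.store.core.AzureDLFileSystem

# The underlying client can be used for operations the hook does not wrap.
if adl.exists("raw/events"):     # placeholder remote path
    print(adl.ls("raw/events"))  # list entries under that path
```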
- upload_file(self, local_path: str, remote_path: str, nthreads: int = 64, overwrite: bool = True, buffersize: int = 4194304, blocksize: int = 4194304, **kwargs) None [source]¶
Upload a file to Azure Data Lake.
- Parameters
local_path (str) -- Local path. Can be a single file, a directory (in which case it is uploaded recursively), or a glob pattern. Recursive glob patterns using ** are not supported.
remote_path (str) -- Remote path to upload to; if multiple files, this is the directory root to write within.
nthreads (int) -- Number of threads to use. If None, uses the number of cores.
overwrite (bool) -- Whether to forcibly overwrite existing files/directories. If False and the remote path is a directory, will quit regardless of whether any files would be overwritten. If True, only matching filenames are actually overwritten.
buffersize (int) -- Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block.
blocksize (int) -- Number of bytes for a block (default 2**22). Within each chunk, we write a smaller block for each API call. This block cannot be bigger than a chunk.
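An illustrative call with placeholder paths; nthreads and overwrite are simply the documented defaults made explicit:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
hook.upload_file(
    local_path="/tmp/report.csv",                # single file, directory, or glob
    remote_path="analytics/reports/report.csv",  # placeholder remote path
    nthreads=64,
    overwrite=True,
)
```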
- download_file(self, local_path: str, remote_path: str, nthreads: int = 64, overwrite: bool = True, buffersize: int = 4194304, blocksize: int = 4194304, **kwargs) None [source]¶
Download a file from Azure Data Lake.
- Parameters
local_path (str) -- Local path. If downloading a single file, will write to this specific file, unless it is an existing directory, in which case a file is created within it. If downloading multiple files, this is the root directory to write within. Will create directories as required.
remote_path (str) -- Remote path/globstring to use to find remote files. Recursive glob patterns using ** are not supported.
nthreads (int) -- Number of threads to use. If None, uses the number of cores.
overwrite (bool) -- Whether to forcibly overwrite existing files/directories. If False and the remote path is a directory, will quit regardless of whether any files would be overwritten. If True, only matching filenames are actually overwritten.
buffersize (int) -- Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block.
blocksize (int) -- Number of bytes for a block (default 2**22). Within each chunk, we write a smaller block for each API call. This block cannot be bigger than a chunk.
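An illustrative call that downloads every file matched by a glob into a local directory; both paths are placeholders:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
hook.download_file(
    local_path="/tmp/reports",               # root directory to write into
    remote_path="analytics/reports/*.csv",   # glob; recursive ** not supported
    overwrite=True,
)
```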
- list(self, path: str) list [source]¶
List files in Azure Data Lake Storage.
- Parameters
path (str) -- Full path/globstring to use to list files in ADLS.
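An illustrative call with a placeholder globstring:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
for entry in hook.list("analytics/reports/*.csv"):
    print(entry)
```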