airflow.contrib.operators.mysql_to_gcs

Module Contents

- class airflow.contrib.operators.mysql_to_gcs.MySqlToGoogleCloudStorageOperator(sql, bucket, filename, schema_filename=None, approx_max_file_size_bytes=1900000000, mysql_conn_id='mysql_default', google_cloud_storage_conn_id='google_cloud_default', schema=None, delegate_to=None, export_format='json', field_delimiter=',', *args, **kwargs)[source]

  Bases: airflow.models.BaseOperator
Copy data from MySQL to Google cloud storage in JSON or CSV format.
The JSON data files generated are newline-delimited to enable them to be loaded into BigQuery. Reference: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#limitations
- Parameters
sql (str) – The SQL to execute on the MySQL table.
bucket (str) – The bucket to upload to.
filename (str) – The filename to use as the object name when uploading to Google cloud storage. A {} should be specified in the filename to allow the operator to inject file numbers in cases where the file is split due to size.
schema_filename (str) – If set, the filename to use as the object name when uploading a .json file containing the BigQuery schema fields for the table that was dumped from MySQL.
approx_max_file_size_bytes (long) – This operator supports the ability to split large table dumps into multiple files (see notes in the filename param docs above). Google cloud storage allows for files to be a maximum of 4GB. This param allows developers to specify the file size of the splits.
mysql_conn_id (str) – Reference to a specific MySQL hook.
google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook.
schema (str or list) – The schema to use, if any. Should be a list of dict or a str. Pass a string if using a Jinja template; otherwise, pass a list of dicts. Examples can be seen at: https://cloud.google.com/bigquery/docs/schemas#specifying_a_json_schema_file
delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
export_format (str) – Desired format of files to be exported.
field_delimiter (str) – The delimiter to be used for CSV files.
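A minimal usage sketch, assuming the default connection ids listed above; the DAG id, SQL, bucket, and object names are purely illustrative:

    # Illustrative only: DAG id, SQL, bucket, and object names are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator

    with DAG(dag_id='example_mysql_to_gcs',
             start_date=datetime(2019, 1, 1),
             schedule_interval='@daily') as dag:

        export_orders = MySqlToGoogleCloudStorageOperator(
            task_id='export_orders',
            sql='SELECT * FROM orders',
            bucket='my-bucket',
            filename='orders/export_{}.json',       # {} is replaced with the split number
            schema_filename='orders/schema.json',   # optional BigQuery schema object
            mysql_conn_id='mysql_default',
            google_cloud_storage_conn_id='google_cloud_default',
            export_format='json',
        )

The exported splits can then be loaded into BigQuery with a newline-delimited JSON load job.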
- _write_local_data_files(self, cursor)[source]

  Takes a cursor, and writes results to a local file.
- Returns
A dictionary where keys are filenames to be used as object names in GCS, and values are file handles to local files that contain the data for the GCS objects.
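For illustration, with filename='orders/export_{}.json' and two file splits, the returned mapping would look roughly like the sketch below (object names and temp files are hypothetical):

    # Illustrative shape of the return value: keys are GCS object names,
    # values are handles to local temp files holding each split's rows.
    import tempfile

    files_to_upload = {
        'orders/export_0.json': tempfile.NamedTemporaryFile(),
        'orders/export_1.json': tempfile.NamedTemporaryFile(),
    }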
- _configure_csv_file(self, file_handle, schema)[source]

  Configure a csv writer with the file_handle and write schema as headers for the new file.
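As a rough sketch of the idea (not the operator's exact implementation), this amounts to building a writer on the open file handle and emitting the column names as the first row:

    # Sketch only: configure a csv writer on an open file handle and write the
    # schema (a list of column names) as the header row.
    import csv

    def configure_csv_file(file_handle, schema, field_delimiter=','):
        csv_writer = csv.writer(file_handle, delimiter=field_delimiter)
        csv_writer.writerow(schema)
        return csv_writer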
- _write_local_schema_file(self, cursor)[source]

  Takes a cursor, and writes the BigQuery schema in .json format for the results to a local file system.
- Returns
A dictionary where the key is a filename to be used as an object name in GCS, and the value is a file handle to a local file that contains the BigQuery schema fields in .json format.
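For illustration, the schema file written for a hypothetical orders table would contain BigQuery field definitions along these lines (field names, types, and modes are made up):

    # Hypothetical schema content, shown as the equivalent Python structure;
    # the field names and MySQL-to-BigQuery type mappings are illustrative.
    schema_fields = [
        {'name': 'id',         'type': 'INTEGER',   'mode': 'REQUIRED'},
        {'name': 'customer',   'type': 'STRING',    'mode': 'NULLABLE'},
        {'name': 'created_at', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'},
    ]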
- _upload_to_gcs(self, files_to_upload)[source]

  Upload all of the file splits (and optionally the schema .json file) to Google cloud storage.
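A sketch of what the upload step amounts to, using the contrib GoogleCloudStorageHook; the bucket and connection id are illustrative, and error handling is omitted:

    # Sketch only: push each split (GCS object name -> local temp file) to the
    # target bucket via the contrib GCS hook.
    from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

    def upload_to_gcs(files_to_upload, bucket, gcs_conn_id='google_cloud_default'):
        hook = GoogleCloudStorageHook(google_cloud_storage_conn_id=gcs_conn_id)
        for object_name, tmp_file_handle in files_to_upload.items():
            hook.upload(bucket, object_name, tmp_file_handle.name, 'application/json')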
- static _convert_types(schema, col_type_dict, row)[source]

  Takes a value from MySQLdb, and converts it to a value that's safe for JSON/Google cloud storage/BigQuery. Dates are converted to UTC seconds. Decimals are converted to floats. Binary type fields are encoded with base64, as imported BYTES data must be base64-encoded according to the BigQuery SQL data type documentation: https://cloud.google.com/bigquery/data-types
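An illustrative sketch of this kind of per-value conversion (not the operator's exact code):

    # Sketch only: dates/datetimes become UTC epoch seconds, Decimals become
    # floats, and binary values are base64-encoded for BigQuery BYTES columns.
    import base64
    import calendar
    import datetime
    from decimal import Decimal

    def convert_value(value):
        if isinstance(value, (datetime.datetime, datetime.date)):
            return calendar.timegm(value.timetuple())
        if isinstance(value, Decimal):
            return float(value)
        if isinstance(value, bytes):
            return base64.standard_b64encode(value).decode('ascii')
        return value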