Automating the data movement from GCS to BQ using Data Transfer Service

BigQuery Data Transfer Service is an offering by GCP that automates the movement of data from different external sources into BigQuery under given conditions on periodic basis.

Image picked from Google Cloud Documentation.

Currently DTS supports few Google SaaS apps such as Campaign Manager, Google Ads etc., Amazon S3 (Cloud storage), and couple of data warehouses such as Teradata and Amazon Redshift. You can also automate data load from Salesforce CRM and Adobe Analytics using third party transfers for DTS available under Google Cloud Marketplace.

Examples of third party transfers available currently in Google Market Place.

Like any other GCP service, DTS can be initiated from Cloud Console, CLI, or an API. On configuring the data transfer, data will be automatically loaded into BigQuery as per the schedule.

To use this service and create transfers you should be grated bigquery.admin role. Under APIs & Services console, ensure you enable BigQuery API as well as BigQuery Data Transfer Service.

In this example, Let us assume we have a cloud storage bucket with URI gs://bucket-dts-bq/ which has several csv files and there’s a destination table created in BigQuery as dtstransfer.salesdata.

Navigate to BigQuery page, and click Data Transfers under the left pane. At this stage I assume you have already enabled APIs and Services mentioned above.

Create A Transfer from BQ data transfer screen

On create transfer page, you would need to make selections required for your data transfer.

  • Select the source type, which in our case is Google cloud storage. You can choose from a variety of sources, or even select third party transfers from Google Cloud Marketplace.
  • Give a name to your transfer job under Transfer config name. You would need to choose a schedule option, which enables a trigger for the data load. You may choose to load data at any interval you want. By default, DTS allows Daily, Weekly and Monthly. For example, if you want to load data bi-weekly, you may choose Custom.
You can even change the time zone depending upon where your data is loading from
  • You can select the data source details such as destination table, cloud storage URI etc. Under Write preferences, you may choose to either append or mirror (Please note mirror overwrites the fresh batch of data every time the scheduler runs).
  • You can choose to delete the source file after each successful transfer by checking the option available. You may even choose to receive email notifications about the transfer run.
Selecting data source and providing write preferences.

Imagine there are problems with source data, or there is an outage at the source. You can initiate data backfills. This process is called Refreshing and you can initiate it from your data transfer console. Please note this option is enabled only for runtime parameterized transfer configurations.

Parameterization is process of setting up runtime parameters in order to load data by run_time. Using runtime parameters, you can specify how you want to partition the destination table. You can easily retrieve files by matching a particular run_time.

For example, salesdata$20210403 (Table_name$Run_Time) is the file loaded into partitioned BQ table from GCS at specific run time.

The above assumptions apply here as well.

#Set transfer configuration — Wild card (*) enables to read all files in GCS bucket instead of listing each file individually.#Make transfer configuration same way we create DTS from mk --transfer_config -- data_source=google_cloud_storage \ 
-- target_dataset= dtstransfer -- display_name “gcs-bq-transfer” \
-- params=’{“data_path_template”:” gs://bucket-dts-bq/*.csv”, “destination_table_name_template”:”salesdata”, “file_format”:”CSV”,“max_bad_records”:”1", “skip_leading_rows”:”1", “delete_source_files”:”true”}’\ -- data_source=google_cloud_storage
  1. Whether you are appending the table or supplying wildcard option, it is important that the files share same schema you define in the destination table.
  2. You should ensure to set up the destination table along with schema before configuring the transfer job.
  3. Unless required, it is not advisable to mirror the data. It is always preferable to append the tables than mirror.
  4. Limitations vary depending on the file formats you choose to transfer. Please read documentation.
  5. For all transfers standard BQ storage pricing applies. In some cases, your data will not be deleted from GCS bucket unless you delete it. Additional GCS costs apply here.
  6. Ensure to disable or delete the transfer job if it is no more required to avoid data issues.

Lead Programmer at Novartis Healthcare Pvt Ltd.