Building an ETL data pipeline: GCS-Dataprep-BigQuery

Source: Trifacta webpage

Cloud Dataprep is a managed service on Google Cloud, developed in partnership with Trifacta. It helps analysts reduce the time required to prepare data for analysis. Within the console, you have tools to explore, cleanse, and transform data without configuring or provisioning servers.

Dataprep comes in three editions: Cloud Dataprep Premium, Standard, and Legacy. Dataprep jobs run on Dataflow; when a job completes, Dataflow writes a template file to Cloud Storage (the temp directory).

Let us look at how we can use Dataprep to load data from GCS to BigQuery with a few transformations. Before we do that, there are some prerequisites to fulfill:

  • From APIs & Services, enable the Cloud Dataflow, BigQuery, and Cloud Storage APIs.
  • Since we are loading data from GCS to BigQuery, ensure you have a Cloud Storage bucket as well as a BigQuery dataset.
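As a quick sanity check before creating these resources, you can validate the names you plan to use. The sketch below is a simplified, stdlib-only approximation of the documented GCS bucket and BigQuery dataset naming rules; the helper functions are my own illustration, not part of any Google SDK, and the patterns do not cover every edge case in the official rules.

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    # GCS bucket names (simplified): 3-63 characters, lowercase letters,
    # digits, hyphens, underscores, and dots; must start and end with a
    # letter or digit.
    return bool(re.fullmatch(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]", name))

def is_valid_dataset_id(dataset_id: str) -> bool:
    # BigQuery dataset IDs: letters, digits, and underscores only.
    return bool(re.fullmatch(r"[A-Za-z0-9_]+", dataset_id))

print(is_valid_bucket_name("my-etl-source-bucket"))  # True
print(is_valid_dataset_id("sales_dataset"))          # True
print(is_valid_dataset_id("sales-dataset"))          # False (hyphen not allowed)
```

Catching an invalid name up front is cheaper than having a job fail later at the sink.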

Apart from the above, you need a staging bucket for staging assets. The first time you enable Cloud Dataprep, a Cloud Storage staging bucket is created automatically.

From the Google Cloud Console navigation menu, click Dataprep:

Navigate to Dataprep from Cloud Console

Clicking Dataprep opens a new tab where, on first use, you need to set up the staging area discussed above.

Dataprep homepage

From the home page, you may choose to Start from Scratch or Start from a Template. I chose Import Data from the home screen and selected GCS as the source.

Import Data from GCS using Dataprep

After selecting the required data from the GCS bucket, click the Add to New Flow radio button:

Create a flow from imported data

Clicking Continue takes you to a page where you can set up the flow for the dataset you loaded. I chose to add a recipe (recipes are the ETL steps that I intend to perform on the data).

When you click Add Recipe or Edit Recipe, Cloud Dataprep opens a new screen where it creates a dashboard automatically. At the top of each column are indicators that show distributions, missing values, and data types. You can drill down further to see more details.

Indicators, recipes, Data properties etc.
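Conceptually, these column indicators are simple per-column statistics: how many values are missing, what type the values look like, and how they are distributed. A minimal stdlib Python sketch of the same idea follows; the sample data and function names are made up for illustration and are not part of Dataprep.

```python
import csv
import io
from collections import Counter

SAMPLE_CSV = """name,age,city
Alice,34,London
Bob,,Paris
Carol,29,
"""

def infer_type(values):
    """Infer a coarse column type from non-missing values (simplified)."""
    try:
        for v in values:
            int(v)
        return "integer"
    except ValueError:
        return "string"

def profile(rows):
    """Per-column missing counts, inferred type, and top values:
    a rough stand-in for Dataprep's column indicators."""
    columns = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v not in ("", None)]
        columns[col] = {
            "missing": len(values) - len(present),
            "type": infer_type(present) if present else "unknown",
            "top": Counter(present).most_common(3),
        }
    return columns

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
stats = profile(rows)
print(stats["age"]["missing"], stats["age"]["type"])  # 1 integer
```

In Dataprep this profiling happens automatically on a sample of the data, which is what lets you spot problem columns before writing any transformation steps.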

Within the same page, you can click Recipe at the top and add a New Step to include a transformation. I have used a few very basic recipes, or transformations, for demonstration purposes.
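A recipe can be thought of as an ordered list of row-level transformation steps. The sketch below mimics three basic steps in plain Python: trimming whitespace, dropping rows with a missing field, and casting a column to integer. All names and data here are illustrative, not Dataprep's actual API or my exact recipe.

```python
def trim_values(row):
    # Step 1: strip leading/trailing whitespace from every string value.
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def drop_if_missing(field):
    # Step 2: filter out rows where the given field is empty.
    def step(row):
        return row if row.get(field) not in ("", None) else None
    return step

def cast_int(field):
    # Step 3: convert the given field from string to integer.
    def step(row):
        row = dict(row)
        row[field] = int(row[field])
        return row
    return step

RECIPE = [trim_values, drop_if_missing("age"), cast_int("age")]

def run_recipe(rows, recipe):
    out = []
    for row in rows:
        for step in recipe:
            row = step(row)
            if row is None:  # row was filtered out by a step
                break
        else:
            out.append(row)
    return out

rows = [{"name": " Alice ", "age": "34"}, {"name": "Bob", "age": ""}]
print(run_recipe(rows, RECIPE))
# [{'name': 'Alice', 'age': 34}]
```

Because each step is independent, steps can be reordered or removed, which is exactly how editing a Dataprep recipe feels in the UI.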

After adding all the recipes, click Run at the top, which takes you to the Run Job on Dataflow screen.

Click Edit, specify BigQuery as the sink, and select the dataset and table where you would like to load the data.

You can also create a new table to load the data into. When you are done selecting the target table, the end-to-end flow looks like this:

The ETL Flow Diagram

On clicking Run, a new job gets created:

The job execution screen

This page highlights information such as the Job ID, job status, job environment, and whether the job type is manual or scheduled. Since the job we set up above ran successfully, you can view the data within BigQuery.

View data in BigQuery table

Clicking View Dataflow Job on the job execution screen takes you to the Dataflow page in a new window, where you can view the job graph and other job details.

The Dataflow Pipeline Diagram

The Job info tab within Dataflow gives information on resource metrics, pipeline options, and more. I recommend going through this information, which in my view is very important to understand. For example, custom counters show how the Dataprep recipes translate into Dataflow transformations.

Read about product limitations in the Dataprep documentation.

Lead Programmer at Novartis Healthcare Pvt Ltd.