Oracle to BigQuery— Create a pipeline to extract data from Oracle and load it to BQ using Data Fusion.

Cloud Data Fusion is a fully managed, code free web based data integration service that helps build data pipelines quickly and efficiently. An instance in Data Fusion is composed of several Google cloud services such as Cloud Storage, Persistent disk, Cloud Key Management Service etc. These services are used to develop and manage pipelines build on Data Fusion.

When you run a data pipeline, data fusion automatically provisions Dataproc clusters, run pipelines, and tears down the clusters once the job is completed. User may choose to run pipelines on existing Dataproc clusters too.

Before we start extracting data from Oracle to BigQuery, navigate to APIs and Services within GCP navigation menu, enable Cloud Data Fusion, Cloud Storage, BigQuery, and Cloud Dataproc APIs. Since we are using BigQuery as sink, make sure you create a BigQuery dataset.

When using Cloud Data Fusion, you use both the Cloud Console and the separate Cloud Data Fusion UI. In the Cloud Console, you can create a Google Cloud project, create and delete Cloud Data Fusion instances, and view instance details. In the Cloud Data Fusion UI, you can use the various pages such as Pipeline Studio or Wrangler to utilize Cloud Data Fusion functionality.

Navigate to Cloud Data Fusion UI from GCP console. Click Create Instance, enter the instance name and description, specify region where you want the instance to run. You may choose the version or edition of Data Fusion. It will take 20–30 minutes for the instance to be created.

On creating the instance, click View Instance under action.

Data Fusion under GCP Console — Create, delete and modify instance

Clicking on view instance takes you to the separate Cloud Data Fusion UI.

On clicking View Instance (Refer above image)
Data Fusion Web UI

To connect to Oracle DB using Data Fusion, you need to install JDBC Driver. To do so, you can click HUB on the Data Fusion UI. Depending upon your source, you may choose the driver to install. In our case, it would be Oracle. Follow the instructions to download and deploy the driver.

Click HUB on top of the UI
Navigate to Drivers and click the preferred DB of your choice

Alternatively, from Data Fusion UI, navigate to studio by clicking menu and click “+” button to upload the driver from local desktop or a shared drive.

Upload driver using Studio Console

On setting up JDBC, navigate back to HUB to deploy Oracle Plugins in order to read and write from an Oracle instance.

Install Plug-In to enable read and from DB

You can now connect to the On-Prem Oracle DB by navigating back to studio page, click Oracle under Source options to extract required data. And under Sink, click BigQuery to load the data.

Source and Sink selection from Data Fusion UI

Ensure to change properties of both Source and Sink by hovering mouse over the template. You can write a query under Properties -> Import Query to extract data from Source. Similarly, you can load data into BigQuery by defining valid properties under Sink.

You are now all set to extract and load data from Oracle to BigQuery by running the above pipeline.

Some key considerations

  • You can choose to create private instance over private VPC network to securely access your on premise data. (Create VPC network via VPC networks page from Cloud Console).
  • It is always a best practice to do any transformations such as joins before loading data to BigQuery. This helps lower the storage requirements.
  • For pipelines, you can configure the schedule run. You can very well configure Error handlers and alerts as needed.
  • Data Fusion comes with limited number of transformations. Custom tasks that cannot be performed using Data Fusion can be written in Python and integrated into Cloud Composer DAG for an end to end orchestration of the workflow.
  • Apart from using Cloud Console, you can deploy pipelines using REST API.
  • By enabling Cloud Logging during the creation of Data Fusion instance, you can view logs from Logs Viewer in Stackdriver.

Lead Programmer at Novartis Healthcare Pvt Ltd.