A 3-part series on data ingestion using Google Cloud.

This article sets the context for ingestion, introduces the services available to ingest data within GCP, and serves as a preface to the 3-part series.

Data engineers often create pipelines that access data from different sources within an organization to serve the needs of business stakeholders. This data arrives in varied formats, each with a different schema. Whether the end product is a BI dashboard or an ML model, data pipelines help streamline the process of preparing the data.

Every organization has its own way of working with pipelines. A typical data pipeline follows four steps, as shown in the diagram below.

Typical stages of building a data pipeline.

Ingestion is the most critical process when building a data pipeline. It is the process of reading data from data sources. Typically, ingestion happens either in batches or through streaming.

Batch ingestion collects records and extracts them as a group. It is sequential and processes records according to criteria set by developers. Streaming, an alternative ingestion paradigm, passes individual records one by one as they arrive. Organizations typically use streaming only when they need near-real-time data for applications or analytics.
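To make the difference concrete, here is a minimal sketch in plain Python. It does not use any GCP service; the function names `ingest_batch` and `ingest_stream`, the `batch_size` parameter, and the sample records are all hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, int]


def ingest_batch(source: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Group records and hand them over batch_size at a time."""
    batch: List[Record] = []
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit whatever is left once the source is exhausted.
    if batch:
        yield batch


def ingest_stream(source: Iterable[Record]) -> Iterator[Record]:
    """Pass each record along individually, as soon as it is read."""
    for record in source:
        yield record


# Hypothetical sample data standing in for a real data source.
records = [{"id": i} for i in range(5)]

batches = list(ingest_batch(records, batch_size=2))
streamed = list(ingest_stream(records))
```

In the batch case the consumer receives groups of records (here two full groups and one partial group), while in the streaming case it receives the same records one at a time. A real pipeline would replace the in-memory list with a file, database, or message queue.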

GCP offers various ingestion services to batch load or stream data from different sources and to build further pipelines as required.

Services available within GCP for Batch load and Streaming.

The list of ingestion services within GCP is not limited to those in the figure. We can also use other services, such as Dataflow or Dataproc, to ingest data from external files. Which service to use depends on the architectural design of your pipeline and on your data sources.

This 3-part series outlines how data can be ingested into GCP using various services. The series concentrates only on batch loading and discusses the topics below in detail.

1. Load data directly using the BigQuery UI and CLI. (Uses only BigQuery)

2. Load multiple file formats from Cloud Storage to BigQuery. (Uses GCS and BigQuery)

3. Load data into a BigQuery table from Cloud Storage using Python.

I look forward to publishing “Part 1” of this series and could not be more excited.

Thanks for reading.


Lead Programmer at Novartis Healthcare Pvt Ltd.