Data Ingestion and BigQuery — How do you load data from external sources into BigQuery?

Google BigQuery

Speed and scalability are the two biggest advantages of Google BigQuery, complementing its serverless architecture. BigQuery is a comparatively cost-effective data warehouse that can query data at petabyte scale. Being serverless lets customers concentrate on insights rather than on managing infrastructure. BigQuery is not limited to one stage of the data lifecycle: you can use it to ingest, process, store and even perform core analytics on data.

If you have an external data source, you can either load the data into BigQuery or query the source in place without loading it at all (a federated query). The key difference is performance: cloud-native tables offer higher performance and lower latency than external tables.

There are several ways of loading data into BigQuery, depending on various factors. You can choose to load data in batches or as a stream, or you can integrate with one of Google's existing partners (one example is Informatica's Intelligent Cloud Services offering).

Batch load, or batch ingestion, involves loading large datasets at regular intervals according to criteria set by developers. BigQuery can ingest both compressed and uncompressed files from Cloud Storage, which is often the preferred place to land your external data before loading it into BigQuery. Loading data into BigQuery does not incur any charges, while storing it does. Unless your business needs near-real-time data, you can use batch loading rather than streaming.

Although BigQuery supports loading data from multiple file formats, load speed varies by file type. Choosing the file format is important, as it affects not just the load speed but the cost of storage too. Avro (compressed or uncompressed), followed by Parquet and ORC, loads faster and stores more cheaply than CSV and JSON. If you have a choice, export your data in a format that is fast, efficient and cheap.
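For instance, if a source system only exports CSV but you would rather load newline-delimited JSON (the JSON layout BigQuery's load jobs expect), a small conversion step can run before the load. Here is a minimal sketch using only the Python standard library; the file paths and function name are illustrative, not part of any BigQuery API:

```python
import csv
import json

def csv_to_ndjson(csv_path, ndjson_path):
    """Convert a CSV file to newline-delimited JSON (one JSON object per line)."""
    with open(csv_path, newline="") as src, open(ndjson_path, "w") as dst:
        for row in csv.DictReader(src):
            dst.write(json.dumps(row) + "\n")
```

The resulting file can then be staged in a Cloud Storage bucket and loaded with `source_format=NEWLINE_DELIMITED_JSON`, as shown later in this article.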

Loading data into BigQuery

You can load data from either your computer or Cloud Storage (GCS) into a BigQuery table. Imagine you have already created a dataset and an empty table within BigQuery. To load external data, navigate to BigQuery from the Navigation menu.

Click “Create Table” and select the source of the data. If you choose “Cloud Storage”, browse to the storage bucket where the load file is saved; if you choose “Upload”, browse and select the file or enter the path where it is saved.

Loading file from GCS Bucket

You can either enter the schema manually or check the radio button to detect the schema automatically.

Under the advanced options, you can choose to overwrite the existing table, append to it, or write only if the table is empty. Once you click Create Table, the data loads from the file reference you provided and creates a table within BigQuery.

Demonstration of BQ Console while loading data.

Google's bq command-line interface (CLI) is a Python-based tool. To use the CLI, you must install and initialize the Cloud SDK; alternatively, you can activate Cloud Shell from the GCP console home screen, which opens a terminal with the SDK preinstalled. You can load, update, query, list and perform many more operations on data from the terminal using the bq command.

For example, the script below loads a JSON file named Netflix_Titles.json from Cloud Storage into a table Netflix_Titles under the sample dataset within BigQuery.

bq load \
  --autodetect \
  --source_format=NEWLINE_DELIMITED_JSON \
  sample.Netflix_Titles \
  gs://mybucket/Netflix_Titles.json

The --autodetect flag detects the schema automatically while reading the JSON file from the Cloud Storage bucket. If you want to specify a schema instead, you can use a schema file or type the schema inline. You can use flags such as --replace and --noreplace to overwrite the table or append to it (see the bq documentation).
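To illustrate the schema-file route: a BigQuery JSON schema file is simply a JSON array of column definitions, each with a name, type and mode. The sketch below writes one using the Python standard library; the column names are hypothetical ones for the Netflix dataset, not taken from any real schema:

```python
import json

# Hypothetical columns for the Netflix_Titles table; adjust to match your data.
schema = [
    {"name": "title", "type": "STRING", "mode": "REQUIRED"},
    {"name": "release_year", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "rating", "type": "STRING", "mode": "NULLABLE"},
]

with open("netflix_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```

You could then pass the file to the load command with --schema=netflix_schema.json in place of --autodetect.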

You can also load data into BigQuery from runtimes such as Java and Python. To use the BigQuery API, you must authenticate your identity using a service account. You can choose to authenticate either with Application Default Credentials or with a service account key file.

In the example below, a local CSV file is read into a pandas DataFrame, and the same data is then loaded into a BigQuery table using the load_table_from_dataframe() function. As a prerequisite, you should install pandas and pyarrow along with the BigQuery Python client library.

pip install "google-cloud-bigquery[pandas,pyarrow]"

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
df = pd.read_csv("<Path of your CSV>")  # Reads the local CSV file
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField(<SPECIFY SCHEMA>),
        bigquery.SchemaField(<SPECIFY SCHEMA>),
    ],
    write_disposition="WRITE_TRUNCATE",  # Replaces the existing table with the loaded data
)
table_id = "Project.Dataset.Table"
job = client.load_table_from_dataframe(
    df, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the load job to complete.

Apart from the above methods, you can also load data into BigQuery from pipelines built with Dataflow or Dataproc.

Lead Programmer at Novartis Healthcare Pvt Ltd.