Federated and External Queries — The what and how?

BigQuery has the capability that allow users to query data from external data sources without the need of loading data at all. Currently BQ supports GC storage systems such as Cloud Storage, Bigtable, Google Drive and Cloud SQL. To use the federated queries, you should colocate your BQ dataset and the external data source. If you are using Cloud Bigtable, you might want to check the supported locations before making a decision.

If you are running a federated query on Cloud Storage or Google Drive or Cloud Bigtable, the process is pretty much the same with minor to no changes.

Using Cloud Console

Navigate to BigQuery console, and click create table under the desired project :: dataset to run the federated query like you do with a BQ table. Choose the desired source of your choice. Please note, you cannot create an external table for Cloud Bigtable from cloud console due to current limitations.

You can select the external data source as highlighted and under table type select “External Table”.

Using CLI

You can either create a permanent table or a temporary one in order to run a query.

##Cloud Storage - Run Query on permanent table.#Make table definition file
bq mkdef \
--autodetect \
--source_format=CSV "gs://bucket-fedqueries/netflix_titles.csv" > netflix.csv
#Create External Table
bq mk --external_table_definition= netflix.csv Sample.netflix01
#Run Federated Query as if it were a native BigQuery table.##Cloud Storage - Run Query on a temporary table.bq query \
--external_table_definition = netflix::show_id:STRING,release_year:INTEGER,Country:STRING, Genre:STRING@CSV=gs://bucket-fedqueries/netflix_titles.csv \
'SELECT
Genre,
Count(show_id) as Total_Shows
FROM
netflix
WHERE
release_year="2020"
GROUP BY
Type
ORDER BY
Total_Shows'

If you are querying data in Google drive, you need to authenticate drive before creating table definition files.

##Google Drive - Run Query on temporary table.#Authenticate Google Drivegcloud auth login --enable-gdrive-accessbq query \
--external_table_definition = /tmp/netflix::show_id:STRING,release_year:INTEGER,Country:STRING, Genre:STRING@CSV=https://drive.google.com/file/d/Netflix.csv\
'SELECT
Genre,
Count(show_id) as Total_Shows
FROM
netflix
WHERE
release_year="2020"
GROUP BY
Type
ORDER BY
Total_Shows'

Running a federated query on Cloud Bigtable is almost same as G Drive or GCS, except you need to create a table definition file manually as mkdef does not support Bigtable currently.

To run an external query on Cloud SQL, we need to create a Cloud SQL instance and then add the data using BigQuery Console.

Using Cloud Console

You can create Cloud Instance from Navigation Menu -> SQL -> Select the desired DB (MySQL/PostgreSQL) -> Enter the details as required. Under Create instance page, you can customize machine type, storage type etc.

Create an SQL Instance using the preferred DB.

Once the instance is created, navigate to BigQuery and click Add data. Fill in the details of cloud instance you have created in the above step:

“Add Data” under BigQuery console.

You can copy the cloud instance id from cloud instance page:

Connection name = Cloud instance id

As soon as you click create, you would see the external connection under your project.

Cloud SQL connection for BigQuery

You are all set to query the MySQL database available under Cloud SQL from BigQuery. A query run from console looks like this:

SELECT * FROM EXTERNAL_QUERY
(“project.us.Connect_to_MySQL”, “SELECT * FROM INFORMATION_SCHEMA.TABLES;”);

You can also create the Cloud SQL instance and add data into BQ using CLI commands gcloud and bq mk.

Before running a federated query on any external data source, one should have required IAM permissions, enable necessary APIs, read about limitations, location considerations and pricing.

References:

Lead Programmer at Novartis Healthcare Pvt Ltd.