Cloud BigQuery; An Introduction

Google's BigQuery is an enterprise-grade cloud-native data warehouse. BigQuery was first launched as a service in 2010 with general availability in November 2011. Since its inception, BigQuery has evolved into a more economical, fully managed data warehouse that can run fast interactive and ad-hoc queries on petabyte-scale datasets. In addition, BigQuery now integrates with a variety of Google Cloud Platform (GCP) services and third-party tools, which makes it more useful.

BigQuery is serverless, or more precisely, a data warehouse as a service. There are no servers to manage or database software to install. The BigQuery service manages the underlying software as well as the infrastructure, including scalability and high availability. The pricing model is quite simple - you pay $5 for every 1 TB of data processed. BigQuery exposes a simple client interface that enables users to run interactive queries.
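The pricing rule above is simple arithmetic, and a short sketch makes it concrete. The function name and the choice to treat 1 TB as 2^40 bytes are assumptions made for illustration; actual billing rules may differ from this simplified model.

```python
# Illustrative cost estimate only; the $5 per TB rate is the figure quoted in the text.
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 2 ** 40  # assumption: 1 TB is treated as 2^40 bytes


def estimate_query_cost(bytes_processed: int,
                        price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the estimated cost in USD for a query that processes bytes_processed bytes."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    scanned = 250 * 2 ** 30  # a hypothetical query that processes 250 GiB
    print(f"${estimate_query_cost(scanned):.2f}")  # prints $1.22
```

Because the charge depends on the data processed rather than on the size of the result, reading fewer columns or fewer rows directly lowers the estimate.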

Overall, you don't need to know much about the underlying BigQuery architecture or how the service operates under the hood. That's the whole idea of BigQuery - you don't need to worry about architecture and operation. To get started with BigQuery, you must be able to import your data into BigQuery and then write your queries using the SQL dialects offered by BigQuery.

Having said that, a good understanding of the BigQuery architecture is useful when implementing various BigQuery best practices, including controlling costs, optimizing query performance, and optimizing storage. For instance, for best query performance, it is highly beneficial to understand how BigQuery allocates resources and the relationship between the number of slots and query performance.

High-level Architecture

BigQuery is built on top of Dremel technology, which has been in production internally at Google since 2006. Dremel is Google's interactive ad-hoc query system for analysis of read-only nested data. The original Dremel paper was published in 2010, and at the time of publication Google was running multiple instances of Dremel ranging from tens to thousands of nodes.

10,000 foot view

BigQuery and Dremel share the same underlying architecture. By incorporating the columnar storage and tree architecture of Dremel, BigQuery offers unprecedented performance. But BigQuery is much more than Dremel. Dremel is just the execution engine for BigQuery. In fact, the BigQuery service leverages Google's innovative technologies like Borg, Colossus, Capacitor, and Jupiter. A BigQuery client (typically the BigQuery Web UI, the bq command-line tool, or the REST APIs) interacts with the Dremel engine via a client interface. Borg - Google's large-scale cluster management system - allocates the compute capacity for the Dremel jobs. Dremel jobs read data from Google's Colossus file system using the Jupiter network, perform various SQL operations, and return results to the client. Dremel implements a multi-level serving tree to execute queries.
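To make the idea of a multi-level serving tree more concrete, here is a toy sketch in Python. It is not Dremel's implementation: the class names `Leaf` and `Mixer`, the `clicks` column, and the sample shards are all hypothetical. It only illustrates the general pattern in which leaf servers compute partial results over their own shard and intermediate servers combine them on the way back to the root.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Row = Dict[str, int]


@dataclass
class Leaf:
    """A leaf server scans only its own shard and returns a partial sum."""
    rows: List[Row]

    def execute(self, column: str) -> int:
        return sum(row[column] for row in self.rows)


@dataclass
class Mixer:
    """An intermediate or root server fans the query out and merges partial sums."""
    children: list = field(default_factory=list)

    def execute(self, column: str) -> int:
        return sum(child.execute(column) for child in self.children)


def build_example_tree() -> Mixer:
    """Build a small two-level tree over three hypothetical shards."""
    shard_a = Leaf([{"clicks": 1}, {"clicks": 2}])
    shard_b = Leaf([{"clicks": 3}])
    shard_c = Leaf([{"clicks": 4}, {"clicks": 5}])
    return Mixer([Mixer([shard_a, shard_b]), Mixer([shard_c])])


if __name__ == "__main__":
    # Roughly equivalent to SELECT SUM(clicks) over all shards.
    print(build_example_tree().execute("clicks"))  # prints 15
```

The point of the tree shape is that no single node has to see all the rows: each level only handles the already-reduced output of the level below it.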

It is important to note that the BigQuery architecture separates the concepts of storage (Colossus) and compute (Borg) and allows them to scale independently - a key requirement for an elastic data warehouse. This makes BigQuery more economical and scalable compared to its counterparts.

BigQuery Storage

The most expensive part of any Big Data analytics platform is almost always disk I/O. BigQuery stores data in a columnar format known as Capacitor. As you may expect, each field (i.e., column) of a BigQuery table is stored in a separate Capacitor file, which enables BigQuery to achieve a very high compression ratio and scan throughput. In 2016, Capacitor replaced ColumnIO - the previous generation of optimized columnar storage format. Unlike ColumnIO, Capacitor enables BigQuery to operate directly on compressed data, without decompressing it on the fly.
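The internal layout of Capacitor is not described here, so the following is only a generic illustration of why a columnar, encoded layout helps. It uses plain dictionary encoding (an assumption chosen for simplicity, not Capacitor's actual scheme) to show how a filter can be evaluated against compact integer codes instead of the original values. The function names and sample column are hypothetical.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Encode one column as (distinct values, one integer code per row)."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def count_equal(dictionary: List[str], codes: List[int], target: str) -> int:
    """Count matching rows by comparing integer codes, without rebuilding the strings."""
    if target not in dictionary:
        return 0  # the value never occurs, so no row needs to be inspected
    target_code = dictionary.index(target)
    return sum(1 for code in codes if code == target_code)


if __name__ == "__main__":
    country = ["US", "UK", "US", "DE", "US", "UK"]  # a single column, stored on its own
    dictionary, codes = dictionary_encode(country)
    print(dictionary)                            # prints ['US', 'UK', 'DE']
    print(codes)                                 # prints [0, 1, 0, 2, 0, 1]
    print(count_equal(dictionary, codes, "US"))  # prints 3
```

Storing each column separately means values of the same type and similar distribution sit together, which is what makes encodings like this effective.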

You can import your data into BigQuery storage via batch loads or streaming. During the import process, BigQuery encodes every column separately into Capacitor format. Once all column data is encoded, it is written back to Colossus. During encoding, various statistics about the data are collected, which are later used for query planning.
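Which statistics BigQuery collects is not specified above. As a hedged illustration of how encoding-time statistics can help a planner, the sketch below assumes simple per-block minimum and maximum values and uses them to skip blocks that cannot match a range filter. The names `BlockStats`, `collect_stats`, and `blocks_to_scan` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockStats:
    minimum: int
    maximum: int
    row_count: int


def collect_stats(blocks: List[List[int]]) -> List[BlockStats]:
    """Gather min, max, and row count per block; blocks are assumed to be non-empty."""
    return [BlockStats(min(block), max(block), len(block)) for block in blocks]


def blocks_to_scan(stats: List[BlockStats], lower: int, upper: int) -> List[int]:
    """Return indices of blocks whose [min, max] range overlaps [lower, upper]."""
    return [
        i for i, s in enumerate(stats)
        if s.maximum >= lower and s.minimum <= upper
    ]


if __name__ == "__main__":
    blocks = [[1, 5, 9], [10, 14, 19], [20, 25, 30]]
    stats = collect_stats(blocks)
    # A filter such as "value BETWEEN 12 AND 22" can skip the first block entirely.
    print(blocks_to_scan(stats, 12, 22))  # prints [1, 2]
```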

BigQuery leverages Capacitor to store data in Colossus. Colossus is Google's latest-generation distributed file system and the successor to GFS (Google File System). Colossus handles cluster-wide replication, recovery, and distributed management. It provides client-driven replication and encoding. When writing data to Colossus, BigQuery makes decisions about the initial sharding strategy, which then evolves based on query and access patterns. Once data is written, BigQuery initiates geo-replication of the data across different data centers to enable the highest availability.

In a nutshell, Capacitor and Colossus are key ingredients of the industry-leading performance characteristics offered by BigQuery. Colossus allows the data to be split into multiple partitions to enable fast parallel reads, whereas Capacitor reduces the required scan throughput. Together they make it possible to process a terabyte of data per second.
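The benefit of splitting data into partitions can be sketched with a small thread pool. This is a toy model, not how Colossus or Dremel schedule work: the helper names `split`, `scan_partition`, and `parallel_sum`, and the in-memory lists standing in for partitions, are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split(values: List[int], parts: int) -> List[List[int]]:
    """Split values into at most `parts` contiguous partitions of similar size."""
    if not values:
        return []
    size = -(-len(values) // parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def scan_partition(partition: List[int]) -> int:
    """Stand-in for reading and aggregating a single partition."""
    return sum(partition)


def parallel_sum(partitions: List[List[int]], max_workers: int = 4) -> int:
    """Scan every partition concurrently and combine the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(scan_partition, partitions))


if __name__ == "__main__":
    partitions = split(list(range(1, 101)), 4)
    print(len(partitions))           # prints 4
    print(parallel_sum(partitions))  # prints 5050
```

In a real system each partition would live on different disks or machines, so the scans overlap in time; here the threads only illustrate the structure of the computation.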

Native vs. External

So far we have discussed the storage for the native BigQuery table. BigQuery can also perform queries against external data sources without the need to import data into the native BigQuery tables. Currently, BigQuery can perform direct queries against Google Cloud Bigtable, Google Cloud Storage, and Google Drive.

When using an external data source (aka federated source), BigQuery performs on-the-fly loading of data into the Dremel engine. Generally speaking, queries running against external data sources are slower than queries against native BigQuery tables, and performance also depends on the external storage type. For instance, queries against Google Cloud Storage will perform better than queries against Google Drive. If performance is a concern, you should always import the data into a BigQuery table before running the queries.

Compute

BigQuery takes advantage of Borg for data processing. Borg simultaneously runs thousands of Dremel jobs across one or more clusters made up of tens of thousands of machines. In addition to assigning compute capacity for Dremel jobs, Borg handles fault-tolerance.

To optimize the performance, consider the following best practices for Google Compute Engine:

1. Ensure that the Secure Agent hosted on the Google Compute Engine virtual machine is in the same region as the Google BigQuery dataset and the Google Cloud Storage bucket.

2. Choose the Google BigQuery dataset in the region where the Google Compute Engine virtual machine is located.

3. Choose the correct Google Compute Engine virtual machine instance type based on your requirements.

BigQuery Network

Apart from disk I/O, big data workloads are often rate-limited by network throughput. Due to the separation between the compute and storage layers, BigQuery requires an ultra-fast network that can deliver terabytes of data in seconds directly from storage into compute for running Dremel jobs. Google's Jupiter network enables the BigQuery service to utilize 1 Petabit/sec of total bisection bandwidth.
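To put the 1 Petabit/sec figure in perspective, the unit conversion can be worked out in a few lines. This is only back-of-the-envelope arithmetic using decimal (SI) prefixes; bisection bandwidth is an aggregate figure for the whole network, so the result is a theoretical upper bound and not what a single query would observe. The function names are hypothetical.

```python
PETABIT = 10 ** 15   # bits, decimal (SI) prefix
TERABYTE = 10 ** 12  # bytes, decimal (SI) prefix


def terabytes_per_second(bandwidth_bits_per_sec: float) -> float:
    """Convert a bandwidth in bits per second to terabytes per second."""
    return bandwidth_bits_per_sec / 8 / TERABYTE


def seconds_to_transfer(num_bytes: float, bandwidth_bits_per_sec: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and contention."""
    return num_bytes * 8 / bandwidth_bits_per_sec


if __name__ == "__main__":
    print(terabytes_per_second(PETABIT))           # prints 125.0
    print(seconds_to_transfer(TERABYTE, PETABIT))  # prints 0.008
```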
