Google BigQuery Destination | Stitch Documentation

Heads Up: BigQuery is in Beta
Our BigQuery destination is currently in Open Beta. As such, the info in this article may change.

Google BigQuery is a fully managed, cloud-based big data analytics web service for processing very large read-only data sets. BigQuery was designed for analyzing data on the order of billions of rows, using a SQL-like syntax.

For more information, check out Google’s BigQuery overview.

In this guide, we'll walk you through:

BigQuery's pricing model,
Some high-level limitations (including any incompatible data sources),
How to spin up a BigQuery data warehouse of your own, and
How Stitch loads and organizes data in BigQuery.

Pricing

Unlike many other cloud-based data warehouse solutions, BigQuery’s pricing model is based on usage and not a fixed-rate. This means that your bill can vary over time.

Before fully committing yourself to using BigQuery as your data warehouse, we recommend familiarizing yourself with the BigQuery pricing model and how using Stitch may impact your costs.

Learn more about Stitch & BigQuery pricing

Limitations

Every database has its own supported limits and way of handling data, and BigQuery is no different. The table below provides a very high-level look at what BigQuery supports, including any possible incompatibilities with Stitch’s integration offerings.

Incompatibile Sources	Possible incompatibilities. Learn more.
Reserved Words	BigQuery’s full list; Stitch reserves `_rjm`, `_sdc`, and data type suffixes (`__bigint`)
Table Name Length	1,024 characters	Column Name Length	128 characters
Max # of Columns	10,000	VARCHAR Max Width	None
Nested Structure Support	Native Support	Case	Case Insensitive

Comparing Destinations

Not sure if BigQuery is the data warehouse for you? Check out the Choosing a Stitch Destination guide to compare each of Stitch’s destination offerings.

Setup

While the setup process for BigQuery is pretty straightforward, it isn’t as simple as just entering credentials into Stitch. In addition to completing the authorization process inside Stitch, we also require a user that:

Has access to a Google Cloud Platform project within BigQuery
Has access to a project where billing is enabled
Has Google Cloud Storage privileges

Spin up a BigQuery data warehouse

Schema

After you’ve successfully connected your BigQuery data warehouse to Stitch, you can start adding integrations and replicating data.

For each integration that you add to Stitch, a schema specific to that integration will be created in your data warehouse. This is where all the tables for that inegration will be stored.

Stitch will encounter dozens of scenarios when replicating and loading your data. To learn more about how BigQuery handles these scenarios, check out the Data Loading Guide for BigQuery.

Rejected Records Log

Occasionally, Stitch will encounter data that it can’t load into the data warehouse. For example: a table contains more columns than BigQuery’s allowed limit of 10,000 columns per table.

When this happens, the data will be “rejected” and logged in a table called _sdc_rejected. Every integration schema created by Stitch will include this table as well as the other tables in the integration.

Learn more about the rejected records log

Replication

Stitch replicates data from your sources based on the integration’s Replication Frequency and the Replication Method used by the tables in the integration. In Stitch, you have the ability to control what and how often data is replicated for the majority of integrations.

The current release of Stitch’s BigQuery destination uses Append-Only Incremental Replication.

For SaaS and database tables that use Incremental Replication, Stitch will replicate data into BigQuery in an append-only fashion. This means that data that updates existing an existing row will NOT overwrite the row. Instead, a new row with the new data will be appended to the end of the table.

This means that there can be many different rows in a BigQuery table with the same Primary Key, each representing what the data was at that moment in time.

Click here for an in-depth example

Our example table contains the following rows to start with:

| Name | Age | Type  | Magic?  | _sdc_sequence |
|------+-----+-------+---------+---------------|
| Finn | 14  | Human | False   | 1473773877524 |
| Jake | 6   | Dog   | True    | 1473773877524 |

These rows are replicated during the initial sync of the table. Some time later, new rows are added to the table:

| Name      | Age | Type     | Magic? | _sdc_sequence |
|-----------+-----+----------+--------+---------------|
| Bubblegum | 16  | Princess | True   | 1473776585195 |
| Beamo     | 1   | Game Bot | False  | 1473776585195 |

During the next replication job, the new rows are appended to the table, bringing the total row count to 4:

| Name      | Age | Type     | Magic? | _sdc_sequence |
|-----------+-----+----------+--------+---------------|
| Finn      | 14  | Human    | False  | 1473773877524 |
| Jake      | 6   | Dog      | True   | 1473773877524 |
| Bubblegum | 16  | Princess | True   | 1473776585195 |  // new rows are appended
| Beamo     | 1   | Game Bot | False  | 1473776585195 |

But what happens when an existing row is updated? If Finn's age were updated to 15, the existing row wouldn't be updated - instead, an entirely new row would be appended to the end of the table:

| Name      | Age | Type     | Magic? | _sdc_sequence |
|-----------+-----+----------+--------+---------------|
| Finn      | 14  | Human    | False  | 1473773877524 |
| Jake      | 6   | Dog      | True   | 1473773877524 |
| Bubblegum | 16  | Princess | True   | 1473776585195 |
| Beamo     | 1   | Game Bot | False  | 1473776585195 |
| Finn      | 15  | Human    | False  | 1473827984228 |  // updated data as new row

Querying an append-only table requires a different strategy than you would normally lose. For more info, check out the Querying Append-Only Tables guide.

The time from the sync start to data being loaded into your data warehouse can vary depending on a number of factors, especially for initial historical loads.

To learn more about Stitch’s replication process and how loading time can be affected, check out the Stitch Replication Process guide.

Questions? Feedback?

Did this article help? If you have questions or feedback, please reach out to us.

Tags:

Related	Troubleshooting
Choosing a Destination Destination & Integration Compatibility Loading Data into Your Destination BigQuery & Append-Only Replication Stitch's Impact on BigQuery Costs	Data Warehouse Connection Errors