BigQuery destinations: This article is not applicable to BigQuery, as it natively supports nested structures. When Stitch encounters nested data from sources like MongoDB, that data is persisted into BigQuery with its structure intact. Because these rows are not denested, there is no additional impact on row counts when replicating these records.
Postgres
ARRAY
& JSON
datatypes: The info in this article is NOT applicable to Postgres ARRAY
and JSON
data types. These data types will be stored as strings
in your data warehouse, whether it’s Postgres, Panoply, or Redshift.To understand how Stitch interprets the data it receives, you need to know a little bit about JSON.
MongoDB and many SaaS integrations use nested structures, which means each attribute (or column) in a table could have its own set of attributes. Stitch is designed to deconstruct these nested structures into separate tables to easily query the data.
In this article, we’ll cover:
- JSON data structures
- How nested structures are deconstructed
- How to connect subtables to top level records
- How nested structures impact row count
JSON Data Structures
When Stitch pulls data from an integration, it’s pulling a series of JSON records. JSON records can contain structures called objects and arrays.
Objects
An object is an unordered set of name and value pairs; each set is called a property. Objects begin with a left curly bracket ( {
) and end with a right curly bracket ( }
).
When Stitch receives an object, the properties in the object are “flattened” into the table and columns are created. Columns created from object properties follow this naming convention: [object_name]__[property_name]
If a table were created for the object in the example above, the schema would look like this:
product_id | name | price | attributes__color | attributes__size | attributes__type | attributes__ounces |
---|---|---|---|---|---|---|
5008798 | Awesome Dino Shirt | 5.99 | blue | large | clothing | 5 |
Arrays
An array is an ordered collection of values. Values are separated by commas and can be a string (contained in double quotes), numbers, boolean, an object, or another array. Arrays begin with a left square bracket ( [
) and end with a right square bracket ( ]
).
Here’s an example:
When Stitch receives a nested array - or an array that’s inside a JSON record - like the one above, it will “denest” it from the parent structure and create a subtable.
Deconstruction of Nested Arrays
To give you a better understanding of how Stitch denests arrays, we’ll walk you through an example using a Shopify order record. In this example, the order record is composed of three parts:
- Core order data
- Line items
- Tax lines
Here’s what the JSON for the Shopify order looks like:
This record contains three levels of data due to the nested arrays. Stitch will denest the arrays from the top level record - in this case, the core order info - and create subtables. From this one order record, three tables will be created:
orders
- This table contains the core order data: order ID, created timestamp, and customer ID.orders__line_items
- This table contains the line item info: product ID, price, and quantity.orders__line_items__tax_lines
- This table contains the tax line info: price, rate, and title.
Connecting Subtables to Top Level Records
When subtables are created, Stitch will append a few columns to be used as composite keys that enable you to connect subrecords back to their parent. Let’s take a look at the schemas for each of the Shopify tables to get a better idea of how this works.
Top Level: Core Order Data
This table contains the order record’s Primary Key, order_id
.
order_id [pk] | created_at | customer |
---|---|---|
1234 | 2015-01-01 00:00:00 | 100 |
Second Level: Line Items
In addition to the attributes in the nested record - in this case, product ID, price, and quantity for line items - Stitch will add these columns to second level tables:
_sdc_source_key_[primary_key]
- This contains the top level record’s Primary Key. In this example, the column would be_sdc_source_key_order_id
.-
_sdc_level_0_id
- This forms part of a composite primary key for this row and can be used to associate further down the line nested rows to this parent. This will auto-increment for each unique record in the table, beginning with 0.For the Shopify example, the first line item record would be 0, the second 1, the third 2, and so on.
We recommend always joining the top level table to the nested table - this will allow you to avoid queries that may have outdated data.
Here’s what the orders__line_items
table would look like if another line item were added to the order record:
_sdc_source_key_order_id | _sdc_level_0_id | product_id | price | quantity |
---|---|---|---|---|
1234 | 0 | 5008798 | 5.99 | 1 |
1234 | 1 | 3445689 | 10.99 | 1 |
If you wanted to return all line items for order number 1234, you’d run the following query:
Third Level: Tax Lines
In addition to the attributes in the nested record - in this case, price, rate, and title for tax lines - Stitch will add these columns to third level tables:
_sdc_source_key_[primary_key]
- This contains the top level record’s Primary Key. In this example, the column would be_sdc_source_key_order_id
._sdc_level_0_id
- This is the foreign key for the second level (orders__line_items
) table. Combined with the source key (_sdc_source_key_order_id), it can be used to find the parent.-
_sdc_level_1_id
- This forms part of a composite primary key for this row and can be used to associate further down the line nested rows to this parentFor the Shopify example, the first tax line record would be 0, the second 1, the third 2, and so on.
Here’s what the orders__line_items__tax_lines
table would look like if we added another tax line were added to the order record:
_sdc_source_key_order_id | _sdc_level_0_id | _sdc_level_1_id | price | rate | title |
---|---|---|---|---|---|
1234 | 0 | 0 | 5.99 | .06 | State Tax |
1234 | 1 | 0 | 10.99 | .06 | State Tax |
If we wanted to return all line items and tax lines for order number 1234, we’d run the following query:
Impact on Total Row Count
Because Stitch is built to denest nested arrays into separate tables, you can expect to see more rows in Stitch and in your data warehouse than what’s in the source itself.
To sum it up: row count in original data source ≠ the row count reflected in the Stitch app or your data warehouse.
Consider the Shopify example. Order 1234 isn’t just a just a single row in the data warehouse. Because Stitch had to denest subrecords and create tables to accommodate these records, you can expect to see more than just one row for order 1234 moving through Stitch.
From the top level record, there’s the row in the orders
table:
order_id [pk] | created_at | customer |
---|---|---|
1234 | 2015-01-01 00:00:00 | 100 |
From the second level record, we have the rows in the orders__line_items
table:
_sdc_source_key_order_id | _sdc_level_0_id | product_id | price | quantity |
---|---|---|---|---|
1234 | 0 | 5008798 | 5.99 | 1 |
1234 | 1 | 3445689 | 10.99 | 1 |
From the third level record, we have the rows in the orders__line_items__tax_lines
table:
_sdc_source_key_order_id | _sdc_level_0_id | _sdc_level_1_id | price | rate | title |
---|---|---|---|---|---|
1234 | 0 | 0 | 5.99 | .06 | State Tax |
1234 | 1 | 0 | 10.99 | .06 | State Tax |
In total, Stitch will count each of these rows (a total of 5) towards your row count. So while there is only one record in the Shopify data source, the Stitch app will show 5 replicated rows and there’ll be 5 rows in your data warehouse.
Reducing Your Row Count
Understanding how Stitch handles nested data structures will in turn give you a deeper understanding of how your data is structured once it gets to your data warehouse. While this knowledge will help comprehending the data’s structure, what about applying it to how many rows you’re using? How can you plan ahead?
- For starters, check out the Reducing Your Row Count guide for detailed tips and gotchas on how to keep your row count in line.
-
Take some time to learn about how the data for your integrations is structured and how it’s replicated.
Check out the SaaS Schemas section for detailed info on Replication Methods and the tables you can expect to see for each integration. Every article links to that service’s API documentation, which is the best way to learn about the data for that integration is structured.
- Set your integrations to replicate less frequently, if feasible.
Related | Troubleshooting |
Questions? Feedback?
Did this article help? If you have questions or feedback, please reach out to us.