Nested JSON Data Structures & Row Count Impact

JSON Data Structures
Deconstruction of Nested Arrays
Connecting Subtables to Top Level Records
Impact on Total Row Count
Reducing Your Row Count

Important:
BigQuery destinations: This article is not applicable to BigQuery, as it natively supports nested structures. When Stitch encounters nested data from sources like MongoDB, that data is persisted into BigQuery with its structure intact. Because these rows are not denested, there is no additional impact on row counts when replicating these records.

Postgres ARRAY & JSON datatypes: The info in this article is NOT applicable to Postgres ARRAY and JSON data types. These data types will be stored as strings in your data warehouse, whether it’s Postgres, Panoply, or Redshift.

To understand how Stitch interprets the data it receives, you need to know a little bit about JSON.

MongoDB and many SaaS integrations use nested structures, which means each attribute (or column) in a table could have its own set of attributes. Stitch is designed to deconstruct these nested structures into separate tables to easily query the data.

In this article, we’ll cover:

JSON data structures
How nested structures are deconstructed
How to connect subtables to top level records
How nested structures impact row count

JSON Data Structures

When Stitch pulls data from an integration, it’s pulling a series of JSON records. JSON records can contain structures called objects and arrays.

Objects

An object is an unordered set of name and value pairs; each set is called a property. Objects begin with a left curly bracket ( { ) and end with a right curly bracket ( } ).

{  
   "product_id":"5008798",
   "name":"Awesome Dino Shirt",
   "price":"5.99",
   "attributes":{               // object begins
      "color":"blue",
      "size":"large",
      "type":"clothing",
      "ounces":"5"
   }                           // object ends
}

When Stitch receives an object, the properties in the object are “flattened” into the table and columns are created. Columns created from object properties follow this naming convention: [object_name]__[property_name]

If a table were created for the object in the example above, the schema would look like this:

product_id	name	price	attributes__color	attributes__size	attributes__type	attributes__ounces
5008798	Awesome Dino Shirt	5.99	blue	large	clothing	5

Arrays

An array is an ordered collection of values. Values are separated by commas and can be a string (contained in double quotes), numbers, boolean, an object, or another array. Arrays begin with a left square bracket ( [ ) and end with a right square bracket ( ] ).

Here’s an example:

{
   "order_id":"1234",
   "customer_id":"100",
   "line_items":[            // array begins
      {
         "product_id":"5008798",
         "price":"5.99"
      }
   ]                        // array ends
}

When Stitch receives a nested array - or an array that’s inside a JSON record - like the one above, it will “denest” it from the parent structure and create a subtable.

Deconstruction of Nested Arrays

To give you a better understanding of how Stitch denests arrays, we’ll walk you through an example using a Shopify order record. In this example, the order record is composed of three parts:

Core order data
Line items
Tax lines

Here’s what the JSON for the Shopify order looks like:

{
   "order_id":"1234",
   "created_at":"2015-01-01 00:00:00",
   "customer_id":"100",
   "line_items":[                         // line item record begins
      {
         "product_id":"5008798",
         "price":"5.99",
         "quantity":"1",
         "tax_lines":[                    // tax line record begins
            {
               "price":"5.99",
               "rate":"0.06",
               "title":"State Tax"
            }
         ]                               // tax line record ends
      }
   ]                                    // line item record ends
}

This record contains three levels of data due to the nested arrays. Stitch will denest the arrays from the top level record - in this case, the core order info - and create subtables. From this one order record, three tables will be created:

orders - This table contains the core order data: order ID, created timestamp, and customer ID.
orders__line_items - This table contains the line item info: product ID, price, and quantity.
orders__line_items__tax_lines - This table contains the tax line info: price, rate, and title.

Connecting Subtables to Top Level Records

When subtables are created, Stitch will append a few columns to be used as composite keys that enable you to connect subrecords back to their parent. Let’s take a look at the schemas for each of the Shopify tables to get a better idea of how this works.

Top Level: Core Order Data

This table contains the order record’s Primary Key, order_id.

order_id [pk]	created_at	customer
1234	2015-01-01 00:00:00	100

Second Level: Line Items

In addition to the attributes in the nested record - in this case, product ID, price, and quantity for line items - Stitch will add these columns to second level tables:

_sdc_source_key_[primary_key] - This contains the top level record’s Primary Key. In this example, the column would be _sdc_source_key_order_id.
_sdc_level_0_id - This forms part of a composite primary key for this row and can be used to associate further down the line nested rows to this parent. This will auto-increment for each unique record in the table, beginning with 0.

For the Shopify example, the first line item record would be 0, the second 1, the third 2, and so on.

We recommend always joining the top level table to the nested table - this will allow you to avoid queries that may have outdated data.

Here’s what the orders__line_items table would look like if another line item were added to the order record:

_sdc_source_key_order_id	_sdc_level_0_id	product_id	price	quantity
1234	0	5008798	5.99	1
1234	1	3445689	10.99	1

If you wanted to return all line items for order number 1234, you’d run the following query:

     SELECT *
     FROM orders__line_items li
     WHERE _sdc_source_key_order_id = 1234

Third Level: Tax Lines

In addition to the attributes in the nested record - in this case, price, rate, and title for tax lines - Stitch will add these columns to third level tables:

_sdc_source_key_[primary_key] - This contains the top level record’s Primary Key. In this example, the column would be _sdc_source_key_order_id.
_sdc_level_0_id - This is the foreign key for the second level (orders__line_items) table. Combined with the source key (_sdc_source_key_order_id), it can be used to find the parent.
_sdc_level_1_id - This forms part of a composite primary key for this row and can be used to associate further down the line nested rows to this parent

For the Shopify example, the first tax line record would be 0, the second 1, the third 2, and so on.

Here’s what the orders__line_items__tax_lines table would look like if we added another tax line were added to the order record:

_sdc_source_key_order_id	_sdc_level_0_id	_sdc_level_1_id	price	rate	title
1234	0	0	5.99	.06	State Tax
1234	1	0	10.99	.06	State Tax

If we wanted to return all line items and tax lines for order number 1234, we’d run the following query:

     SELECT *
     FROM orders__line_items li
     INNER JOIN orders__line_items__tax_lines tl
     ON tl.__sdc_level_0_id = li.__sdc_level_0_id
     AND tl.__sdc_source_key_order_id = li.__sdc_source_key_order_id
     WHERE _sdc_source_key_order_id = 1234

Impact on Total Row Count

Because Stitch is built to denest nested arrays into separate tables, you can expect to see more rows in Stitch and in your data warehouse than what’s in the source itself.

To sum it up: row count in original data source ≠ the row count reflected in the Stitch app or your data warehouse.

Consider the Shopify example. Order 1234 isn’t just a just a single row in the data warehouse. Because Stitch had to denest subrecords and create tables to accommodate these records, you can expect to see more than just one row for order 1234 moving through Stitch.

From the top level record, there’s the row in the orders table:

order_id [pk]	created_at	customer
1234	2015-01-01 00:00:00	100

From the second level record, we have the rows in the orders__line_items table:

_sdc_source_key_order_id	_sdc_level_0_id	product_id	price	quantity
1234	0	5008798	5.99	1
1234	1	3445689	10.99	1

From the third level record, we have the rows in the orders__line_items__tax_lines table:

_sdc_source_key_order_id	_sdc_level_0_id	_sdc_level_1_id	price	rate	title
1234	0	0	5.99	.06	State Tax
1234	1	0	10.99	.06	State Tax

In total, Stitch will count each of these rows (a total of 5) towards your row count. So while there is only one record in the Shopify data source, the Stitch app will show 5 replicated rows and there’ll be 5 rows in your data warehouse.

Reducing Your Row Count

Understanding how Stitch handles nested data structures will in turn give you a deeper understanding of how your data is structured once it gets to your data warehouse. While this knowledge will help comprehending the data’s structure, what about applying it to how many rows you’re using? How can you plan ahead?

For starters, check out the Reducing Your Row Count guide for detailed tips and gotchas on how to keep your row count in line.
Take some time to learn about how the data for your integrations is structured and how it’s replicated.

Check out the SaaS Schemas section for detailed info on Replication Methods and the tables you can expect to see for each integration. Every article links to that service’s API documentation, which is the best way to learn about the data for that integration is structured.
Set your integrations to replicate less frequently, if feasible.

Questions? Feedback?

Did this article help? If you have questions or feedback, please reach out to us.

Tags:

Related	Troubleshooting
Loading Data into Your Destination Table Schema Changes Nested Structures & Row Count Impact Querying Append-Only Tables	Data Discrepancy Guide Identifying Rejected Records