Removing Database Integration Columns from Redshift

Important: Columns can only be removed from tables created by database integrations. Columns created by SaaS integrations cannot be removed from a data warehouse at this time.

When you rename or no longer want to sync a column, what happens? For both Full Table and Incremental replication, the old column and all historical data will remain in the table even if there aren’t any new values being replicated.

For some Stitch users, retaining these columns is perfectly fine. If you like to keep things tidy, however, you can easily remove the unwanted columns by recreating your realized tables without those columns.

We recommend using pg_dump for this process, which is similar to altering the SORT and DIST keys on your tables.

Retrieving the Table Definition

In this example, we’ll show you how to remove the unwanted columns using pg_dump from the command line. We marked everything you’ll need to define yourself in square brackets [like this].

Note: If any new data is detected for the deleted column, Stitch will recreate the column in your data warehouse.

First, you’ll grab a full definition of your target table and then create the new table structure, removing the unwanted column(s):

pg_dump --host [yourawshost.redshift.amazonaws.com] --port [your_port] --username [admin_username][database_name] -t '[schema_name.original_target_table_name]'

The above command will return a response similar to the following:

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

SET search_path = schema_name, pg_catalog;

SET default_tablespace = '';

SET default_with_oids = true;

--
-- Name: original_target_table_name; Type: TABLE; Schema: schema_name; 
  Owner: admin_username; Tablespace: 
--

CREATE TABLE original_target_table_name (
     id bigint,
     value character varying(128),
     the_column_being_removed numeric(18,0), 
      _sdc_sequence numeric(18,0),
      _sdc_received_at timestamp without time zone,
      _sdc_batched_at timestamp without time zone,
      _sdc_table_version bigint,
      _sdc_replication_id character varying(128)
);

ALTER TABLE schema_name.original_target_table_name OWNER TO admin;

--
-- Data for Name: original_target_table_name; Type: TABLE DATA; Schema: 
  schema_name; Owner: admin
--
...

Retrieving the Table’s Primary Key

Next, you need to retrieve the table’s Primary Key comment. This will be used in the next step to indicate which column(s) are the table’s Primary Keys.

Run the following query:

SELECT description FROM pg_description WHERE objoid = '[original_target_table_name]'::regclass;             
   // This will retrieve the Primary Key comment: {"primary_keys":["XXXXX"]}

This will be used in the next step to indicate which column(s) are the table’s Primary Keys.

Important: Redshift doesn’t enforce the use of Primary Keys, but Stitch requires them to replicate data. In the following example, you’ll see COMMENT being used to note the table’s Primary Key. Make sure you include the Primary Key comment in the next step, as missing Primary Keys will cause issues with data replication.

Copying Data into the New Table

Next, you’ll SELECT all the historical data from the unwanted column into the new table. When you run this transaction yourself, you’ll need to change everything inside [the square brackets] as well as the following:

The column names in the table. Be sure to add _rjm or _sdc columns into the new table schema.
In the ALTER TABLE OWNER line, you’ll see [stitch_redshift_user]. This is the username of the Redshift user that Stitch uses to connect to your data warehouse. Failing to enter the Stitch username here will prevent Stitch from replicating data for this table.

Here’s the transaction we ran for our example table. Note where we’ve marked the column we want to remove - remove this when running the transaction yourself:

SET search_path to [schema_name];
BEGIN;
 ALTER TABLE [table_name] RENAME TO [old_table_name];
 CREATE TABLE [new_table_name] (
    id bigint,
    value character varying(128),
    column_being_removed (18,0),    // This the undesired column - take it out
    _sdc_sequence numeric(18,0),
    _sdc_received_at timestamp without time zone,
    _sdc_batched_at timestamp without time zone,
    _sdc_table_version bigint,
    _sdc_replication_id character varying(128)
 );
 INSERT INTO [new_table_name]
    (SELECT id, value, _sdc_sequence,     // The schema you want goes here, too
    _sdc_received_at, _sdc_batched_at, _sdc_table_version, 
    _sdc_replication_id
    FROM [old_table_name]);

COMMENT ON table [new_table_name] IS '{"primary_keys":["XXXXX"]}';     // Sets Primary Key comment

ALTER TABLE [new_table_name] RENAME TO [table_name];
ALTER TABLE [table_name] OWNER TO [stitch_redshift_user];     // Grants table ownership to Stitch
DROP TABLE [old_table_name];              // Drops the "old" table with the undesired column
END;

Verifying the Table Owner

When you perform this process yourself, make sure that the Stitch Redshift user retains ownership of the table.

If Stitch isn’t the owner of the table, issues with data replication will arise.

Verifying the New Schema

Verify your changes by using this command to retrieve the table’s schema:

\d+ [schema_name].[table_name]

In our case, the response would look something like this:

Column              |  Data Type
--------------------+-----------------------------+
id                  |  BIGINT
value               |  VARCHAR(128)
_sdc_sequence       |  NUMERIC
_sdc_received_at    |  TIMESTAMP WITHOUT TIMEZONE
_sdc_batched_at     |  TIMESTAMP WITHOUT TIMEZONE
_sdc_table_version  |  BIGINT
_sdc_replication_id |  VARCHAR(128)

Questions? Feedback?

Did this article help? If you have questions or feedback, please reach out to us.

Tags:

Related	Troubleshooting
Choosing a Destination Destination & Integration Compatibility Loading Data into Your Destination Nested Data Structures & Row Count Impact	Redshift Data Loading Errors Redshift View Dependency Errors Redshift VACUUM Errors Data Warehouse Connection Errors