How Two Necks in Redshift Can Keep You from Losing Your Head in Tableau

by Colin Carson, Sr. Business Intelligence Consultant

Data Warehouse Diagram

Abstract

When a curated data set is needed for reporting, a common design pattern is to create a view in the same database as the warehouse. This view serves as the endpoint or “head” for BI tools and end users. This “one view in the warehouse” pattern provides a layer of abstraction from the underlying warehouse structures (the facts and dimensions in a star schema), but it is not always the best approach because of several weaknesses detailed below. Here we propose a high-level “two necks” design pattern with six parts: a dedicated reporting data mart database, a view in that reporting database, two staging (“neck”) tables, a view on top that serves as the single head or endpoint for users, and an ETL process that swaps the head view between the two neck tables.

Assumptions

  • We assume some or all of the negatives in the Antipattern section are things your business wants to avoid.
  • We assume you want a live connection to the warehouse rather than Tableau data extracts. In our solution, extracts performed slightly faster, but they could not be refreshed as frequently as the ETL that populated our user-facing endpoint view. Our ETL process ran microbatches every 30 seconds from a shell script that called the SQL commands through psql.
  • We assume the cost of adding additional steps to your ETL logic to swap tables is worth the benefits.
  • Our implementation of this pattern used a Redshift warehouse because the transactional source data and ETL were already in AWS, and our dashboard used aggregations over potentially many rows (less than a million to start with, but potentially hundreds of millions). Other databases could be used for warehousing, and some might be more appropriate if high concurrency is required: Redshift supports 500 concurrent user connections, Aurora up to 16,000, and when Tableau Server is used with Redshift the default parallel query limit is 8.
  • We used Tableau for the dashboard, but this pattern can be applied to other frontend presentation tools too.

Antipattern

A view in the warehouse’s database is simple and can be implemented quickly, but it has weaknesses:

  • Tightly coupled: changes for end users have to be made to the view in the warehouse’s database. Because downstream BI tools or single-tenant data marts point at that view, end users and the warehouse are tightly coupled. When significant warehouse design changes occur, there is downtime and users are impacted. Conversely, when design changes are needed on the reporting side, the view in the warehouse’s database must change, because the reporting structures live in the warehouse rather than in their own separate reporting database. Reporting changes cannot be made independently of warehouse changes. For more information about a reporting database, one resource is Martin Fowler’s writeup here.
  • Fully populating large tables takes time, and users are impacted if they have to wait for the load to finish. Users will see zero rows between the time the table is truncated and the time it is fully populated, which can happen frequently in a near real time pipeline (e.g., every 30 seconds). The refresh itself can be lengthy, because the table has to be fully populated during this step and the database may have to do a lot of work. This processing delay for warehouse base tables can be lessened if processing uses a merge/upsert approach, as documented here and sketched after this list, rather than a full destructive reload; nevertheless, users may still have to wait while processing is happening.
  • No backup. There is only a single copy of the data that users access at any time. When the warehouse is processed, no snapshot of the previous state is saved that could be restored if there is a problem. If the ETL job that processes the warehouse structures fails (such as between the time a base table is truncated and fully populated, or because of a bug in merge/upsert logic), there is no other copy of the data in the system and end users are impacted (reporting is inaccessible).
  • Performance hit to the warehouse and slow user queries. Users can negatively impact warehouse performance, and conversely warehouse processing can affect users. This is related to #1 but slightly different: where #1 is about intermittent design changes, this item (#4) is about the performance experienced by end users. Downstream dependencies such as data marts, BI tools, and end users query the warehouse indirectly through the view. There is no middle step between users and the warehouse, and no physical instantiation of the data assembled for reporting needs.
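Below is a minimal sketch of the staging-table merge (upsert) approach mentioned in the second point above. The warehouse schema, fact table (f_sales), and key column (order_id) are hypothetical placeholders; adapt them, and the step that loads the staging table, to your own warehouse.

BEGIN;

-- Stage the new and changed rows (load this table with COPY or INSERT as appropriate)
CREATE TEMP TABLE stage_sales (LIKE warehouse_schema.f_sales);
-- ... load stage_sales here ...

-- Remove target rows that are about to be replaced, then insert the staged rows
DELETE FROM warehouse_schema.f_sales
USING stage_sales
WHERE warehouse_schema.f_sales.order_id = stage_sales.order_id;

INSERT INTO warehouse_schema.f_sales
SELECT * FROM stage_sales;

END;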

Recommended Pattern and how it can be Implemented

The weaknesses above can be mitigated with the two neck design shown in the diagram at the beginning of this article. Here’s how you can implement this solution architecture in Redshift:

 

1. Create a reporting database. We implemented ours as a dedicated reporting schema in Redshift. Example:

CREATE SCHEMA report_schema;
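If your architecture calls for a physically separate reporting database rather than a schema, Redshift can also create one; whether the loader view in step 2 can then query the warehouse across databases depends on your cluster configuration, so treat the statement below as a hedged alternative (report_db is a hypothetical name):

CREATE DATABASE report_db;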

2. Create a reporting view to serve as the integration component between the reporting data mart and the warehouse:

CREATE VIEW report_schema.v_example_dashboard_loader AS /*enter your logic here*/

Our view joined 4 tables (1 fact and 3 dimensions), with outer joins to 2 of the dimension tables. It had 6 calculated fields with aggregations (3 DENSE_RANK functions, 2 SUMs, and 1 MAX) and 24 columns. The view returned 539,000 rows in 2.376 seconds on a single-node dc1.large cluster.
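As a rough illustration only, a loader view of that shape might look something like the following; the warehouse schema, table names, and columns are hypothetical stand-ins for the /*enter your logic here*/ placeholder above, so substitute your own star schema.

CREATE VIEW report_schema.v_example_dashboard_loader AS
SELECT
    d_dt.calendar_date,
    d_cust.customer_name,
    d_prod.product_name,
    SUM(f.sale_amount) AS total_sales,                                   -- aggregate measure
    MAX(f.updated_at) AS last_updated,                                   -- most recent change per group
    DENSE_RANK() OVER (ORDER BY SUM(f.sale_amount) DESC) AS sales_rank   -- ranking calculated over the aggregate
FROM warehouse_schema.f_sales f
JOIN warehouse_schema.d_date d_dt ON f.date_key = d_dt.date_key
LEFT JOIN warehouse_schema.d_customer d_cust ON f.customer_key = d_cust.customer_key  -- outer joins to dimensions that may be missing
LEFT JOIN warehouse_schema.d_product d_prod ON f.product_key = d_prod.product_key
GROUP BY d_dt.calendar_date, d_cust.customer_name, d_prod.product_name;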

3. Create the loader and backup tables. These two tables are the two necks.

CREATE TABLE report_schema.t_example_dashboard_loader
AS
SELECT *
FROM report_schema.v_example_dashboard_loader;
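The CREATE TABLE ... AS statement above does not declare distribution or sort keys explicitly; depending on your query patterns you may want to. A hypothetical variant (calendar_date is an assumed column, not one from our view):

CREATE TABLE report_schema.t_example_dashboard_loader
DISTSTYLE EVEN
SORTKEY (calendar_date)
AS
SELECT *
FROM report_schema.v_example_dashboard_loader;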

4. Create the backup table. Notice that the backup selects from the local primary table, not from the view that hits the warehouse with a query that can have expensive logic:

CREATE TABLE report_schema.t_example_dashboard_loader_backup
AS
SELECT *
FROM report_schema.t_example_dashboard_loader;

5. Create the “head” view that serves as the endpoint for Tableau:

CREATE OR REPLACE VIEW report_schema.v_example_dashboard_userfacing
AS
SELECT *
FROM report_schema.t_example_dashboard_loader;
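If Tableau connects with its own database user, that user also needs access to the reporting schema and the head view. A sketch, assuming a hypothetical user named tableau_user:

GRANT USAGE ON SCHEMA report_schema TO tableau_user;
GRANT SELECT ON report_schema.v_example_dashboard_userfacing TO tableau_user;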

6. In the ETL SQL script, use logic that takes a backup, then swaps the user-facing view to the backup, then reloads the primary, then swaps back to the primary. (Note: merge/upsert approaches can potentially be used here as well.)

-- 1. Refresh the backup neck from the current primary (a local copy, so this is cheap)
TRUNCATE TABLE report_schema.t_example_dashboard_loader_backup;
INSERT INTO report_schema.t_example_dashboard_loader_backup
SELECT * FROM report_schema.t_example_dashboard_loader;
VACUUM report_schema.t_example_dashboard_loader_backup;

-- 2. Swap the head view to the backup neck while the primary is reloaded
CREATE OR REPLACE VIEW report_schema.v_example_dashboard_userfacing
AS
SELECT *
FROM report_schema.t_example_dashboard_loader_backup;

-- 3. Reload the primary neck from the warehouse-facing loader view
TRUNCATE TABLE report_schema.t_example_dashboard_loader;
INSERT INTO report_schema.t_example_dashboard_loader
SELECT * FROM report_schema.v_example_dashboard_loader;
VACUUM report_schema.t_example_dashboard_loader;

-- 4. Swap the head view back to the freshly loaded primary neck
CREATE OR REPLACE VIEW report_schema.v_example_dashboard_userfacing
AS
SELECT * FROM report_schema.t_example_dashboard_loader;

If we define “one neck” as one staging table (no view on top), and “two necks” as two staging/neck tables with one head view on top (which swaps between the neck tables), we can compare processing time.

a. One neck

TRUNCATE TABLE report_schema.t_example_dashboard_loader;
INSERT INTO report_schema.t_example_dashboard_loader
SELECT * FROM report_schema.v_example_dashboard_loader;

Run 1: 7686 ms
Run 2: 6271 ms
Run 3: 6109 ms
Average: 6689 ms (6.689 seconds)

b. Two necks (only the durations of the statements that swap the user-facing view need to be compared, because only these steps affect users).

CREATE OR REPLACE VIEW report_schema.v_example_dashboard_userfacing
AS
SELECT *
FROM report_schema.t_example_dashboard_loader_backup;

CREATE OR REPLACE VIEW report_schema.v_example_dashboard_userfacing
AS
SELECT *
FROM report_schema.t_example_dashboard_loader;

Run 1: 2462 ms
Run 2: 1221 ms
Run 3: 1405 ms
Average: 1696 ms (1.696 seconds)

Note: when we doubled or even quadrupled the processing power by increasing the node size, and when we added another node to the cluster, we did not see clear performance improvements.

Conclusion

In our test results, the “two necks” approach addresses all four antipattern points introduced above: it avoids tight coupling, decreases processing time and user waiting, provides a backup, and speeds up user queries. We share these test results as one example where the pattern worked for us, to challenge the common pattern of a single view in the warehouse, so you can consider the two necks pattern for your own solution architecture.

Regarding processing time, the fastest run of two necks (swapping the view to the staging table not undergoing processing) is more than 5x faster than the fastest run of the old way. On average, the new way is almost 4x faster. We expect the difference to be more significant with larger data sets in the billions of rows, after months or years of warehouse data have accumulated, than in our warehouse, which was new. Our test results and SQL durations come from actual client tests of one view, and they are not necessarily representative of yours.

For further help with using Tableau with Redshift, you can refer to Tableau’s documentation here. Plaster Group’s Data and Analytics Team partners with Tableau and is happy to help you with any additional questions you might have.