Spyglass MTG Blog

Dimensional Drift: A short guide on SCD Strategies for Data Architects and Data Engineers

Written by Yassaman Garemani | Aug 21, 2024 2:15:00 PM

Data evolves over time, attributes shift, values transform, and historical context matters. Slowly Changing Dimensions or SCDs are the key to managing these dynamics and a must have toolset for data engineers and architects.

SCDs are attributes in a database whose values change over time. The decision on how to handle and support these changes define how historical data is recorded and how data accuracy and integrity is maintained over time. While SCDs primarily apply to dimension tables, for example to Customer or Product table attributes, fact tables can also be affected by SCD implementation strategies because fact tables can refer to those dimensions. SCDs can also apply to non-quantitative attributes in a fact table, such as the status of an order, that can change from ‘Open’ to ‘Closed’ or ‘Cancelled’ for example.

What are the most common types of SCD?

When value of an attribute changes in a data row, the steps that we follow to support this change determines the type of SCD. SCD is implemented at the column or attribute level, and we may choose to implement different SCD types for different columns in a table. Where appropriate, we may also decide to apply a combination of these methods to a single attribute.

SCD Type 0

This is a scenario where no changes are allowed. For example, you can’t change the date when a customer initially opened an account or made a purchase. You also can’t change an attribute which is part of the primary key.

SCD Type 1

For this type, we overwrite any changes to non-key columns and no history is maintained (referred to as Upsert). An example of an SCD type 1 attribute would be customer’s address where we only care about the current value. If a customer’s address changes, we update it and do not keep any history. An SCD 1 type attribute cannot be a primary key or part of a natural multi-column (complex) primary key sequence. If a primary key attribute is changed, we insert a new row with the new key.

SCD Type 2

In this scenario, a new row is added when an attribute value is updated. If the attribute is a non-key column, we insert a new row with the same primary key as the old row and the updated attribute value. To implement this behavior, we need at least 2 additional columns in our table:

  1. An active_flag or a similar flag to represent the current state of the row. The old row becomes ‘Inactive’ (active_flag = ‘N’) and the newly added row becomes ‘active’ (active_flag = ‘Y’).
  2. An effective_date which is set to the date that the new row becomes active.
  3. Alternatively, we can have a start_date and an end_date When an attribute value is changed and a new row is added, start_date is set to the current timestamp to activate the row and the end_date can be set to a date well into the future. We deactivate the old row by setting end_date to the date that the row was deactivated. It is still a good idea to add the active_flag column to keep accurate state, but it is not necessary.

In this way, we are keeping a history of changes over time. For example, we may want to keep a history of customer address changes to analyze if customers changing zip codes could be contributing to declining sales in stores in the surrounding areas during the past 12 months.

You may also implement a sequence_no or version_no in place of or in conjunction with the active_flag. It allows you to keep sequential history of how a record has been modified over time. The initial sequence_no will be set to ‘0’ and it will increment with the addition of each updated row. The row with the highest sequence number is the most current.

SCD Type 3

With this strategy, we keep record of last ‘n’ changes to an attribute within the row itself by adding additional columns to the table to keep the last ‘n’ updated values. ‘n’ can be any value greater than or equal to 1, but it is commonly implemented as ‘1’ with this type of SCD. For example, we may only want to keep record of customers’ current and last addresses, in which case we would have a curr_addr column and a last_addr column. When an address changes, last_addr is set to ‘curr_addr’ and curr_addr is updated to customer’s new address.

SCD Type 4

SCD type 4 applies when we want to implement SCD type 2 for an attribute whose value changes frequently. Adding a new row for each update can rapidly increase the size of the table. With this strategy, we maintain a history table separate from the main table. We apply the update in place to the current row in the main table and add a new row for the old value to the history table. We are not required to copy the entire row either, you can only write what you need to the history. For example, we want to implement SCD type 4 for order_status column of our Order table. We create a table called Order_status_history with four columns: Order_id, old_status, new_status, update_timestamp. When on order with order_id=’xxxxx’ changes status from ‘Open’ to ‘Closed’, the order_status column of the order row in the main table is changed to ‘Closed’ and a new row is added to the history table as (“xxxx”, “Open”, ”Closed”, “Current timestamp”). Note that the ‘new_status’ column is optional.

In a slightly different variation of this method, we keep adding the historical rows to the main table for a specified period (say 24 hours or one week). We then take a snapshot of the main table at the end of the period, add the history rows to the history table and delete them from the main table. The advantage is that the snapshot process can be run independently by dedicated data pipelines and can be scheduled during off peak hours. Also, each periodic snapshot can be written to a separate table which can make processing and storage more efficient. For example, we can schedule weekly snapshots at the end of each week that write to individual weekly tables. This method is not an official variation but one that was successfully implemented for one of our clients.

SCD Type 6

This method implements a hybrid version of types 2 and 3. For each row, we add a curr_value column and a last_value column, as in SCD type 3, plus an active_flag column and an effective_date column (or a start_data and an end_date), as in SCD type 2. When a change occurs:

  1. A new row is added, last_value and curr_value columns are set to the new attribute value, effective_date is updated to the current timestamp, record is marked as ‘Active’.
  2. The status of the last active row is set to ‘Inactive’, but curr_value is updated to the new current value.
  3. For all other historical rows associated with this attribute, cur_value column is updated to the most current value as well.
Conclusion

Mastering SCD is vital for maintaining data warehouse integrity and historical accuracy. By understanding different SCD types, challenges, and best practices, you can make informed decisions that greatly benefit your data strategy. If you have any questions or would like to discuss further, please contact us today!