Monday, February 28, 2011

Slowly Changing Dimensions SCD in Dimensional Modeling

Slowly Changing Dimensions

Entities change over time. Customer demographics, product characteristics, classification rules, status of customers etc. lead to changes in the attributes of dimensions. In a transaction system, many a times the change is overwritten and track of change is lost.

For example a source system may have only the latest customer PIN Code, as it is needed to send the marketing and billing statements. However, a data warehouse needs to maintain all the previous PIN Codes as well, because we need to track on how many customers move to new locations over what frequency.

A key benefit for Data Warehouse is to provide historical information, which is typically over-written (and thus lost)in the transaction systems. How to handle slowly changing dimensions in a Dimensional Model is a key determinant to that benefit.
There are three ways to handle the same:
Slowly Changing Dimension method 1 (In short SCD 1)

The way most of the source systems will handle it- Overwrite the attribute value. For example if a customer’s marital status has moved from 'Unmarried' to 'Married', we over-write 'unmarried' to 'Married'. Similarly, if an insurance policy status has moved from 'Lapsed' to 'Re-instated' the new status is over written on the old status. This is obviously done, when we are not analyzing the historical information.
Slowly Changing Dimension Method 2 (in short SCD 2)

This is the true-blue technique to deliver precise historical analysis. This is used, when there is more than one change in the attributes of an entity, and we need to track the date of change of the attribute.

In this method, a new record is added whereby the new record is given a separate identifier as the primary key. We cannot use the production key as the primary key here as it has not changed (Customer ID has remained the same, while the value of its attribute 'marital status' has changed). This new identifier is called the surrogate key.

Apart from adding a new record and providing a new primary (surrogate) key, the validity period for this new record is also added.

For example- You have a dimensional table with customer_ID '110002' with marital status as 'single'. Overtime, customer gets married and also moved to a new location. The customer dimension record will be:
Surrogate Key Customer ID Date Valid Marital Status Date of Birth City
1100021 110002 Sept 23, 2004 Single Jan8, 1982 Palo Alto
1100022 110002 Oct 25, 2005 Married Jan8, 1982 Palo Alto
1100023 110002 Nov 23, 2005 Married Jan8, 1982 San Francisco
Slowly changing dimension method 3 (SCD 3)

This is a mid-way between method 1 and method 2. Here we don’t add an additional record, but add a new field 'old attribute value'. However, this has limitations. This method has to know from the beginning on what attributes will change. This is because a new field/attribute has to be added in the design for every attribute, which can change. Secondly, attribute can change maximum once in the lifetime of the entity OR at least the lifetime of the data warehouse.
Surrogate Key Customer ID Marital Status Date of Birth City Marital Status Old City Old
1100021 110002 Married Jan8, 1982 San Francisco Single Palo Alto

NOTE – The term of 'Slowly changing dimension' is used because of it being a universally acknowledged term. However, the same methods will apply to fast changing dimensions as well.
Surrogate Keys as Primary keys of dimension tables

There is a best practice in dimensional model design to not to use the production primary key as the primary key for the dimension table. This goes against conventional logic, but has a reason.

Data Warehouse has a core need for maintaining historical information and how an entity has moved and changed shape through the passage of time. Typically Source Systems need for this kind of information is quite less. In case of historical tracking in source systems, these systems can have the luxury of using multiple-field primary key (including the key identifier of the entity plus date stamp). For example if an insurance policy is lapsed and after two months it becomes reinstated, one can use the primary key as a combination of Policy number+ date/time +the status in the 'policy history table'. However, Data Warehouse doesn’t recommend the luxury of using multiple field primary key in dimension table.

Therefore, the concept 'surrogate Key' comes into play where the primary key is not the production key, but a key generated by the system. The production key is also used as an attribute within the same dimension table.
The situations/reasons on when a surrogate key is used:

* 'Slowly changing dimensions'
* When the primary key itself is repeated.
* When there is a multiple field primary key. Dimension model typically does not use multiple field primary key to link to the fact table.

Therefore it is always recommended to use surrogate keys. it is difficult to find the organizations, which will not face the situations as highlighted above. If there are, they could as well manage their needs using excel and pivot tables.

source:http://www.executionmih.com/data-warehouse/slowly-changing-dimension-SCD.php

0 comments: