Monday, February 28, 2011

Slowly changing dimensions modeling tips

Modeling slowly changing dimensions

Oracle is only a computer program, and therefore does exactly what it is coded to do (most of the time). Quantum mechanics really do not seem to have any relevance in the Oracle world because it is coded with specific outcomes to every true-or-false question. Logic gates are applied to bits, and the programmed outcome takes place. Non-programmed outcomes result in errors (ORA-600, ORA-7445).

However, things get more complex when we are modeling new environments because we end up dealing with conceptual outcomes with multiple possibilities. This is especially true in dimensional modeling, because dimensions are supposed to hold factual lookup data. How can we record a fact when important data regarding that fact may change?

For example, imagine a tax company that keeps track of deductions found in their various offices and wishes to form a star schema that will allow past and present analysis.

Let CLIENT_DIM record = E. Schrödinger, 3 dependents
Let TIME_DIM record = 2006 tax year
Let LOCATION_DIM record = Virginia Beach, VA

The data will come together in our fact table to show that he had $25,000 in deductions for this combination of dimensional data. This “fact” is currently safe. During the next year, E. Schrödinger and his wife have a lovely baby boy (at least in this universe!).

Let CLIENT_DIM record = E. Schrödinger, 4 dependents
Let TIME_DIM record = 2007 tax year
Let LOCATION_DIM record = Virginia Beach, VA

The data will once again come together in our fact table, this time reporting $28,000 in deductions (I’m not a CPA, don’t get me on tax code) due to the extra dependent. This “fact” is currently safe, just as the last one was.

However, we now have a paradox in our data. In 2007, we report $28,000 in deductions, which was based on the fact that E. Schrödinger had four dependents. In 2006, we reported $25,000 in deductions due to E. Schrödinger’s 3 dependents. When we run our analytic reports for 2007, we will get great results; we will be able to break down the deductions and the number of dependents will play a proper role in these calculations.

But when we run our reports against 2006, the deduction calculations will not compute properly. The CLIENT_DIM record will show 4 dependents, but the deduction amount for the 2006 year will have been based upon 3 dependents. Our dimensional data (CLIENT_DIM record) changed over time.

This is known as a slowly changing dimension (or slowly changing dimension). Though we may not realize it, almost every dimension is in fact slowly changing; stores may move, clients may die (especially if they hang out with sadistic quantum physicists with ready supplies of hydrocyanic acid), and even our human definition of time can change over time (consider the changes to daylight savings this year).

However, we don’t have to worry about all of these possible changes; we only have to worry about the ones pertaining to the facts on which we are attempting to report. Our business needs, in the end, determine which dimensions must be slowly changing.

There are three types of slowly changing dimensions: Type 1, Type 2, and Type 3. Each of these types tries to help the designer of the star schema eliminate paradox from their dimensional model (just as the three interpretations of the Schrödinger’s Cat thought experiment tries to eliminate the paradox of the living dead).

  • Type 1 slowly changing dimension: Overwrite the old value with the new value and call it a day. This is very useful when dealing with issues such as typos on the client’s name. We don’t care about the history in this case because it was incorrect anyways.
  • Type 2 slowly changing dimension: Create a new record in the dimension with a new primary key. In the example we’ve given, there would be two records in CLIENT_DIM for E. Schrödinger, one in which he has 3 dependents and once in which he has 4. Though he is one person from the business point of view, he is two people from a dimensional point of view.
  • Type 3 slowly changing dimension: Overwrite the old value with the new value, and add additional data to the table such as the effective date of the change. This type of slowly changing dimension resolution would be beneficial if there is a change that can happen once and only once (such as death).

These three types of slowly changing dimension resolution usually help in resolving changes to “factual” lookup data. However, we can see clear correlations between these three types of resolutions and the three interpretations of the Schrödinger’s Cat thought experiment!

slowly changing dimension Management Type 1

This clearly matches up with the Copenhagen Interpretation of the Schrödinger’s Cat thought experiment. In the Copenhagen Interpretation, the state of the cat changes and all other states are discarded as the waveform collapses. Criticism to this interpretation applies as well to Type 1 slowly changing dimension resolution. The Copenhagen Interpretation ignores the possibility of reconstruction; in quantum mechanics, it must be possible to return to any original state before measurement was taken place. In our star schema, it also ignores the possibility of reconstruction; we will not be able to return to the original state or even acknowledge that a previous state existed for the purpose of analytics.

slowly changing dimension Management Type 2

This matches with the Many Worlds Interpretation of the Schrödinger’s Cat thought experiment. Instead of completely destroying the other possible waveform, we simply maintain that the two possibilities decohere and form their own universes that will no longer share any correlation. This interpretation shares a similar problem with Type 2 slowly changing dimension resolution. By spawning new records (universes) as the outcome of a changing event, we create multiple possibilities that no longer share any correlation.

For instance, if E. Schrödinger has a new dependent, there will be two resulting rows: one in which has 3 dependents and one in which he has 4. If he then legally changes his last name to Schroedinger, we will have to record that change since it is important for tax records. Because of this, we will have three total records for this one client (E. Schrödinger with 3 dependents, E. Schrödinger with 4 dependents, and E. Schroedinger with 4 dependents). These three records will not have any correlation unless we create some sort of superkey that properly identifies a single person and their many instances. This will be important if we will be doing mining that incorporates multiple times, clients, and locations in our analysis.

slowly changing dimension Management Type 3

This matches with the Many Histories Interpretation of the Schrödinger’s Cat thought experiment. When an observed outcome occurs (like the birth of a new child), the old record is changed to reflect the new “real” outcome; in this case, the addition of a single dependent. However, the change is noted by the addition of a column such as an effective date, to show that this is not the only outcome that has ever existed, but it is THE outcome that does exist at this time.

In effect, the old outcome (3 dependents) still exists, but is discarded, ignored, and irretrievable now that the outcomes are decoherent. In our star schema, this type of resolution will only provide us with a confusing result of “This is the case now, but it was not always so. It changed on ….” For some situations, as with Type 1 resolution, this will suffice, such as the death of a client. We only need to record that death once. However, if any new data enters our model after the death of the client (post-mortem taxes?) or if the status changes again (miraculous recovery!), we may have unreliable report output.

Conclusion

In quantum mechanics, we are calculating multiple events that happen at the same exact point in time (or over a period of unknown time); whereas in a data warehouse we are dealing with history and fully acknowledge the changes time may have on our data. However, one cannot help but notice the correlation between the two forms of paradox and their resolutions.

In fact, in data warehousing the so-called Schrödinger’s Cat paradox becomes even more problematic because we are forced to not only predict future outcomes based on dimensional data, but to also report on past/present information based on the same data. Physicists attempting to provide interpretations of this thought experiment only have to worry about future conditions; the past and present are unmeasured, and therefore have no relevance on the problem except that they are in a quantum state.

They seek to explain the future of the cat once observation/decoherence have taken place. If Schrödinger’s Cat were in a data warehouse, we would have to analyze the entire life of the cat, the cat inside that horrible box of doom, and try to figure out whether the cat will want dry food, wet food, or a good burial place after the experiment is finished.


SOURCE:http://www.dba-oracle.com/t_slowly_changing_dimensions.htm

0 comments: