Category Archives: Data Warehousing

Slowly Changing Dimensions

Since the facts in your fact tables mark discrete events in time, you would expect them to change pretty often. Your dimensions, on the other hand, tend to be pretty stable. An Employee dimension, for example, wouldn’t change very often. But, it will change. Employees move, get married, croak, or quit.

To handle these changes, data warehouses have adopted the concept of slowly changing dimensions. They are categorized into three types.

Type 1 is the basic type, with it previous versions of the dimension are overwritten with the latest data. Changes are not tracked. This can be OK, based on what the business will be doing with the data.

Take the example of an Employee dimension. Our star employee, Hortence McGillicutty moves, and updates her address. With the warehouse, the only business task done with the address is mailings. In this case, we don’t care what her address was two years ago, we only need her current address for mailing her paycheck and other related data. In this case, loosing the history of her previous address is fine, so a Type 1 dimension is a good choice.

Let’s say, however, that the business uses the Employee dimension much more extensively. Perhaps we have an application that on the left side of the screen displays information pertaining to the most recent financial statements. On the right is a scanned in image where our star Hortence McGillicutty has signed off on these statements.

A few months go by, and our heroine falls in love and gets married. If we were still using a Type 1, the scanned in image still has her maiden name, but the display to the left now shows her new married name, Hortence Hollywogger. Not a big deal you say? How about the first time a Sarbanes-Oxley auditor comes in, finds the discrepancy, and threatens big fines.

To solve this, data warehousing has Type 2 slowly changing dimensions. With Type 2, from date and to date fields are added to the dimensional table. These dates are the valid dates for that record. In our example, once Hortence got married we’d have two rows in the employee dimension:

Key EmpId Name From To
123 555 Hortence McGillicutty 07/17/2001 06/15/2007
123 555 Hortence Hollywogger 06/16/2007 12/31/9999

The warehouse could then look up the correct name based on the date. In the case of our above example, the system would look up based on the date the report was signed, discover it was during the 07/17/2001-06/15/2007 time frame, and display her name correctly, leaving our auditor looking elsewhere for imperfections.

I mentioned there was a third type, which predicibly is named Type 3. With type 3, you don’t add new rows, but new columns to the table to handle changes. As you might guess, this can become quite a maintenance nightmare, and is rarely if ever used. Like Microsoft Bob ( http://en.wikipedia.org/wiki/Microsoft_bob ), Type 3 is probably best forgotten about.

To summarize, use Type 1 when tracking changes to the data is not necessary. Use Type 2 when you need to know when those changes were made. How do you decide?

That’s where your customers become involved. It will be necessary to know what they plan to do with that information. Yes, I know, talking to customers can be a scary experience, but hey just envision them using Microsoft Bob and you’ll be OK.

Advertisements

Conformed Dimensions

Yesterday I talked about the differences in dimensions versus facts. Today I’d like to extend that discussion with the importance of Conformed Dimensions.

One of the major advantages of a data warehouse is the ability to combine data from various, and sometimes vastly different, systems. Let’s take a common problem: your company has three different systems, sales, production, and purchasing. You’ve bought these from three different vendors, so unfortunately the part numbers used throughout the systems are not consistent, but you need to generate some reports showing how part x went into product y and was sold to customer z.

Unfortunately the part numbers are not consistent across the three systems because, as I mentioned, they came from three different vendors. What’s a programmer to do?

This is where conformed dimensions come in handy. In the part dimension table you create a surrogate key. This is the new primary key for a part, which is simply a made up value. Maybe you chose to use a GUID, or perhaps it’s just an auto incrementing integer. Regardless, this is now your new “part number” for all 3 systems once you bring the data in the warehouse.

You would add three more fields to the part dimension table. In addition to the primary key you would have a field “saleskey”, a field “productionkey”, and finally a “purchasingkey”. Then, when bringing your sales data into the warehouse, you look up the saleskey in the dimension table, get the primary key for the part, and place it in the fact sales table.

Repeat with production and purchasing systems. By now you are beginning to get the idea. Because you have conformed the part key across the three fact tables, you can now draw reports using the new part key as a common thread to join the various fact tables together.

This process is known as a conformed dimension. ALL of your dimensions in your warehouse need to be conformed if you want to truly leverage the power of your warehouse. Employees, parts, customers, and locations are just a few examples of dimensions you’d want to conform.

As you can see, having conformed dimensions is key to the success of your warehouse. Failure to conform your dimensions means you loose one of the most powerful features of warehousing, the ability to produce reports across differing systems.

Dimensions versus Facts in Data Warehousing

I’ve made mention before of a large data warehousing project I’ve been involved with, using the SQL Server 2005 tools. Like a lot of developers on a lot of projects, I was thrown in the deep end and had to learn a lot of new technologies in a short order of time. Fortunately I’m a glutton for punishment.

I found a lot of material on the various tools, and quickly became competent in their use. I also quickly got up to speed on star schema versus snow flake schema, and the divvying up of data into facts and dimensions.

The thing that always puzzled me was, how do you decide what goes into facts versus what goes into your dimensions. Having a short breather I decided to do some reading up to make sure my understanding of theory was up to my understanding of the tools.

The best explanation I’ve found is in “The Microsoft Data Warehouse Toolkit” by Joy Mundy and Warren Thornthwaite. http://shrinkster.com/r4n (Good book, highly recommend it, and standard disclaimer I don’t get any kickbacks so buy it where ever you want.) It covers the Kimball method for warehousing. To quote…

It may help to think of dimensions as things or objects. A thing such as a product can exist without ever being involved in a business event.

Ah, grasshopper, enlightenment begins.

As the book describes, a dimension is your noun. It is something than can exist independent of a business event, such as a sale. Products, employees, equipment, are all things that exist. A dimension either does something, or has something done to it.

Employees sell, customers buy. Employees and customers are examples of dimensions, they do.

Products are sold, they are also dimensions as they have something done to them.

Facts, to carry on another concept from the book are the verb. An entry in a fact table marks a discrete event that happens to something from the dimension table. A product sale would be recorded in a fact table. The event of the sale would be noted by what product was sold, which employee sold it, and which customer bought it. Product, Employee, and Customer are all dimensions that describe the event, the sale.

In addition fact tables also typically have some kind of quantitative data. The quantity sold, the price per item, total price, and so on.

Knowing this makes it much clearer in my mind how to start designing my own warehouses. Or at least ready to take the first step.