Category Archives: DW/BI

An Early Christmas from Redmond

They’re here! Today Microsoft released Visual Studio 2008 RTM. If you have an MSDN subscription you can download today and start producing all those new .Net 3.5 applications.

Also released today via the connect.microsoft.com site is CTP 5 of SQL Server 2008. It’s my understanding in this CTP most of the features are working, except for clustering.

I can see I won’t be getting any sleep tonight.

Advertisements

A Developer’s Guide to Installing SQL Server 2005 – Part 2 – Installing SQL Server 2005 Express Edition with Advanced Services SP2

OK, so you read my instructions on my previous post, and downloaded the appropriate bits from http://msdn2.microsoft.com/en-us/express/bb410792.aspx If you are planning to, are have already installed the Developer Edition, you only need to download the “Microsoft SQL Server 2005 Express Edition with Advanced Services” link. If you are only planning on doing the Express Edition, you’ll also want to download the link under it, “Microsoft SQL Server 2005 Express Edition Toolkit”. My instructions below only cover the database, not the toolkit.

OK, you’ve just downloaded the bits and it’s already time for the first “gotcha”. Before installing, you’ll want to make sure you have the IIS components installed on your machine. Odds are if you are an ASP.NET developer you probably already have them, but if not it’s a good idea to check.

In Vista, go to Control Panel, pick Programs and Features, then “Turn Windows Features on or off”. In the dialog that appears, scroll to Internet Information Services and expand the tree. Make sure all the items under “Web Management Tools” are checked. Now go down to “World Wide Web Services”. Under “Application Development Features” make sure .Net Extensibility, ASP.Net, and the ISAPI items are on. In “Common Http Features” check on everything but redirection. Everything under Health and Diagnostics should be on so you can log errors. Finally, under Security all you need on is Request Filtering.

XP is a little simplier. Under the control panel, Go to Add or Remove Programs, Add or Remove Windows Components. In the dialog that pops up, scroll down to Internet Information Services and check it. Click on details, and add FrontPage 2000 Server Extensions to what’s already checked. Then click OK a few times to get the updates installed.

In either situation, you may need your Windows disk handy to install the additional components.

OK, now you’re finally ready to run the setup. Double click on your sqlexpr_adv.exe to see this screen. All you have to do is check on the “I accept..” to accept the license agreement, then click Next.

clip_image001

Now SQL Server will check to see if it has everything it needs. This is the screen where it would yell at you if you didn’t have IIS installed. Since you do, just click Next.

clip_image002

Now we get a nice welcome screen. I’m feeling all warm and fuzzy now! Click next to continue.

clip_image003

Here SQL Server is going to make sure you have everything you need installed. If there is anything missing you should fix it before you proceed. If you get all greens, you’re good to go.

clip_image004

OK, here we need to deviate from the standard “next next next” default model a lot of us are used to. Make absolutely sure you UNCHECK the box you see below, “Hide advanced configuration options” We definitely want to see those advanced options.

clip_image005

Do you have it unchecked yet? If you don’t, you won’t get Reporting Services or Full Text Searching or the other cool features installed. Go ahead and fill out your name and company. Oh, and did I mention the box should be unchecked? If it is, like the screen below, you can click next.

clip_image006

OK, here we are on that advanced options screen. As you can see, all the cool toys are set to not install. Well that’s no fun. Go through each one and set it to install.

clip_image007

You can see below I have all the items checked on. I want to mention one thing. On the box I’m installing this on, I’ve already got the developer edition installed. So I don’t really need the Client Components. I’ve checked them on so I can show you something later. For now, click Next.

clip_image008

OK, here is your chace to give your server a name. Since everyone in the world uses SQLExpress, you should definitely not use it. Pick a name that’s logical, perhaps your computer name_express or computer_sql. Whatever you pick, change it now click Next.

clip_image009

Remember a few screens back when I said I was installing SQL Express on a machine that already had the Developer version installed? This is where it lets me know. The nice thing is the installer is smart enough not to choke. It simply says “Hey, I’ve already got this piece installed so I’ll skip it.” This is handy so you don’t have to think too much about the pieces you’re installing, you can pick everything and if it’s already there the installer will let you know then go on it’s merry way. Speaking of which, it’s time to get on our merry way by clicking Next.

clip_image010

OK, now it wants to know a few things about how to run. First it wants to know what user id to run the service as. The default you see below will work just fine. Next it wants to what pieces it should run at startup. If you are going to be running this once in a blue moon, I would uncheck the SQL Server and Reporting Services options, then run them manually when you need them. On the other hand, if you are going to be in SQL Server nearly every day, like I am, it makes sense to leave these checked so they are started and ready for you. You’ll also want to leave these on if you are going to let other people connect to your box for testing reasons. Once you make your choices, pick Next.

clip_image011

OK, here SQL Server wants to know how you are logging into the server. DO NOT click next. Instead look at the next screen shot.

clip_image012

I don’t like using strictly windows authentication. Instead I much prefer using Mixed Mode. Pick it, then enter a password you can use for the sa account. I prefer this method as it gives me a little flexibility on how to use the database. Make sure you pick a password you’ll remember! Enter it and yes, you guessed it, click next.

clip_image013

Next it’s asking how you want SQL Server to sort text. I always take the default here, and click next. If you have special sorting needs you can change this, but odds are you can click next too.

clip_image014

This next screen is another one of those “Gotchas”.

clip_image015

This is really important to check on, so you can easily administer your own database. Check it on so it looks like the screen below, and hit next.

clip_image016

OK, time for another important question. You can install Reporting Services using the default config, or you can install but not configure now. The idea behind this was to allow you to install but not configure, then copy the configuration from another existing server. This is ideal for the enterprise where we are setting up one server after another and want them all to be identical. However, for our workstation we can simply install using the default configuration, and hit next.

clip_image017

Here Microsoft would like your help. Any data Microsoft can gather about SQL Server will help make it a better product. In some small way you’ll be contributing to the next version. However, I do recognize there are some people or shops that are wary of anonymous statistics gathering. I turned it on for my install, but if you have any doubts leave it off like I show below. Once your decision is made, click Next.

clip_image018

It’s finally time to begin the install. SQL Server shows you your choices one last time, and gives a big “are you sure”. You are, so just hit next.

clip_image019

OK, go get us a couple of cups of coffee, we’ll sit here and wait a bit. This screen will keep us updated on the install process.

clip_image020

As things are completed, they turn green, as you see here. When everything is installed, the Next button will become enabled. Click it to proceed.

clip_image021

This is the last screen of the install, but don’t blow it off casually. It has some good info. Scroll the message window down to the bottom, as you see here, and you’ll find some good links. First, as you see on the screen, is a link to the Books on Line. You should download these and install them, they will become an often referenced companion and you learn SQL Server. Below that (just off the screen) are links to sample databases. If you are installing on a developer workstation and have the space, you should seriously consider installing these. Many, many authors (including myself, so you’ve been warned) will use these as material for their articles and samples. If nothing else it is very informative to see how Microsoft created a database.

clip_image022

That completes our installation of SQL Server Express with Advances Services. But the fun is just beginning…

A Developer’s Guide to Installing SQL Server 2005 – Part 1 – Selecting a version

As a developer of applications that use SQL Server in some way, it can be valuable to have a database local to your box. It can be used for development, testing, or debugging in an off line environment. While there are many versions of SQL Server 2005, there are only two versions that are really suitable for the developer’s computer: SQL Server Express With Advanced Options, and SQL Server Developers Edition.

The first, SQL Server 2005 Express, is free. There are actually two versions of Express, the standard and the one entitled SQL Server 2005 Express Edition with Advanced Services SP2. It can be a little hard to find, so here’s a handy link: http://msdn2.microsoft.com/en-us/express/bb410792.aspx The standard edition does not include Full Text Search, Reporting Services, or the SQL Server Management Studio Express. These are all features that you, as a developer will want.

The other version of SQL Server that’s geared toward developers is the SQL Server 2005 Developer Edition. This version has the same features as the Enterprise Edition, but it’s only licensed for a single developer to access. It also comes with the full blown BIDS (Business Intelligence Developer Studio) tools. It’s not free, however it’s not expensive either. At only 49.99 it’s priced so even a small one person development shop can easily afford it. This link has more info, including a link to purchase:

http://www.microsoft.com/products/info/product.aspx?view=22&pcid=f544888c-2638-48ed-9f0f-d814e8b93ca0&type=ovr

If you have an MSDN License, the SQL Server 2005 Developer Edition is included with it and can be downloaded via your subscription.

So as a developer, which version version should you install? That answer is easy. Both.

Yes, both. The Express edition will allow you to perform small scale testing, let multiple users bang away at your solution and let you perform some small measure of scalability testing. With it’s database size limited to 4 gig, it may nor may not be big enough to hold your entire database, but it’s certainly large enough for a good hunk of your data. The Developer Edition will give you all of the tools and let you emulate your Enterprise system, at least in terms of the database sizes and structures. However since it’s licensed only for the developer, you won’t be able to have multiple users access it.

So you’ve decided OK, you want to install. If you’re not a trained DBA there a few gotcha’s you should know about when installing SQL Server. By default, not all of the features are installed. In the next few posts, I’ll show step by step instructions on how to install SQL Server for your development workstation.

Just The Facts

I’ve spent the last few days covering dimension tables, but I should briefly mention that there are three types of fact tables as well. The first is the kind most will be familiar with, a transactional fact.

These are straightforward, every time an event occurs in your transaction system, a copy of that event is made in the fact table. Someone sells a product, or updates an inventory record, or some similar event occurs it’s reflected in your transaction fact table. Predominately your warehouse will be made up of these types of tables.

Often times your users will want to be quickly able to reference data that’s aggregated over time; sales to date, sales per week, and so forth. To make getting to these totals fast, the second and third types of fact tables are known as snapshot tables. In addition to speeding the return of data to the user, it also reduces the load on the server as totals are aggregated at one time instead of each time the user runs a report.

The first of the snapshots (which is also the second type of fact table) is called a periodic snapshot fact table. At some defined point in time, a job will run that will aggregate the transactional fact data and place it into the periodic snapshot fact table. Perhaps every Saturday night at midnight a job runs which tallies up the previous weeks sales into a sales for the week fact table. Another classic example is the month ending totals after accounting closes out the business month.

The other type of snapshot, and the third type of fact table, is an accumulating snapshot. An accumulating snapshot is constantly being updated, instead of waiting for some point in time. Perhaps you have a trigger on the transaction sales table. Whenever a sale is made, that trigger updates the sales to date table.

Accumulating snapshots have the advantage of always being up to date and accurate. If you have a lot of them, or the transaction tables that they are based on have frequent updates, they can become a major drain on your system resources. For that reason it’s best to limit, to use them with transaction tables that are seldom updated. Let’s take two examples to help illustrate.

In an average car dealership, I’d guess that they might sell four cars a day. Since the transaction volume is so low, four a day, an accumulating snapshot would be entirely reasonable to track sales to date.

On the other hand, let’s look at a large site like woot.com, that deals with hundreds if not thousands of sales in a very short time span. In their case, an accumulating snapshot would be a major drain on system resources. Instead they would be much better served by a periodic snapshot.

A periodic snapshot, by the way, isn’t limited to a once a week or once a month situation, even though that’s how they are primarily used. You could update your periodic snapshots once a day, once an hour, or even once every 10 minutes if your server could support it. The “periodic” simply means that it runs on a regular schedule rather than constantly.

In addition to your transactional fact data, you should review your reporting needs to see if either of the two types of snapshots, periodic or accumulating, could be of use in your warehouse to speed data reporting and reduce the load on your servers.

Slowly Changing Dimensions

Since the facts in your fact tables mark discrete events in time, you would expect them to change pretty often. Your dimensions, on the other hand, tend to be pretty stable. An Employee dimension, for example, wouldn’t change very often. But, it will change. Employees move, get married, croak, or quit.

To handle these changes, data warehouses have adopted the concept of slowly changing dimensions. They are categorized into three types.

Type 1 is the basic type, with it previous versions of the dimension are overwritten with the latest data. Changes are not tracked. This can be OK, based on what the business will be doing with the data.

Take the example of an Employee dimension. Our star employee, Hortence McGillicutty moves, and updates her address. With the warehouse, the only business task done with the address is mailings. In this case, we don’t care what her address was two years ago, we only need her current address for mailing her paycheck and other related data. In this case, loosing the history of her previous address is fine, so a Type 1 dimension is a good choice.

Let’s say, however, that the business uses the Employee dimension much more extensively. Perhaps we have an application that on the left side of the screen displays information pertaining to the most recent financial statements. On the right is a scanned in image where our star Hortence McGillicutty has signed off on these statements.

A few months go by, and our heroine falls in love and gets married. If we were still using a Type 1, the scanned in image still has her maiden name, but the display to the left now shows her new married name, Hortence Hollywogger. Not a big deal you say? How about the first time a Sarbanes-Oxley auditor comes in, finds the discrepancy, and threatens big fines.

To solve this, data warehousing has Type 2 slowly changing dimensions. With Type 2, from date and to date fields are added to the dimensional table. These dates are the valid dates for that record. In our example, once Hortence got married we’d have two rows in the employee dimension:

Key EmpId Name From To
123 555 Hortence McGillicutty 07/17/2001 06/15/2007
123 555 Hortence Hollywogger 06/16/2007 12/31/9999

The warehouse could then look up the correct name based on the date. In the case of our above example, the system would look up based on the date the report was signed, discover it was during the 07/17/2001-06/15/2007 time frame, and display her name correctly, leaving our auditor looking elsewhere for imperfections.

I mentioned there was a third type, which predicibly is named Type 3. With type 3, you don’t add new rows, but new columns to the table to handle changes. As you might guess, this can become quite a maintenance nightmare, and is rarely if ever used. Like Microsoft Bob ( http://en.wikipedia.org/wiki/Microsoft_bob ), Type 3 is probably best forgotten about.

To summarize, use Type 1 when tracking changes to the data is not necessary. Use Type 2 when you need to know when those changes were made. How do you decide?

That’s where your customers become involved. It will be necessary to know what they plan to do with that information. Yes, I know, talking to customers can be a scary experience, but hey just envision them using Microsoft Bob and you’ll be OK.

Conformed Dimensions

Yesterday I talked about the differences in dimensions versus facts. Today I’d like to extend that discussion with the importance of Conformed Dimensions.

One of the major advantages of a data warehouse is the ability to combine data from various, and sometimes vastly different, systems. Let’s take a common problem: your company has three different systems, sales, production, and purchasing. You’ve bought these from three different vendors, so unfortunately the part numbers used throughout the systems are not consistent, but you need to generate some reports showing how part x went into product y and was sold to customer z.

Unfortunately the part numbers are not consistent across the three systems because, as I mentioned, they came from three different vendors. What’s a programmer to do?

This is where conformed dimensions come in handy. In the part dimension table you create a surrogate key. This is the new primary key for a part, which is simply a made up value. Maybe you chose to use a GUID, or perhaps it’s just an auto incrementing integer. Regardless, this is now your new “part number” for all 3 systems once you bring the data in the warehouse.

You would add three more fields to the part dimension table. In addition to the primary key you would have a field “saleskey”, a field “productionkey”, and finally a “purchasingkey”. Then, when bringing your sales data into the warehouse, you look up the saleskey in the dimension table, get the primary key for the part, and place it in the fact sales table.

Repeat with production and purchasing systems. By now you are beginning to get the idea. Because you have conformed the part key across the three fact tables, you can now draw reports using the new part key as a common thread to join the various fact tables together.

This process is known as a conformed dimension. ALL of your dimensions in your warehouse need to be conformed if you want to truly leverage the power of your warehouse. Employees, parts, customers, and locations are just a few examples of dimensions you’d want to conform.

As you can see, having conformed dimensions is key to the success of your warehouse. Failure to conform your dimensions means you loose one of the most powerful features of warehousing, the ability to produce reports across differing systems.

Dimensions versus Facts in Data Warehousing

I’ve made mention before of a large data warehousing project I’ve been involved with, using the SQL Server 2005 tools. Like a lot of developers on a lot of projects, I was thrown in the deep end and had to learn a lot of new technologies in a short order of time. Fortunately I’m a glutton for punishment.

I found a lot of material on the various tools, and quickly became competent in their use. I also quickly got up to speed on star schema versus snow flake schema, and the divvying up of data into facts and dimensions.

The thing that always puzzled me was, how do you decide what goes into facts versus what goes into your dimensions. Having a short breather I decided to do some reading up to make sure my understanding of theory was up to my understanding of the tools.

The best explanation I’ve found is in “The Microsoft Data Warehouse Toolkit” by Joy Mundy and Warren Thornthwaite. http://shrinkster.com/r4n (Good book, highly recommend it, and standard disclaimer I don’t get any kickbacks so buy it where ever you want.) It covers the Kimball method for warehousing. To quote…

It may help to think of dimensions as things or objects. A thing such as a product can exist without ever being involved in a business event.

Ah, grasshopper, enlightenment begins.

As the book describes, a dimension is your noun. It is something than can exist independent of a business event, such as a sale. Products, employees, equipment, are all things that exist. A dimension either does something, or has something done to it.

Employees sell, customers buy. Employees and customers are examples of dimensions, they do.

Products are sold, they are also dimensions as they have something done to them.

Facts, to carry on another concept from the book are the verb. An entry in a fact table marks a discrete event that happens to something from the dimension table. A product sale would be recorded in a fact table. The event of the sale would be noted by what product was sold, which employee sold it, and which customer bought it. Product, Employee, and Customer are all dimensions that describe the event, the sale.

In addition fact tables also typically have some kind of quantitative data. The quantity sold, the price per item, total price, and so on.

Knowing this makes it much clearer in my mind how to start designing my own warehouses. Or at least ready to take the first step.