At work I’ve been developing a big suite of packages to convert data from an Oracle system into our Data Warehouse. A lot of the supporting tables are almost a straight pull, except we are of course adding our own primary key, and then setting up a non clustered unique index on what had been the primary key in the old system.
A few tables have been driving us batty though, giving us “duplicate value” errors when trying to insert the rows from our SSIS package. The first thing I did to try and track down the problem was create an error table, and instead of having the package fail have it redirect error rows to my new error table. In case you are wondering, this is going to be a “one time shot” use for the packages, so we chose not to invest a lot of time and effort into error handling. We either want all the rows or none, and we’ll be running the packages manually so we’ll be there to know the results. But I digress.
When I went to look at the error table, it had all the rows from our source system in it. I scratched my head, thinking that can’t be right. A quick search found the answer in the Technet Forums. I needed to go into the OLE DB Destination and set the Max Commit count to 1. Of course you wouldn’t want to leave it like that for production, but for debugging it worked great. Once I did that, I was able to rerun the package and quickly identify my misbehaving row.
Next I looked at the value, and then looked for a similar value in my table. What I found was my source system had two rows, something like this example:
Yes, the only difference was the second row had a lowercase letter at the beginning of the second word. Our Oracle instance had case sensitivity turned on. To it, these were two entirely different values. However, by default SQL Server is case insensitive; to it these two were the same. So my dilemma, how to fix this one column without having to alter my entire database?
It turns out there is an option in the Create Table syntax to set the collation. First, you should find out what your collation is currently set to. This is easy enough, just open SQL Server Management Studio, right click on the database and pick properties. Right there on the front page is the Collation.
Alternatively I could have run this SQL in SSMS (substitute your database name where I have AdventureWorks2008):
select databasepropertyex('AdventureWorks2008', 'collation')
Either way, in this example the default is SQL_Latin1_General_CP1_CI_AS. The important thing to note is the “_CI_”, which indicates case insensitivity. If we wanted to set the entire database, we would issue commands to change this to SQL_Latin1_General_CP1_CS_AS, which stands for case sensitivity. But as I said, in my case I don’t want to affect the entire database, so instead I will use this collation name in the create table syntax. Here is a simple example:
create table TestTable ( BogusPK bigint identity , FieldFromOracle varchar(200) collate SQL_Latin1_General_CP1_CS_AS not null , AnotherField varchar(200) null )
All that I had to do was insert the collate clause between the data type and the not null clause. Note that this only affects the one column I had an issue with. FieldFromOracle is now a case sensitive column, I can add “Arcane Code” and “Arcane code” and still be able to add a unique index. The second column, here named “AnotherField” will remain case insensitive, the behavior you normally expect.
Before I wrap this up, I know someone will point out that allowing primary keys in your system that only differ in case is bad practice. For the record I totally agree, however this is a soon to be legacy system built by a vendor. Additionally, for various reasons I was not allowed to do any data cleansing to the source system. Just pull it like it is and put it in the warehouse. I imagine most of you are like me, that you don’t get to live in the ideal world, so hopefully knowing how to diagnose and deal with collation issues between databases will make your life a little easier.