Boy Howdy Those Deep Fried Bytes Are Yummy

Long time readers of my blog or Twitter posts will know I am a big fan of podcasts. There’s a new one worth taking a listen to:

Deep Fried Bytes

Deep Fried Bytes is a new podcast hosted by Mississippi MVP Keith Elder and Chris “Woody” Woodruff. I listened to their inaugural episode on the way to the office this morning and quite enjoyed it. While they will cover all aspects of technology, they will have a heavy focus on .Net development.

The audio quality was superb, it may have been a first episode but their production quality and format made it sound like they’d been podcasting for years. I’ve already added the show to my Zune as a subscription, and recommended it to the Zune Marketplace. I’m eagerly looking forward to the next episode!

Creating and Customizing Noise Words in SQL Server 2005 Full Text Search

SQL Server 2005 Full Text Search is a powerful feature. It will allow you to search for words in your text fields and get results back in a timely manner. However, certain words are so common they get in the way of effective searching. Having these words indexed would be worthless in terms of effective searching. Words like a, an, the, or, and so on. SQL Server 2005 refers to these words as “noise words”.

The list of noise words that ships with SQL Server 2005 is fine for most users. However, there are times when it can be an advantage to add words to the list. For example, let’s say your company has a rule that it’s name must appear in the header of all internal Word documents. Further, let’s say that all of those Word documents are stored in a varbinary(max) field and full text indexed. Your company name would effectively become a noise word, searching for it would return every record in your table and thus be useless in terms of results. You would then want to add your company name to the list of noise words for SQL Server 2005 to ignore.

Or, let’s take the opposite example. There was a word in the default SQL Server 2005 noise word list that, when in all caps, was an acronym for a piece of equipment in our plants. Thus it was getting omitted from searches. We had to remove that word from the noise word list.

So, how do we change the list of noise words in SQL Server 2005? First we have to track it down. Open up regedit and navigate to this spot in the tree.

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSearch -> Language -> [insert your language abbreviation here]

My instance name is MSSQL.1. Pick your language, since I live in the US, my language will be enu, short for English, US. Don’t confuse it with eng, which is for our friends in Great Britain. Now look for the name NoiseFile. The value for mine is:

C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseenu.txt.

Use any text editor to open and look at the file, it’s a simple text file. Mine only had 129 words in it. Now simply add or remove the words you want (or don’t want) in the noise word list. When done, save the file. If you are using SQL Server 2005 Developer Edition on Vista, make sure to open the text file with admin privileges so you’ll be able to save your changes.

Now for the bad news. (Hey, you didn’t think it was going to be that easy did you?) In order for the noise word file to take effect you must repopulate the full text indexes. You can do the entire catalog at once, or one table at a time. To do the entire catalog, use the alter full text catalog command. For more info on this command, see my post on Getting Started with SQL Server 2005 Full Text Searching: Part 1 – The Catalog.

To change things on a table by table basis, you can use the alter full text index command. Again, rather than repeating myself I’ll refer you back to my post Getting Started with SQL Server 2005 Full Text Searching: Part 2 – The Indexes. Failure to rebuild your indexes will result in bad results. As new items are indexed, or old updated, the new noise word list will apply as SQL Server 2005 applies the full text index to the record. If old records are not updated, you’ll have records that should (or shouldn’t) be in the results and some unhappy users.

Bear in mind one important fact, changes to the noise word file apply to the entire SQL Server, and all the databases running on it. Thus if you include or exclude noise words to support one app, you are affecting every app that touches every full text indexed database on your server.

Finally, if you read my post yesterday on Thesaurus Entries in SQL Server 2005 and 2008, you may wonder why no mention of 2008 and noise words? In SQL Server 2008 noise words have been replaced with a concept called stopwords. I’ll be talking about stopwords in a post in the very near future.

Creating Custom Thesaurus Entries in SQL Server 2005 and 2008 Full Text Search

SQL Server Full Text Search has the ability to search not just for a word, but for various forms of a word based upon the built in thesaurus. To use this ability in your own full text searches, you simply use the FormsOf clause. Here’s an example using the AdventureWorks database, with a full text index on the product description table’s Description field.

select [Name], ProductNumber, [Description]

from [Production].[Product] p

   , [Production].[ProductDescription] pd

   , [Production].[ProductModelProductDescriptionCulture] pmpdc

where p.ProductModelID = pmpdc.ProductModelID

  and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

  and CONTAINS(pd.[Description], ‘FORMSOF(Thesaurus, light)’ )

One of the biggest questions I get asked is, how do we customize the Thesaurus file used by SQL Server? Today we’ll look at how to do just that.

The first step is tracking it down. SQL Server stores the location of the thesaurus in the registry. The location depends on whether you are in SQL Server 2005 or 2008. For SQL Server 2005, open up regedit or your favorite editor and navigate to:

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSearch -> Language -> [insert your language abbreviation here]

In my case, my instance was named MSSQL.1. I live in the US, so my language is “enu”, which stands for English, US. Don’t get this confused with “eng”, that is for our friends in the United Kingdom. (Unless of course that’s where you live in which case do indeed pick eng). There are around 17 supported languages out of the box.

OK, once you’ve navigated to the right spot in the registry tree, look for the key named TsaurusFile. Mine is listed as being located at:

C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\tsenu.xml

In SQL Server 2008 you have to go to two places. First you need to find the path. For that, again open up regedit and navigate to:

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSQLServer

For me, the instance name was MSSQL10.MSSQLSERVER. The name you should look for is “FullTextDefaultPath”. The structure was very similar to the 2005 version:

C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTData\

Now to find the file name. Jump up the registry tree a little to:

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSearch -> Language -> [insert your language abbreviation here]

Just like with 2005, use ENU if you are in the US, or your appropriate language abbreviation. Looking at the name “TsaurusFile”, it probably comes as no surprise to discover the name is “tsenu.xml”.

Now that we’ve located it, let’s navigate to the folder and open it up in your favorite text editor (which if you’ve been reading my past blog posts is ANYTHING other than Notepad). Let’s take a look:

<XML ID=”Microsoft Search Thesaurus”>

<!–  Commented out

    <thesaurus xmlns=”x-schema:tsSchema.xml”>

    <diacritics_sensitive>0</diacritics_sensitive>

        <expansion>

            <sub>Internet Explorer</sub>

            <sub>IE</sub>

            <sub>IE5</sub>

        </expansion>

        <replacement>

            <pat>NT5</pat>

            <pat>W2K</pat>

            <sub>Windows 2000</sub>

        </replacement>

        <expansion>

            <sub>run</sub>

            <sub>jog</sub>

        </expansion>

    </thesaurus>

–>

</XML>

Note there is one minor difference between the 2005 version, shown above, and the 2008 version. In 2008, the <diacritics_sensitive> line is changed to:

       <diacritics = false/>

In either case, you won’t need to adjust this line. One other note, if you are editing this in Vista, be sure to launch your text editor in Run As Administrator mode so you can save the changes.

The first thing to notice is the file is commented out. We’ll want to uncomment as the first step. Next you’ll see the <thesaurus…> section. We won’t need to change it.

The next three sections are examples. Ultimately you’ll delete and replace with your own, but let’s take a moment to look at what is there. The first section is an <expansion>…</expansion> tag. With an expansion, all terms are equivalent. With the expansion tags, if the user enters any one of those terms, it’s the same as if they’d entered all of the terms. Thus in the first example, if a user were to type in “Internet Explorer”, a full text search would return all records that contained “Internet Explorer”, “IE”, or “IE5”.

Replacements are something you’ll use less often. With Replacements, SQL Server does not look for the word in the <pat> (pattern) tag, instead it looks for the word in the <sub> tag. In this case if a user types in W2K, the full text search engine will instead look for “Windows 2000”. The only true life situation I can think of where this would be useful is addresses. If you know your system converts all street or state abbreviations to their full expanded name, then you could use this as a trap of sort with full text searching. For example:

<replacement>

  <pat>St.</pat>

  <pat>Str.</pat>

  <sub>Street</sub>

</replacement>

Thus a user typing in Elm St. would be able to find Elm Street in your system. You could also expand this with states. Let me reiterate this example assumes your source system automatically replaces all abbreviations with their full words. Using a replacement saves the full text search engine for looking for words that won’t be there. Of the two, expansions will probably be the one you use most of the time.

So let’s create a simple example to test out our custom thesaurus. Open up your tsenu.xml file, and let’s change it to:

<XML ID=”Microsoft Search Thesaurus”>

    <thesaurus xmlns=”x-schema:tsSchema.xml”>

            <diacritics_sensitive>0</diacritics_sensitive>

        <expansion>

            <sub>light</sub>

            <sub>TheDoctor</sub>

        </expansion>

    </thesaurus>

</XML>

(Again, in 2008 the diacritics line will be slightly different, just use what 2008 put out there for you.) Here I’m going to make the word light and the word “TheDoctor” substitutable for each other. (I’m listening to the Doctor Who soundtrack as I type this, in case you were wondering. I wanted a word I knew I wouldn’t find in the Adventure Works database.) Save the changes.

Unfortunately just saving the changes is not enough to make the full text search engine pick up the changes to our thesaurus file. We have to tell SQL Server the file has been updated. The procedure differs greatly, depending on whether you run SQL Server 2005 or 2008.

In SQL Server 2005, you unfortunately must restart the entire full text search service. In a production environment, you never want to do this when the engine is busy or in use. I have a recent blog post that talks about the SQL Server Full Text Catalog Property function. In it, I have this handy query for you to be able to tell what the full text service is doing.

select case FullTextCatalogProperty(‘AdvWorksCatalog’, ‘PopulateStatus’)

          when 0 then ‘Idle’

          when 1 then ‘Full population in progress’

          when 2 then ‘Paused’

          when 3 then ‘Throttled’

          when 4 then ‘Recovering’

          when 5 then ‘Shutdown’

          when 6 then ‘Incremental population in progress’

          when 7 then ‘Building index’

          when 8 then ‘Disk is full. Paused.’

          when 9 then ‘Change tracking’

          else ‘Error reading FullTextCatalogProperty PopulateStatus’

        end

As long as it returns “Idle” you are clear to reset. Even so, I would make sure to do this during a database outage, especially with a production database. You’ve been warned!

To reset, go to Start, Control Panel, Administrative Tools, Services. Scroll down to SQL Server Full Text Search (MSSQLSERVER), right click and Restart. That should get it for SQL Server 2005.

In SQL Server 2008 it’s a good deal simpler. Make sure you are in the right database for your catalog (in this case AdventureWorks) and issue the command:

exec sys.sp_fulltext_load_thesaurus_file 1033;

go

The 1033 on the end refers to the local identifier (LCID) for the language of your thesaurus file. 1033 is for English, US. To discover your LCID, use this simple query:

select [name], alias, lcid from master.sys.syslanguages

OK, we’re at the finish line. Assuming your full text engine has restarted, it should have picked up your new, customized thesaurus file. Now we can alter our original query to:

select [Name], ProductNumber, [Description]

from [Production].[Product] p

   , [Production].[ProductDescription] pd

   , [Production].[ProductModelProductDescriptionCulture] pmpdc

where p.ProductModelID = pmpdc.ProductModelID

  and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

  and CONTAINS(pd.[Description], ‘FORMSOF(Thesaurus, TheDoctor)’ )

And we should get back the same results as the original query.

Please note that all remarks for SQL Server 2008 are based upon the February 2008 CTP 6 release. As future release candidates and final editions are released, some of the information could change. Be sure to check books on line for the latest info.

A final thing to note, if you read yesterday’s post you will note that the FullText keyword uses the thesaurus as part of it’s operation. If you have a custom thesaurus, it’s entries will also be used by FullText.

As you can see, adding custom entries to the SQL Server Full Text Search Thesaurus is not hard, there’s just some steps you need to follow in order to make it happen. Once you know how, you can use the functionality to make searches for your users more productive.

Using FormsOf in SQL Server Full Text Searching

In the past I’ve talked about some advanced text searching techniques for SQL Server Full Text Searching. Another you can take advantage of is FormsOf. Let’s say we’ve setup a full text search index on the Adventure Works ProductDescription tables’ Description field. (For more info on how to do this, go to the Arcane Lessons page of this site and scroll down to the “Getting Started With SQL Server 2005 Full Text Searching” section.)

FormsOf has two ways to use it, the Inflectional mode and the Thesaurus mode. Let’s look at an example, and then I’ll explain the differences. For this example, we’ll combine data from several tables to get back some meaningful information from the AdventureWorks database.

– Example 1 – FORMSOF INFLECTION

 

select [Name], ProductNumber, [Description]

  from [Production].[Product] p

    , [Production].[ProductDescription] pd

    , [Production].[ProductModelProductDescriptionCulture] pmpdc

 where p.ProductModelID = pmpdc.ProductModelID

   and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

   and CONTAINS(pd.[Description], ‘FORMSOF(Inflectional, light)’ )

 

– Example 2 – FORMSOF THESAURUS

 

select [Name], ProductNumber, [Description]

  from [Production].[Product] p

    , [Production].[ProductDescription] pd

    , [Production].[ProductModelProductDescriptionCulture] pmpdc

 where p.ProductModelID = pmpdc.ProductModelID

   and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

   and CONTAINS(pd.[Description], ‘FORMSOF(Thesaurus, light)’ )

Both examples use the same, I admit, rather bizarre syntax. Inside the string you pass to the CONTAINS clause, you put FORMSOF, with the word Inflectional or Thesaurus, a comma, then the word or phrase you want to search form. In this example I use Contains, but FormsOf also works with FreeText as well.

So what is the difference between Inflectional and Thesaurus? Inflectional finds all of the tenses of a word. For example, if you passed in Start, Inflectional will find Start, Started, and Starting. For nouns, Inflectional finds the single, plural, and possessive forms.

Thesaurus works like you would expect a thesaurus to. It will find variations of a word, Start, Begin, etc. Essentially, words that have the same meaning.

How do these relate to Contains and FreeText? Normally Contains looks for an exact match. FullText matches the meaning, but not the exact words in the query. FullText is very similar to Thesaurus, and even uses the Thesaurus in its work. FullText will break out the search string into its individual words, if there is more than one word. For each word, it generates the inflectional forms of the word, then identifies a list of matches for each word based on the thesaurus. FormsOf(Thesaurus… on the other hand just uses the thesaurus to do the search, without going through the inflectional step.

Using a combination of Contains, FreeText, and FormsOf you can give your users some real flexibility, ranging from exact matches to wide open searches.

I’m Speaking At the TechEd ]InBetwen[ SQL Saturday Conference

TechEd is Microsoft’s annual developer conference, the really big one. This year it returns to the Orlando Convention Center, only this year they have decided to split it into two weeks. The first week is for developers, the second week is for the IT Professionals. My manager is generously sending two of us this year, my co-worker will be there the first week, I’ll be attending the second week to focus on the SQL Server information.

The two week split left Microsoft in an odd position, what to do with the convention center over the weekend? In their long standing tradition of working closely with the developer community, Microsoft turned the place over to the Florida user groups. They are hosting the first “]InBetween[“ conference. There is an incredible amount of content being offered, for free: .Net Code Camps, Day of Agile, Day of Silverlight, .Net University, DotNetNuke University, Exam Crams, IT Pro Camps, Office Communication Server, The ToolShed, Train the Trainer, VSTS University, and SQL Saturday and SQL University (on Sunday).

You can find out more, including links and a complete schedule at:

http://www.devfish.net/articles/inbetween/

Of special interest to me though, and the reason for this post is SQL Saturday. Either through divine intervention or a cosmic prank, I will be presenting at the SQL Saturday doing a session on Full Text Searching. You can register and get more info here:

http://www.sqlsaturday.com/eventhome.aspx?eventid=5

And see the full schedule here:

http://www.sqlsaturday.com/schedule.aspx

It’s quite an honor, there are some big names speaking there such as Andy Warren and Brian Knight. I also see that another Birminghamian, Barry Ralston will be speaking.

So if you will be in Orlando for TechEd, plan on staying late or arriving early and attend one of the many InBetween conferences. This promises to be a great event, and best of all it’s FREE. That’s right, you don’t even have to be a TechEd attendee to come, just show up and pick your event. (Of course, the event organizers would appreciate it if you would register!)

Happy Towel Day!

Wanted to wish everyone a Happy Towel Day!

If you are not familiar with the concept, it’s a tribute to the late Douglas Adams, who passed away May 11th, 2001. In addition to being a prolific author and creator of the Hitch Hiker’s Guide to the Galaxy, Adams was a big technology enthusiast. He helped create several video games, and was a big promoter of hypertext.

To Quote the Hitchhiker’s Guide…

A towel, it says, is about the most massively useful thing an interstellar hitch hiker can have. Partly it has great practical value – you can wrap it around you for warmth as you bound across the cold moons of Jaglan Beta; you can lie on it on the brilliant marble-sanded beaches of Santraginus V, inhaling the heady sea vapours; you can sleep under it beneath the stars which shine so redly on the desert world of Kakrafoon; use it to sail a mini raft down the slow heavy river Moth; wet it for use in hand-to-hand-combat; wrap it round your head to ward off noxious fumes or to avoid the gaze of the Ravenous Bugblatter Beast of Traal (a mindboggingly stupid animal, it assumes that if you can’t see it, it can’t see you – daft as a bush, but very, very ravenous); you can wave your towel in emergencies as a distress signal, and of course dry yourself off with it if it still seems to be clean enough.

More importantly, a towel has immense psychological value. For some reason, if a strag (strag: non-hitch hiker) discovers that a hitch hiker has his towel with him, he will automatically assume that he is also in possession of a toothbrush, face flannel, soap, tin of biscuits, flask, compass, map, ball of string, gnat spray, wet weather gear, space suit etc., etc. Furthermore, the strag will then happily lend the hitch hiker any of these or a dozen other items that the hitch hiker might accidentally have “lost”. What the strag will think is that any man who can hitch the length and breadth of the galaxy, rough it, slum it, struggle against terrible odds, win through, and still knows where his towel is is clearly a man to be reckoned with.

I’ve got my towel, how about you?

A Tale of Excellent Customer Service

In this day and age, we hear a lot about bad customer service. Complaints, bad reviews, negative publicity abound on the internet. So for a change, I thought I’d share a tale of excellent customer service.

Last night I tried to install UltraEdit. In my last post I mentioned having bought a new hard disk, and am still getting everything set back up. As part of UltraEdit’s setup, I had to enter my authorization key. For some reason, I couldn’t get the software to accept my key, event though I knew it was valid.

So this morning I e-mailed their customer support address. Less than 15 minutes later I get a response. I figure it’s one of those automated “we got your message, will reply in 24-48 hours” like most companies send. But no! This was an e-mail from a real live person named Renѐe.

The problem was a conflict with Vista SP1’s security, under non-admin mode it wouldn’t let UltraEdit write whatever it needed to, to save my registration info. All I had to do was right click on the uedit32.exe and Run As Admin. That time it took my authorization and saved it correctly.

After getting it fixed I let Renѐe know it had fixed the problem, and even got a “Thank You” e-mail for letting them know. Wow! Within 20 minutes of sending in my support request my problem was fixed and I was a happy camper.

Kudos to IDM Computer Solutions for a great product and even better customer support, and thanks to Renѐe for such a quick and courteous experience.

Follow

Get every new post delivered to your Inbox.

Join 103 other followers