Creating Custom Thesaurus Entries in SQL Server 2005 and 2008 Full Text Search

SQL Server Full Text Search has the ability to search not just for a word, but for various forms of a word based upon the built in thesaurus. To use this ability in your own full text searches, you simply use the FormsOf clause. Here’s an example using the AdventureWorks database, with a full text index on the product description table’s Description field.

select [Name], ProductNumber, [Description]

from [Production].[Product] p

   , [Production].[ProductDescription] pd

   , [Production].[ProductModelProductDescriptionCulture] pmpdc

where p.ProductModelID = pmpdc.ProductModelID

  and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

  and CONTAINS(pd.[Description], ‘FORMSOF(Thesaurus, light)’ )

One of the biggest questions I get asked is, how do we customize the Thesaurus file used by SQL Server? Today we’ll look at how to do just that.

The first step is tracking it down. SQL Server stores the location of the thesaurus in the registry. The location depends on whether you are in SQL Server 2005 or 2008. For SQL Server 2005, open up regedit or your favorite editor and navigate to:

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSearch -> Language -> [insert your language abbreviation here]

In my case, my instance was named MSSQL.1. I live in the US, so my language is “enu”, which stands for English, US. Don’t get this confused with “eng”, that is for our friends in the United Kingdom. (Unless of course that’s where you live in which case do indeed pick eng). There are around 17 supported languages out of the box.

OK, once you’ve navigated to the right spot in the registry tree, look for the key named TsaurusFile. Mine is listed as being located at:

C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\tsenu.xml

In SQL Server 2008 you have to go to two places. First you need to find the path. For that, again open up regedit and navigate to:

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSQLServer

For me, the instance name was MSSQL10.MSSQLSERVER. The name you should look for is “FullTextDefaultPath”. The structure was very similar to the 2005 version:

C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTData\

Now to find the file name. Jump up the registry tree a little to:

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSearch -> Language -> [insert your language abbreviation here]

Just like with 2005, use ENU if you are in the US, or your appropriate language abbreviation. Looking at the name “TsaurusFile”, it probably comes as no surprise to discover the name is “tsenu.xml”.

Now that we’ve located it, let’s navigate to the folder and open it up in your favorite text editor (which if you’ve been reading my past blog posts is ANYTHING other than Notepad). Let’s take a look:

<XML ID=”Microsoft Search Thesaurus”>

<!–  Commented out

    <thesaurus xmlns=”x-schema:tsSchema.xml”>

    <diacritics_sensitive>0</diacritics_sensitive>

        <expansion>

            <sub>Internet Explorer</sub>

            <sub>IE</sub>

            <sub>IE5</sub>

        </expansion>

        <replacement>

            <pat>NT5</pat>

            <pat>W2K</pat>

            <sub>Windows 2000</sub>

        </replacement>

        <expansion>

            <sub>run</sub>

            <sub>jog</sub>

        </expansion>

    </thesaurus>

–>

</XML>

Note there is one minor difference between the 2005 version, shown above, and the 2008 version. In 2008, the <diacritics_sensitive> line is changed to:

       <diacritics = false/>

In either case, you won’t need to adjust this line. One other note, if you are editing this in Vista, be sure to launch your text editor in Run As Administrator mode so you can save the changes.

The first thing to notice is the file is commented out. We’ll want to uncomment as the first step. Next you’ll see the <thesaurus…> section. We won’t need to change it.

The next three sections are examples. Ultimately you’ll delete and replace with your own, but let’s take a moment to look at what is there. The first section is an <expansion>…</expansion> tag. With an expansion, all terms are equivalent. With the expansion tags, if the user enters any one of those terms, it’s the same as if they’d entered all of the terms. Thus in the first example, if a user were to type in “Internet Explorer”, a full text search would return all records that contained “Internet Explorer”, “IE”, or “IE5”.

Replacements are something you’ll use less often. With Replacements, SQL Server does not look for the word in the <pat> (pattern) tag, instead it looks for the word in the <sub> tag. In this case if a user types in W2K, the full text search engine will instead look for “Windows 2000”. The only true life situation I can think of where this would be useful is addresses. If you know your system converts all street or state abbreviations to their full expanded name, then you could use this as a trap of sort with full text searching. For example:

<replacement>

  <pat>St.</pat>

  <pat>Str.</pat>

  <sub>Street</sub>

</replacement>

Thus a user typing in Elm St. would be able to find Elm Street in your system. You could also expand this with states. Let me reiterate this example assumes your source system automatically replaces all abbreviations with their full words. Using a replacement saves the full text search engine for looking for words that won’t be there. Of the two, expansions will probably be the one you use most of the time.

So let’s create a simple example to test out our custom thesaurus. Open up your tsenu.xml file, and let’s change it to:

<XML ID=”Microsoft Search Thesaurus”>

    <thesaurus xmlns=”x-schema:tsSchema.xml”>

            <diacritics_sensitive>0</diacritics_sensitive>

        <expansion>

            <sub>light</sub>

            <sub>TheDoctor</sub>

        </expansion>

    </thesaurus>

</XML>

(Again, in 2008 the diacritics line will be slightly different, just use what 2008 put out there for you.) Here I’m going to make the word light and the word “TheDoctor” substitutable for each other. (I’m listening to the Doctor Who soundtrack as I type this, in case you were wondering. I wanted a word I knew I wouldn’t find in the Adventure Works database.) Save the changes.

Unfortunately just saving the changes is not enough to make the full text search engine pick up the changes to our thesaurus file. We have to tell SQL Server the file has been updated. The procedure differs greatly, depending on whether you run SQL Server 2005 or 2008.

In SQL Server 2005, you unfortunately must restart the entire full text search service. In a production environment, you never want to do this when the engine is busy or in use. I have a recent blog post that talks about the SQL Server Full Text Catalog Property function. In it, I have this handy query for you to be able to tell what the full text service is doing.

select case FullTextCatalogProperty(‘AdvWorksCatalog’, ‘PopulateStatus’)

          when 0 then ‘Idle’

          when 1 then ‘Full population in progress’

          when 2 then ‘Paused’

          when 3 then ‘Throttled’

          when 4 then ‘Recovering’

          when 5 then ‘Shutdown’

          when 6 then ‘Incremental population in progress’

          when 7 then ‘Building index’

          when 8 then ‘Disk is full. Paused.’

          when 9 then ‘Change tracking’

          else ‘Error reading FullTextCatalogProperty PopulateStatus’

        end

As long as it returns “Idle” you are clear to reset. Even so, I would make sure to do this during a database outage, especially with a production database. You’ve been warned!

To reset, go to Start, Control Panel, Administrative Tools, Services. Scroll down to SQL Server Full Text Search (MSSQLSERVER), right click and Restart. That should get it for SQL Server 2005.

In SQL Server 2008 it’s a good deal simpler. Make sure you are in the right database for your catalog (in this case AdventureWorks) and issue the command:

exec sys.sp_fulltext_load_thesaurus_file 1033;

go

The 1033 on the end refers to the local identifier (LCID) for the language of your thesaurus file. 1033 is for English, US. To discover your LCID, use this simple query:

select [name], alias, lcid from master.sys.syslanguages

OK, we’re at the finish line. Assuming your full text engine has restarted, it should have picked up your new, customized thesaurus file. Now we can alter our original query to:

select [Name], ProductNumber, [Description]

from [Production].[Product] p

   , [Production].[ProductDescription] pd

   , [Production].[ProductModelProductDescriptionCulture] pmpdc

where p.ProductModelID = pmpdc.ProductModelID

  and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

  and CONTAINS(pd.[Description], ‘FORMSOF(Thesaurus, TheDoctor)’ )

And we should get back the same results as the original query.

Please note that all remarks for SQL Server 2008 are based upon the February 2008 CTP 6 release. As future release candidates and final editions are released, some of the information could change. Be sure to check books on line for the latest info.

A final thing to note, if you read yesterday’s post you will note that the FullText keyword uses the thesaurus as part of it’s operation. If you have a custom thesaurus, it’s entries will also be used by FullText.

As you can see, adding custom entries to the SQL Server Full Text Search Thesaurus is not hard, there’s just some steps you need to follow in order to make it happen. Once you know how, you can use the functionality to make searches for your users more productive.

Advertisements