SQL Server Full Text Searching at SQL Saturday #22 Pensacola FL

Today I had the opportunity to hang out with a bunch of cool people at SQL Saturday 22 in Pensacola FL. My presentation is SQL Server Full Text Searching, a Guide for Dev’s and DBAs. While I’ve done this presentation in the past, it’s been updated with new material and demos. You’ll find all the code, and the slide deck at the Code Gallery site,

http://code.msdn.microsoft.com/SqlServerFTS

Thanks for coming!

Alabama Code Camp Mobile 2010

Last Saturday was the Alabama Code Camp, held in Mobile AL. For those unfamiliar with the Alabama Code Camps, we hold on average two a year, and they shift from city to city with different user groups acting as the host group. Other cities include Huntsville, Birmingham, and Montgomery. This time though the Lower Alabama Dot Net User Group under the leadership of Ryan Duclos hosted, and what a great event it was. Everything ran smoothly, there was plenty of drinks and pizza to go around, and some good swag to boot. A big congrats to Ryan and his team of volunteers for a great event, also thanks to Microsoft for sponsoring and the University of South Alabama for the venue.

I was kept busy at this code camp, doing three sessions. The first session was “Introduction to Microsoft PowerPivot”. The slide deck can be found at https://arcanecode.files.wordpress.com/2010/01/powerpivot_long.pdf. To see all my PowerPivot posts, simply pick it in the categories to the right or use this link: https://arcanecode.com/category/powerpivot/.

My second session was on Full Text Searching. You can find code samples and the PDF for the presentation at my code gallery site, http://code.msdn.microsoft.com/SqlServerFTS.

The final presentation was an introduction to Business Intelligence and Data Warehousing. Here is the link to the presentations slides in PDF format. As promised in the session I added the additional information for the Kimball Group book.

A quick apology for my delay in posting, a nasty head cold has had me in Zombie land since I got back. Thanks to all who attended, I appreciate you being very interactive, lots of questions, and very attentive. I look forward to the next time Mobile hosts the Alabama Code Camp.

SQL Saturday Redmond – October 3 2009

I am fortunate enough to be able to give three presentations at Redmond WA’s SQL Saturday event. The first session is “Introduciton to Data Warehousing / Business Intelligence”. Here is the PDF slide deck for that presentation. (Right click and save as if you want to save a copy for later reference).

The second presentation is SQL Server Full Text Searching. You can find the slide deck in PDF format as well as sample code at http://code.msdn.microsoft.com/SqlServerFTS.

The final presentation of the day was Introduction to SQL Server Integration Services. The sample project, slide deck, and step by step instructions can be found at http://code.msdn.microsoft.com/introssis . In addition I also showed how to call SSIS from a .Net application. You can find that sample at http://code.msdn.microsoft.com/ssisfromnet .

Full Text Searching a FILESTREAM VARBINARY (MAX) Column

In the past I’ve written that Full Text Searching has the ability to index documents stored in a VARBINARY(MAX) field. However, I have never really gone into any details on how to do this. Today I will remedy that by demonstrating how to Full Text Seach not only using a VARBINARY(MAX) field, but one that has been stored using FILESTREAM. Even though these examples will be done against the data we’ve stored with FILESTREAM over the lessons from the last few days, know that this technique is identical for binary objects stored in a VARBINARY(MAX) field without using FILESTREAM.

Let’s start by creating a catalog to hold our Full Text data.


CREATE FULLTEXT CATALOG FileStreamFTSCatalog AS DEFAULT;

Pretty normal, now we need to create a full text index on the “DocumentRepository” table we created in this series. When you look at the syntax though, you may notice a minor difference from the CREATE FULLTEXT INDEX examples I’ve shown in the past:


CREATE FULLTEXT INDEX ON dbo.DocumentRepository
(DocumentName, Document TYPE COLUMN DocumentExtension)
KEY INDEX PK__Document__3214EC277F60ED59
ON FileStreamFTSCatalog
WITH CHANGE_TRACKING AUTO;

Here you can see I am indexing two fields. The first is the “DocumentName”, which is passed in as the first parameter and looks like other examples. We won’t actually be using it in this example, however I included it to demonstrate you can index multiple columns even when one of them is a VARBINARY(MAX) column.

The second parameter indexes the VARBINARY(MAX) “Document” column itself, but notice the TYPE COLUMN after the column name. In order to Full Text Index a VARBINARY(MAX) column you must also have a column with the file extension in it. You then pass in the name of column after the TYPE COLUMN. In this example, the document extension is stored in the “DocumentExtension” column. Since the document extension can be stored in a column with any name, we let the Full Text engine know which column by passing it in after the TYPE COLUMN keyword. The remainder of the command is like other examples I’ve shown in the past.

Now we can run a normal SELECT…CONTAINS query against the “Document” field.


SELECT ID, DocumentName 
FROM dbo.DocumentRepository
WHERE CONTAINS(Document, 'Shrew');

I’ll leave it to you to run, for me it returned one row, with “TheTamingOfTheShrew.doc”. If you want to try it again, use “Elinor”, and you should get back “KingJohn.doc”.

As you can see, performing a Full Text Search against a VARBINARY(MAX) column is quite easy, all you have to do is indicate the document type by using the TYPE COLUMN. There are two more things you should know. First, the column containing the document extension must be of type CHAR, NCHAR, VARCHAR, or NVARCHAR. Second, the document type must be recognized by SQL Server. To get a list of all valid document types, simply query the fulltext_document_types catalog view like so:


SELECT * FROM sys.fulltext_document_types;

This will give you a list of all file extensions understood by SQL Server. Each row actually represents a filter. Each filter represents a DLL that implements the IFilter interface. It is possible to add additional filters to the system. For example, Microsoft offers the “Microsoft Filter Pack”. You may have noticed that out of the box SQL Server 2008 supports the older Office 2003 documents, but not the more recent Office 2007 formats. To add these newer formats to your SQL Server, Microsoft provides the afore mentioned filter pack. While installing it is beyond the scope of this aritcle you can find complete instructions for downloand and installation at http://support.microsoft.com/default.aspx?scid=kb;en-us;945934 .

The Full Text Search features provided by SQL Server continue to amaze me with how powerful they are, yet how easy they are to implment. With the information here you can easily search through documents stored in a VARBINARY(MAX) field, even when those documents are actually stored via the new SQL Server 2008 FILESTREAM.

Gentleman, JumpstartTV Your Engines

Thought I’d spread a little link love today, and to start with I will point you to the http://jumpstarttv.com website. JumpstartTV hosts short training videos with one very specific, focused topic per video. When I say short, I mean short. Three to five minutes is the goal for each video. I was honored recently when asked to participate in the site, and have created a series for them on SQL Server Full Text Searching. The first video on installing was featured yesterday, but you don’t have to wait for the videos to be featured, you can see all of them by jumping to my JumpstartTV profile.

One thing to note, you will be asked to create an online profile. This is free, and it turns out very useful. You can use it to track all of the videos you watched. This makes it very convenient to come back later and refresh yourself on something you learned. In addition, the site has a “watch it later” feature. You can go all over the site picking out videos you think would be interesting and clicking the “watch it later” link. Then when you go to your profile, you’ll be able watch the selected videos one after the other. JumpstartTV has videos on both SQL Server and .Net, as well as some interesting ones in the “Misc” category, including bartending, self defense, and more.

The second link for the day is an interesting article on the simple-talk website, “Taking Back Control of your IT Career”. It was written by a friend of mine, Stephan Onisick and chronicles his ordeal of getting laid off from his company of seven years, through a period of retraing himself and ultimately landing a new job that met the needs he set out. Even if your company is nice and stable, you will find good advice for keeping your skills up in this article. Disclaimer, he does mention a presentation I did in the article, but in spite of that it’s still a good read. 😉

Next is a new SQL Server resource brought to us by the fine folks at Quest Software, it’s the new SQLServerPedia. The site is both a wiki and a series of podcast like videos you can subscribe to from your Zune or other music player. I have my Zune setup to automagically download new episodes as they come out. I believe it was @BrentO himself who clued me in on the site.

I’ve written in the past about CodeRush, the tool I refuse to code without. Well the wonderful folks at Devexpress have created a free version called CodeRush Xpress for Visual Studio. Now if you need to code on a budget, you can still enjoy CodeRushy goodness in your 2008 IDE! And it’s not even Christmas yet!

Many of you follow me on Twitter, if you don’t I’d love to invite you, I”m on as @arcanecode . Guy Kawasaki has a great article on How To Pick Up Followers on Twitter. Good article that shows some of the strengths of Twitter, and how to use them to everyone’s advantage.

Speaking of Twitter, thanks to @theronkelso I found a new service called TweetLater. This service lets you schedule a tweet to be delivered to Twitter at a later time. For example, I would like to be able to tweet that our BSDA meeting is about to begin. But as the current President I’m usually up front introducing the guest speaker, and thus not at a keyboard. TweetLater to the rescue, I can set it to auto post the meeting is starting and be in two places at once.

It’s also great as a reminder tool, I can queue up meeting reminder tweets for the entire year ahead of time and forget all about it. Another feature, you can set it to auto reply with a message to new followers, and it can even be setup to automatically follow anyone who is following you. I believe this is a resource I’ll be using a lot.

The next to final link is a reminder really, to the Alabama Tech Events site. This is a community site for posting technical events of interest to folks in the state of Alabama. Please note that the event doesn’t have to be in Alabama, just of reasonable interest to folks in the state. We’ve posted events in Tennesee, Mississippi, Florida and Georgia. If you have a technical event contact me or one of the other user group leaders to get it added.

I’ll wrap up today’s link lovefest with the site analogous to the Alabama Tech Event site, but for the entire country: Community Megaphone. This site lists events from all over the United States. You can filter by state or event type.

Presenting SQL Server 2005 2008 Full Text Searching at the TechEd 2008 InBetween conference

On June 7th, 2008 I am presenting “Getting Started with SQL Server 2005/2008” at the InBetween SQL Saturday conference that is taking place between the two weeks of Microsoft TechEd. This post will have all the links relevant to my talk.

Detailed descriptive material can be found on my blog, ArcaneCode.com, with links to specific posts below.

First off, the slides and sample code can be located at the Code Gallery site I setup specifically for Full Text Searching with SQL Server:

http://code.msdn.microsoft.com/SqlServerFTS

Look on the downloads page to see various projects around SQL Server Full Text Searching. I’ve created one “release” for each of the projects around FTS. Be sure to look on the right side at the various releases in order to see the various projects.

Next, you can get started with the basics by reading these entries on my blog:

Lesson 0 – Getting the Bits to do Full Text Searching in SQL Server 2005
Lesson 1 – The Catalog
Lesson 2 – The Indexes
Lesson 3 – Using SQL
Lesson 4 – Valid Data Types
Lesson 5 – Advanced Searching

After that you’ll be ready for some advanced topics.

Can you hear me now? Checking to see if FTS is installed.
Exploring SQL Servers FullTextCatalogProperty Function
Using the ObjectPropertyEx Function
Using FORMSOF in SQL Server Full Text Searching
Creating Custom Thesaurus Entries in SQL Server 2005 and 2008 Full Text Search
Creating and Customizing Noise Words in SQL Server 2005 Full Text Search
Creating and Customizing Noise Words / StopWords in SQL Server 2008 Full Text Search
Advanced Queries for Using SQL Server 2008 Full Text Search StopWords / StopLists

Advanced Queries for Using SQL Server 2008 Full Text Search StopWords / StopLists

Yesterday I covered the basics around StopWords/StopLists in SQL Server 2008. Today we’ll look at some advanced queries you can use to get more information back about FullText Indexes, StopWords, and StopLists.

The first query simply returns a list of all of the catalogs in the system.

— Get current list of full text catalogs

select [name] as CatalogName

    , path

    , is_default

  from sys.fulltext_catalogs

 order by [name];

 

The next query returns a list of all the StopLists.

 — Get the list of StopLists

 select stoplist_id

      , name

   from sys.fulltext_stoplists;

 

This query returns a list of StopWords in the database. Note the linking to get the associated StopList name and language.

 — Get list of StopWords

 select sl.name as StopListName

      , sw.stopword as StopWord

      , lg.alias as LanguageAlias

      , lg.name  as LanguageName

      , lg.lcid  as LanguageLCID

   from sys.fulltext_stopwords sw

   join sys.fulltext_stoplists sl

    on sl.stoplist_id = sw.stoplist_id

   join master.sys.syslanguages lg

    on lg.lcid = sw.language_id;

 

This next query gets a list of all of the stopwords that ship with SQL Server 2008. This is a nice improvement, you can not do this in SQL Server 2005.

— Get a list of the System provided stopwords  

select ssw.stopword

    , slg.name

  from sys.fulltext_system_stopwords ssw

  join sys.fulltext_languages slg

    on slg.lcid = ssw.language_id;

 

My next query returns a list of all the Full Text Indexes in the database.

— List full text indexes

select c.name as CatalogName

    , t.name as TableName

    , idx.name as UniqueIndexName

    , case i.is_enabled

        when 1 then ‘Enabled’

        else ‘Not Enabled’

       end as IsEnabled

    , i.change_tracking_state_desc

    , sl.name as StopListName

  from sys.fulltext_indexes i

  join sys.fulltext_catalogs c

    on i.fulltext_catalog_id = c.fulltext_catalog_id

  join sys.tables t

    on i.object_id = t.object_id

  join sys.indexes idx

    on i.unique_index_id = idx.index_id

       and i.object_id = idx.object_id

  left join sys.fulltext_stoplists sl

    on sl.stoplist_id = i.stoplist_id

 

 

This query returns a list of all the document types SQL Server 2008 understands when they are placed in a varbinary(max) field.

— List all of the document types SQL Server 2008 will understand in varbinary(max) field

select document_type

    , path

    , [version]

    , manufacturer

  from sys.fulltext_document_types;

 

If your full text performance begins to suffer over time, you might want to check and see how many fragments exist. If you have multiple closed fragments, you should consider doing a REORGANIZE on the index (using alter fulltext index). This query will tell you how many fragments exist for your full text index.

— See how many fragments exist for each full text index.

— If multiple closed fragments exist for a table do a REORGANIZE to help performance

select t.name as TableName

    , f.data_size

    , f.row_count

    , case f.status

        when 0 then ‘Newly created and not yet used’

        when 1 then ‘Being used for insert’

        when 4 then ‘Closed ready for query’

        when 6 then ‘Being used for merge inpurt and ready for query’

        when 8 then ‘Marked for deletion. Will not be used for query and merge source’

        else ‘Unknown status code’

       end

  from sys.fulltext_index_fragments f

  join sys.tables t on f.table_id = t.object_id;

 

There you go, a handful of powerful queries to help you query and maintain the state of your full text indexes.

Creating and Customizing Noise Words / StopWords in SQL Server 2008 Full Text Search

While SQL Server 2005 had Noise Words for its full text search, SQL Server 2008 has moved to StopWords. The good news is the change is much deeper than just a rebranding. I say good news, because with the change comes a lot more flexibility and functionality.

There are actually two new tools introduced with SQL Server 2008, StopWords and StopList. A StopList acts as a named container for a group of StopWords. You can then associate a StopList with one or more tables. This is a great enhancement over Noise Words, which applied to the entire server. Now you can associate a group of StopWords, in a StopList, with specific tables without affecting the rest of the tables on the database or server.

First, let’s run a query to demonstrate. I’ll be doing all my testing against the AdventureWorks2008 database. If you don’t have it, you can download it at the CodePlex site. I’ve also used the standard Create Catalog command to create a full text catalog named AdvWorksCatalog. I then put a full text index on the Production.ProductDescription field. That done, here is a query I’ve used for examples in the past.

–Demo 1 – Query the Index

use AdventureWorks2008;

 

select [Name], ProductNumber, [Description]

from [Production].[Product] p

   , [Production].[ProductDescription] pd

   , [Production].[ProductModelProductDescriptionCulture] pmpdc

where p.ProductModelID = pmpdc.ProductModelID

  and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

  and CONTAINS(pd.[Description], ‘shifting’)

 

Now we need to create the stoplist. Here’s we’ll use the first of the new SQL Server 2008 commands, create fulltext stoplist.

— Demo 2 – Create Stoplist

use AdventureWorks2008;

 

create fulltext stoplist ArcanesStopList;

The StopList will act as a holder for a specific set of words that we want to ignore. We refer to that group of words by the name we gave it, ArcanesStopList. Now we need to add some words to the list. Here are two ways to do so, both use the alter fulltext stoplist command.

— Demo 3

— Add StopWord

 

use AdventureWorks2008;

 

alter fulltext stoplist ArcanesStopList

  add ‘shifting’ language 1033;

 

alter fulltext stoplist ArcanesStopList

  add ‘light’ language ‘English’;

 

The command is straight forward, use the alter fulltext stoplist, give it the name of the list you want to add a word to. Then comes the word add, followed by the word you want to add. Next you have to specify the language. There are two ways to specify the language, either by using the language ID (in my case 1033) or the name for the language. If you don’t know your language, just use this query to find it:

— Here’s how to find your language  

select [name], alias, lcid from master.sys.syslanguages;  

If you were to jump the gun and rerun the query from Demo 1, you’d think it would now ignore the word shifting since we just added it as a StopWord to our StopList. But there’s still one more step.

You need to attach your StopList to a table that has a full text index on it. This is a major improvement over 2005. SQL Server 2008 now allows you to get quite granular with the application of custom groups of words. You are limited to one StopList per table, however. One StopList can be applied to multiple tables. Here’s the code to associate our StopList with a table:

— Demo 4 – Add the StopList to the Full Text Index

use AdventureWorks2008;

 

alter fulltext index on [Production].[ProductDescription]

  set stoplist ArcanesStopList;

The alter fulltext command has been modified to add a set stoplist set of keywords. All we need to do is specify the tablename and the StopList to associate with that table. Now go run the query from Demo 1. You should get back zero rows.

Congratulations! You’ve now associated your StopList with the full text index..

I’m sure however you don’t want to leave it this way, so let’s look at what it will take to clean up the mess. First, you can decide you no longer want the StopList associated with the full text index. Time to use the alter command again.

— Demo 4-2 – Remove the StopList to the Full Text Index

use AdventureWorks2008;

 

alter fulltext index on [Production].[ProductDescription]

  set stoplist system;

Setting the stoplist to the keyword system will change from your custom stoplist to the system stoplist. You can also use the word off instead of system to turn off StopWords all together for the specified table. Now would be a good time to mention, if you want to use the standard system set of StopWords, instead using a custom set, use the system keyword, as you see above.

There may be times when you want to remove just a word or two from a StopList, but not disassociate the entire list. Its possible to easily remove individual words from the list.

— Demo 4-3 – Remove single word from Stoplist

use AdventureWorks2008;

 

alter fulltext stoplist ArcanesStopList

  drop ‘shifting’ language 1033;

 

alter fulltext stoplist ArcanesStopList

  drop ‘light’ language ‘English’

The syntax for the drop is identical to the add, except of course for using the word drop instead of add.

Finally, you may want to drop the StopList all together. There’s a drop command for it as well.

— Demp 5-3 – Drop StopList

use AdventureWorks2008;

 

drop fulltext stoplist ArcanesStopList;

This covers the basic use of StopWords / StopLists in 2008. I believe the ability to associate a specific set of words with one or more specific tables will give your searches greater power and flexibility.

Creating and Customizing Noise Words in SQL Server 2005 Full Text Search

SQL Server 2005 Full Text Search is a powerful feature. It will allow you to search for words in your text fields and get results back in a timely manner. However, certain words are so common they get in the way of effective searching. Having these words indexed would be worthless in terms of effective searching. Words like a, an, the, or, and so on. SQL Server 2005 refers to these words as “noise words”.

The list of noise words that ships with SQL Server 2005 is fine for most users. However, there are times when it can be an advantage to add words to the list. For example, let’s say your company has a rule that it’s name must appear in the header of all internal Word documents. Further, let’s say that all of those Word documents are stored in a varbinary(max) field and full text indexed. Your company name would effectively become a noise word, searching for it would return every record in your table and thus be useless in terms of results. You would then want to add your company name to the list of noise words for SQL Server 2005 to ignore.

Or, let’s take the opposite example. There was a word in the default SQL Server 2005 noise word list that, when in all caps, was an acronym for a piece of equipment in our plants. Thus it was getting omitted from searches. We had to remove that word from the noise word list.

So, how do we change the list of noise words in SQL Server 2005? First we have to track it down. Open up regedit and navigate to this spot in the tree.

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSearch -> Language -> [insert your language abbreviation here]

My instance name is MSSQL.1. Pick your language, since I live in the US, my language will be enu, short for English, US. Don’t confuse it with eng, which is for our friends in Great Britain. Now look for the name NoiseFile. The value for mine is:

C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseenu.txt.

Use any text editor to open and look at the file, it’s a simple text file. Mine only had 129 words in it. Now simply add or remove the words you want (or don’t want) in the noise word list. When done, save the file. If you are using SQL Server 2005 Developer Edition on Vista, make sure to open the text file with admin privileges so you’ll be able to save your changes.

Now for the bad news. (Hey, you didn’t think it was going to be that easy did you?) In order for the noise word file to take effect you must repopulate the full text indexes. You can do the entire catalog at once, or one table at a time. To do the entire catalog, use the alter full text catalog command. For more info on this command, see my post on Getting Started with SQL Server 2005 Full Text Searching: Part 1 – The Catalog.

To change things on a table by table basis, you can use the alter full text index command. Again, rather than repeating myself I’ll refer you back to my post Getting Started with SQL Server 2005 Full Text Searching: Part 2 – The Indexes. Failure to rebuild your indexes will result in bad results. As new items are indexed, or old updated, the new noise word list will apply as SQL Server 2005 applies the full text index to the record. If old records are not updated, you’ll have records that should (or shouldn’t) be in the results and some unhappy users.

Bear in mind one important fact, changes to the noise word file apply to the entire SQL Server, and all the databases running on it. Thus if you include or exclude noise words to support one app, you are affecting every app that touches every full text indexed database on your server.

Finally, if you read my post yesterday on Thesaurus Entries in SQL Server 2005 and 2008, you may wonder why no mention of 2008 and noise words? In SQL Server 2008 noise words have been replaced with a concept called stopwords. I’ll be talking about stopwords in a post in the very near future.

Creating Custom Thesaurus Entries in SQL Server 2005 and 2008 Full Text Search

SQL Server Full Text Search has the ability to search not just for a word, but for various forms of a word based upon the built in thesaurus. To use this ability in your own full text searches, you simply use the FormsOf clause. Here’s an example using the AdventureWorks database, with a full text index on the product description table’s Description field.

select [Name], ProductNumber, [Description]

from [Production].[Product] p

   , [Production].[ProductDescription] pd

   , [Production].[ProductModelProductDescriptionCulture] pmpdc

where p.ProductModelID = pmpdc.ProductModelID

  and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

  and CONTAINS(pd.[Description], ‘FORMSOF(Thesaurus, light)’ )

One of the biggest questions I get asked is, how do we customize the Thesaurus file used by SQL Server? Today we’ll look at how to do just that.

The first step is tracking it down. SQL Server stores the location of the thesaurus in the registry. The location depends on whether you are in SQL Server 2005 or 2008. For SQL Server 2005, open up regedit or your favorite editor and navigate to:

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSearch -> Language -> [insert your language abbreviation here]

In my case, my instance was named MSSQL.1. I live in the US, so my language is “enu”, which stands for English, US. Don’t get this confused with “eng”, that is for our friends in the United Kingdom. (Unless of course that’s where you live in which case do indeed pick eng). There are around 17 supported languages out of the box.

OK, once you’ve navigated to the right spot in the registry tree, look for the key named TsaurusFile. Mine is listed as being located at:

C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\tsenu.xml

In SQL Server 2008 you have to go to two places. First you need to find the path. For that, again open up regedit and navigate to:

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSQLServer

For me, the instance name was MSSQL10.MSSQLSERVER. The name you should look for is “FullTextDefaultPath”. The structure was very similar to the 2005 version:

C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTData\

Now to find the file name. Jump up the registry tree a little to:

HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> [insert your instance name here] -> MSSearch -> Language -> [insert your language abbreviation here]

Just like with 2005, use ENU if you are in the US, or your appropriate language abbreviation. Looking at the name “TsaurusFile”, it probably comes as no surprise to discover the name is “tsenu.xml”.

Now that we’ve located it, let’s navigate to the folder and open it up in your favorite text editor (which if you’ve been reading my past blog posts is ANYTHING other than Notepad). Let’s take a look:

<XML ID=”Microsoft Search Thesaurus”>

<!–  Commented out

    <thesaurus xmlns=”x-schema:tsSchema.xml”>

    <diacritics_sensitive>0</diacritics_sensitive>

        <expansion>

            <sub>Internet Explorer</sub>

            <sub>IE</sub>

            <sub>IE5</sub>

        </expansion>

        <replacement>

            <pat>NT5</pat>

            <pat>W2K</pat>

            <sub>Windows 2000</sub>

        </replacement>

        <expansion>

            <sub>run</sub>

            <sub>jog</sub>

        </expansion>

    </thesaurus>

–>

</XML>

Note there is one minor difference between the 2005 version, shown above, and the 2008 version. In 2008, the <diacritics_sensitive> line is changed to:

       <diacritics = false/>

In either case, you won’t need to adjust this line. One other note, if you are editing this in Vista, be sure to launch your text editor in Run As Administrator mode so you can save the changes.

The first thing to notice is the file is commented out. We’ll want to uncomment as the first step. Next you’ll see the <thesaurus…> section. We won’t need to change it.

The next three sections are examples. Ultimately you’ll delete and replace with your own, but let’s take a moment to look at what is there. The first section is an <expansion>…</expansion> tag. With an expansion, all terms are equivalent. With the expansion tags, if the user enters any one of those terms, it’s the same as if they’d entered all of the terms. Thus in the first example, if a user were to type in “Internet Explorer”, a full text search would return all records that contained “Internet Explorer”, “IE”, or “IE5”.

Replacements are something you’ll use less often. With Replacements, SQL Server does not look for the word in the <pat> (pattern) tag, instead it looks for the word in the <sub> tag. In this case if a user types in W2K, the full text search engine will instead look for “Windows 2000”. The only true life situation I can think of where this would be useful is addresses. If you know your system converts all street or state abbreviations to their full expanded name, then you could use this as a trap of sort with full text searching. For example:

<replacement>

  <pat>St.</pat>

  <pat>Str.</pat>

  <sub>Street</sub>

</replacement>

Thus a user typing in Elm St. would be able to find Elm Street in your system. You could also expand this with states. Let me reiterate this example assumes your source system automatically replaces all abbreviations with their full words. Using a replacement saves the full text search engine for looking for words that won’t be there. Of the two, expansions will probably be the one you use most of the time.

So let’s create a simple example to test out our custom thesaurus. Open up your tsenu.xml file, and let’s change it to:

<XML ID=”Microsoft Search Thesaurus”>

    <thesaurus xmlns=”x-schema:tsSchema.xml”>

            <diacritics_sensitive>0</diacritics_sensitive>

        <expansion>

            <sub>light</sub>

            <sub>TheDoctor</sub>

        </expansion>

    </thesaurus>

</XML>

(Again, in 2008 the diacritics line will be slightly different, just use what 2008 put out there for you.) Here I’m going to make the word light and the word “TheDoctor” substitutable for each other. (I’m listening to the Doctor Who soundtrack as I type this, in case you were wondering. I wanted a word I knew I wouldn’t find in the Adventure Works database.) Save the changes.

Unfortunately just saving the changes is not enough to make the full text search engine pick up the changes to our thesaurus file. We have to tell SQL Server the file has been updated. The procedure differs greatly, depending on whether you run SQL Server 2005 or 2008.

In SQL Server 2005, you unfortunately must restart the entire full text search service. In a production environment, you never want to do this when the engine is busy or in use. I have a recent blog post that talks about the SQL Server Full Text Catalog Property function. In it, I have this handy query for you to be able to tell what the full text service is doing.

select case FullTextCatalogProperty(‘AdvWorksCatalog’, ‘PopulateStatus’)

          when 0 then ‘Idle’

          when 1 then ‘Full population in progress’

          when 2 then ‘Paused’

          when 3 then ‘Throttled’

          when 4 then ‘Recovering’

          when 5 then ‘Shutdown’

          when 6 then ‘Incremental population in progress’

          when 7 then ‘Building index’

          when 8 then ‘Disk is full. Paused.’

          when 9 then ‘Change tracking’

          else ‘Error reading FullTextCatalogProperty PopulateStatus’

        end

As long as it returns “Idle” you are clear to reset. Even so, I would make sure to do this during a database outage, especially with a production database. You’ve been warned!

To reset, go to Start, Control Panel, Administrative Tools, Services. Scroll down to SQL Server Full Text Search (MSSQLSERVER), right click and Restart. That should get it for SQL Server 2005.

In SQL Server 2008 it’s a good deal simpler. Make sure you are in the right database for your catalog (in this case AdventureWorks) and issue the command:

exec sys.sp_fulltext_load_thesaurus_file 1033;

go

The 1033 on the end refers to the local identifier (LCID) for the language of your thesaurus file. 1033 is for English, US. To discover your LCID, use this simple query:

select [name], alias, lcid from master.sys.syslanguages

OK, we’re at the finish line. Assuming your full text engine has restarted, it should have picked up your new, customized thesaurus file. Now we can alter our original query to:

select [Name], ProductNumber, [Description]

from [Production].[Product] p

   , [Production].[ProductDescription] pd

   , [Production].[ProductModelProductDescriptionCulture] pmpdc

where p.ProductModelID = pmpdc.ProductModelID

  and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

  and CONTAINS(pd.[Description], ‘FORMSOF(Thesaurus, TheDoctor)’ )

And we should get back the same results as the original query.

Please note that all remarks for SQL Server 2008 are based upon the February 2008 CTP 6 release. As future release candidates and final editions are released, some of the information could change. Be sure to check books on line for the latest info.

A final thing to note, if you read yesterday’s post you will note that the FullText keyword uses the thesaurus as part of it’s operation. If you have a custom thesaurus, it’s entries will also be used by FullText.

As you can see, adding custom entries to the SQL Server Full Text Search Thesaurus is not hard, there’s just some steps you need to follow in order to make it happen. Once you know how, you can use the functionality to make searches for your users more productive.

Using FormsOf in SQL Server Full Text Searching

In the past I’ve talked about some advanced text searching techniques for SQL Server Full Text Searching. Another you can take advantage of is FormsOf. Let’s say we’ve setup a full text search index on the Adventure Works ProductDescription tables’ Description field. (For more info on how to do this, go to the Arcane Lessons page of this site and scroll down to the “Getting Started With SQL Server 2005 Full Text Searching” section.)

FormsOf has two ways to use it, the Inflectional mode and the Thesaurus mode. Let’s look at an example, and then I’ll explain the differences. For this example, we’ll combine data from several tables to get back some meaningful information from the AdventureWorks database.

— Example 1 – FORMSOF INFLECTION

 

select [Name], ProductNumber, [Description]

  from [Production].[Product] p

    , [Production].[ProductDescription] pd

    , [Production].[ProductModelProductDescriptionCulture] pmpdc

 where p.ProductModelID = pmpdc.ProductModelID

   and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

   and CONTAINS(pd.[Description], ‘FORMSOF(Inflectional, light)’ )

 

— Example 2 – FORMSOF THESAURUS

 

select [Name], ProductNumber, [Description]

  from [Production].[Product] p

    , [Production].[ProductDescription] pd

    , [Production].[ProductModelProductDescriptionCulture] pmpdc

 where p.ProductModelID = pmpdc.ProductModelID

   and pmpdc.ProductDescriptionID = pd.ProductDescriptionID

   and CONTAINS(pd.[Description], ‘FORMSOF(Thesaurus, light)’ )

Both examples use the same, I admit, rather bizarre syntax. Inside the string you pass to the CONTAINS clause, you put FORMSOF, with the word Inflectional or Thesaurus, a comma, then the word or phrase you want to search form. In this example I use Contains, but FormsOf also works with FreeText as well.

So what is the difference between Inflectional and Thesaurus? Inflectional finds all of the tenses of a word. For example, if you passed in Start, Inflectional will find Start, Started, and Starting. For nouns, Inflectional finds the single, plural, and possessive forms.

Thesaurus works like you would expect a thesaurus to. It will find variations of a word, Start, Begin, etc. Essentially, words that have the same meaning.

How do these relate to Contains and FreeText? Normally Contains looks for an exact match. FullText matches the meaning, but not the exact words in the query. FullText is very similar to Thesaurus, and even uses the Thesaurus in its work. FullText will break out the search string into its individual words, if there is more than one word. For each word, it generates the inflectional forms of the word, then identifies a list of matches for each word based on the thesaurus. FormsOf(Thesaurus… on the other hand just uses the thesaurus to do the search, without going through the inflectional step.

Using a combination of Contains, FreeText, and FormsOf you can give your users some real flexibility, ranging from exact matches to wide open searches.

SQL Server Full Text Search – The ObjectPropertyEx Function

It can be quite useful to have a query that will list all of the tables in your database that currently hold full text indexes. Fortunately there is a function we can use, ObjectPropertyEx. We’ll combine it with a system view named sys.tables which gives us a list of all the tables in the current database.

Note, for todays example I will continue to use the AdventureWorks database, AdvWorksCatalog, and index I created in yesterday’s post on the FullTextCatalogProperty function.

ObjectPropertyEx takes two parameters. The first is the object ID of the table we want to work with. That’s why using sys.tables view is so nice, it already containts both the name of the table and it’s object ID. The second parameter is the name of the property we want to get, in this case TableHasActiveFulltextIndex will tell us if the table has an FTS (Full Text Search) index or not. Here’s a query that will list all of the tables with a full text index:

select t.[Name] as TableName

    , ObjectPropertyEx(t.[object_id], ‘TableHasActiveFulltextIndex’) as IsFullTextIndexed

from sys.tables t

where ObjectPropertyEx(t.[object_id], ‘TableHasActiveFulltextIndex’)  = 1

order by t.[Name]

TableName            IsFullTextIndexed
——————– ——————
ProductDescription                   1

This returns a list of the table name, and the value of 1 indicating the table does have a full text index on it. To get a list of all tables, simply omit the where clause. Tables without an full text index will have a 0 for the IsFullTextIndexed value.

We can further extend this function by using a second property, TableFulltextItemCount. This will tell us how many rows exist in the full text index for this table.

select t.[Name] as TableName

    , ObjectPropertyEx(t.[object_id], ‘TableFulltextItemCount’) as NumberOfRows

from sys.tables t

where ObjectPropertyEx(t.[object_id], ‘TableHasActiveFulltextIndex’)  = 1

order by t.[Name]

TableName           NumberOfRows
——————- ————-
ProductDescription           762

In this case the number of rows matches the number of rows in the table.

This is because I have change tracking set to auto, and have a speedy system. Be aware there are times when this number won’t match the row count. This would primarily be when you have change tracking set to manual, have inserted new rows in to the source and have not issued the command to force the update of the full text index.

By comparing the TableFulltextItemCount to the number of actual rows in the source table, you can monitor your system health and activity. If you see the numbers grow ouside the normal boundaries, it can be an indication something is wrong. You can also use it for adjusting the frequency of manual update commands. If the numbers grow apart too much for your comfort, you may wish to make the updates occur more frequently. Conversely, if most of the time the numbers match you may wish to reduce the frequency of manual updates.

It’s not absolutely required to go to the sys.tables in order to get the count. If you already have the name of the table, you can use it in combination with the object_id function to get the count for that one table.

select ‘Production.ProductDescription’ as TableName

    , ObjectPropertyEx(object_id(‘Production.ProductDescription’)

                        , ‘TableFulltextItemCount’) as NumberOfRows

TableName           NumberOfRows
——————- ————-
ProductDescription          762

Using the above logic you could easily create a function to pass in the name of a single table and get the number of rows in the full text index.

As you can see, using ObjectPropertyEx with its two simple full text search related properties will give you some useful functionality for monitoring your system.

SQL Server Full Text Search – The FullTextCatalogProperty Function

There is a useful function built into SQL Server for dealing with Full Text Search Catalogs: FullTextCatalogProperty . It takes two parameters, the first is the name of the catalog, the second is the name of the property you want. While there are quite a few properties in the list, most of them have been depreciated. However, there are still a handful that can provide valuable information.

For todays examples, I am using the AdventureWorks database. I’ve created a catalog named AdvWorksCatalog, using the following command:

create fulltext catalog AdvWorksCatalog as default;

I then created one index, using this command:

create fulltext index on Production.ProductDescription

  ([Description])

  key index PK_ProductDescription_ProductDescriptionID

  on AdvWorksCatalog

  with change_tracking auto

For more info on these commands, see my original series on Full Text Search which began in June of last year. (See the Arcane Lessons page for links to all the lessons).

The first property we will look at is IndexSize. You can call it up with the following command:

select FullTextCatalogProperty(‘AdvWorksCatalog’, ‘IndexSize’)

This will return the size, in megabytes (MB) of the indexes in the catalog. Since we only have one in this example, it is still quite tiny and thus will return a 0 for the result.

Next, let’s say you’d like to monitor the growth of the catalog by monitoring the number of items. To help with that, SQL Server provides a property named ItemCount.

select FullTextCatalogProperty(‘AdvWorksCatalog’, ‘ItemCount’)

In my case, it returned a value of 762 items in the catalog. An item is the same thing as a row in the source table. In my example, a select count from the Production.ProductDescription table also has a count of 762. These match since I’ve only created a full text index on one table in the database. In your system it will be the grand total number of rows for all the tables you have created indexes for.

A similar statistic is the UniqueKeyCount property, produced with this command:

select FullTextCatalogProperty(‘AdvWorksCatalog’, ‘UniqueKeyCount’)

This will tell you how many unique words were found in the catalog. In the Adventure Works database I’m using for SQL Server 2005, the count was 3,195. This would be like doing a select distinct word on the full text catalog. Each word may appear multiple times (once for each row it was found in) but will only be counted once in the UniqueKeyCount total.

The final property that is useful is PopulateStatus. This returns a value from 0 to 9 indicating what the full text engine is doing right now.

select FullTextCatalogProperty(‘AdvWorksCatalog’, ‘PopulateStatus’)

Since I didn’t want to constantly reference the on-line help to decipher the value, I added a little case statement to my SQL and will let it do the work:

select case FullTextCatalogProperty(‘AdvWorksCatalog’, ‘PopulateStatus’)

          when 0 then ‘Idle’

          when 1 then ‘Full population in progress’

          when 2 then ‘Paused’

          when 3 then ‘Throttled’

          when 4 then ‘Recovering’

          when 5 then ‘Shutdown’

          when 6 then ‘Incremental population in progress’

          when 7 then ‘Building index’

          when 8 then ‘Disk is full. Paused.’

          when 9 then ‘Change tracking’

          else ‘Error reading FullTextCatalogProperty PopulateStatus’

        end

There, much more user friendly. I would envision this would be most useful when preparing to do upgrades to the server. You’d want to be sure all full text activity had ceased before doing a server reboot, or applying a lot of SQL Scripts that altered the full text index structures.

In all of these examples, be aware SQL Server will return a NULL if there is an error. For example, you pass in the name of a catalog that does not exist, or the property is incorrect.

SQL Server Full Text Searching – Can you hear me now?

There are times when you might need to know, programmatically, if the Full Text Search component for SQL Server is even installed, and if so is it installed for a particular database. In SQL Server 2008 this won’t be an issue, as (according to the books on-line, and remember this is still an unreleased product) the full text search engine is integrated into the product and thus will always be there. However, in SQL Server 2005 the Full Text Search engine is an optionally installed component. Worse, by default the FTS engine is not installed, you have to explicitly install it.

Therefore, before setting up your app on a new server it would be a wise precaution to run a simple query and confirm the engine is there. It turns out there’s actually two methods for doing so:

select ServerProperty(‘IsFullTextInstalled’)

and

select FullTextServiceProperty(‘IsFulltextInstalled’)

both behave the same, they return a 1 if FTS is installed, if not installed they return a 0. Be aware they will also return a Null if there is an error (for example, you misspelled IsFulltextInstalled). If it turns out not to be installed, there’s not a great deal programmatically you can do to install new software on the server, you’ll have to have a manual intervention on the part of the system admin or DBA.

Once you have confirmed the engine is installed on the server, you should then check to see if the database has had full text enabled for it. Another simple query fills the bill. The first parameter we pass is the name of a database, the second is the name of the database property we wish to examine, in this case IsFullTextEnabled:

select DatabaseProperty(‘AdventureWorks’, ‘IsFullTextEnabled’);

Like the others, it returns 1 if it’s enabled, 0 if it’s not, or a Null if there was an error (for example, passing in a database name that doesn’t exist). Please note the DatabaseProperty does not indicate the presence of a catalog, it merely indicates whether the database has been enabled for full text searching.

Should the database not be enabled (a 0 is returned) you can enable it by calling a stored procedure:

sp_fulltext_database ‘enable’

This will enable full text searching for your current database. You can also call the stored proc with a ‘disable’ parameter to turn off FTS, but why would you want to ! 😉

By the way, to do most of this you will need DBA level privileges. That’s a logical conclusion, especially if you are calling these as part of a program or set of scripts to create or modify a database.

BarCamp Birmingham 2 Presentations

At last Saturday’s BarCamp Birmingham, I gave three presentations. The first was on Virtual PC 2007. For more info on it just look to my previous post, which has the first video on VPC. I’m currently working on the other videos in the series and should have them up this week.

My second presentation was “The Developer’s Experience”. As promised in the session, here’s the complete PDF of my slides: The Developer Experience. This has hyperlinks to all of the tools I presented.

My final presentation was on Full Text Searching on SQL Server 2005.  First, here is a PDF of the PowerPoint slides: Full Text Search Power Points

Next, most of the demos used SQL statements. This PDF file has all of the SQL plus some associated notes. Full Text Search Demo Scripts

Finally, I didn’t get to demo this at BarCamp due to time, but I do have a WPF project that demonstrated how to call a full text search query from a WPF Windows application. Annoyingly enough WordPress (who hosts my blog) won’t let me upload ZIP files, so I renamed the extension to pdf. After you download the file to your drive, remove the .pdf and put the zip extension back on, then it should expand all the source for you correctly. (Yes, I know, I really need to get a host server for binaries, one of these days I’ll get around to it, but for today…) Source for WPF Demo

See you at the next BarCamp!