Improving Retrieval Performance

Downloads
27363.zip

Storing duplicate copies of the same word in the full-text table allows the possibility of duplicate results. Using the table FT_EmailSubject that Listing 2 in the main article creates, assume that FT_EmailSubject contains the row values that Table A shows. If you perform a search for the word Rose, the query returns MessageID 1 twice unless you collapse the MessageIDs by using a DISTINCT clause or a GROUP BY MessageID clause. But both of these methods cause an additional sorting step at query time.

To avoid this sorting cost, you can rewrite the queries to use an EXISTS predicate instead of a DISTINCT clause, as the query in Listing A shows. An alternative to this approach that reduces storage space and improves performance for some queries is to use a table like the one that Listing B creates, FT_EmailSubject2. MessageID and Word comprise a composite primary key on this table. In filling this table, the population process must filter out duplicate words. This filtering pushes the sorting cost onto the population process instead of making SQL Server perform a sort during each search.

Another improvement you can make is to include a NumAppearances field in the FT_EmailSubject2 table and store the number of times each word appears in the original string. You can use this field to indicate how relevant the results of a query are. Note that in both of these variations, the ability to perform nearness and phrase searches is lost because information about where the word appears in the original string is no longer available. If the position of individual words in a string isn’t stored, the searches available are effectively limited to OR and AND searches.

Comments

Plain text