Counting Instances of a Word in a Record


I'm trying to create a T-SQL query that counts the number of times a word displays in a specific record. For example, I have the query

SELECT long_desc FROM table

Can I determine how many times the word stock appears across all instances of the column long-desc?

T-SQL doesn't provide particularly rich support for string operations such as this one. You could get better performance and richer functionality for solving the problem in the question by using full-text indexing in SQL Server 2000 and 7.0. However, I can describe a T-SQL solution and a set-based solution that work.

SQL is a set-based language, and set-based languages don't provide easy ways to work with subsets of data in a column. To build the query you want, you must first determine how many times the target word, stock, exists in each row.

T-SQL doesn't provide a built-in function to count the number of times a particular string appears within another string, so to find out how many times a word appears in a row, you have to build your own count function. SQL Server 2000 lets you create this kind of user-defined function (UDF). Listing 3 shows a UDF called WordRepeatedNumTimes() that accepts two parameters: @SourceString and @TargetWord. The function returns an integer value that represents the number of times that @TargetWord is repeated in @SourceString.

I've created a sample test to show how you can use this function to determine how many times a word appears across all instances of a column. Create the function in a database called TEST, and create the two tables titles and titles_big by running the script that Listing 4 shows. Titles is a copy of the titles table from the Pubs database. Titles_big has the same structure as titles, but I've removed the primary key and created a table that has 18,000 rows.

The following query counts the number of times that the word the appears in the titles column of the titles table:

SELECT SUM(dbo.WordRepeatedNumTimes
  (title, 'the'))
FROM titles

Let's see what happens when you run the query against the larger data set in titles_big:

SELECT SUM(dbo.WordRepeatedNumTimes
  (title, 'the'))
FROM titles_big

The query still runs, but it took 2 minutes and 23 seconds to run on my laptop. A simple SUM() of the price column returned in a few milliseconds. So, you can learn a few lessons from this example:

  • UDFs let you extend T-SQL in many useful ways.
  • You shouldn't perform an activity just because you can. T-SQL is a set-based language, and you must consider the performance implications of performing nonĀ­set-based activity such as this solution in the database.
  • Usually you can find more than one way to solve a SQL problem. The solution I wrote works and solves the reader's problem. However, in this case, the reader is probably better off thinking outside the box and implementing full-text indexing.

You can also create a set-based solution for counting the number of times a word appears in a string. SQL Server Magazine contributing editor Itzik Ben-Gan suggested this clever and simple solution. The following query shows how many times the word the appears in the title column:

SELECT title, (LEN(title) - LEN(REPLACE(title, 'the', ')))/LEN('the')
FROM titles_big

And the next query sums the occurrences of the across the entire table:

SELECT SUM((LEN(title) - LEN(REPLACE(title, 'the', ')))/LEN('the'))
FROM titles_big

You simply replace all occurrences of the target word with an empty string, then compare the length of the string before and after the replacements. So if the original string is nine characters longer than the replacement string and the target word the is three characters long, you can see that you removed three occurrences of the target word. You can usually replace a row-by-row approach with set-based logic, as Ben-Gan's example shows.

Hide comments


  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.