Skip navigation

The Search Path Less Traveled

Don't take a keyword-array shortcut; you might get lost

Download the Code iconYou've been using keywords to help you locate data since you were a kid. Maybe you used a card catalogue the first time you went to your local library, or maybe the nuances of online search engines taught the youthful you about keywords. However you learned about them, you probably use keywords for database design, sorts, and searches in your daily work. Because keywords are so familiar to us, it's easy to use them to create shortcut database designs, but be careful; such shortcuts can hurt performance. Keyword arrays are a common example of a bad database-design shortcut. Keyword arrays hurt database performance because they often force table scans. Using an array can inhibit the query process by shielding or "hiding" some of the objects a user is seeking if the user doesn't write the query exactly right.

Let's look at an example that shows how common and problematic keyword-array design is. Recently, I was using a Web application from a screen-printing service to create a t-shirt design. I went to a commercial Web site I had visited before because I remembered that it contained five or six American Southwest images. I wanted an image of Kokopeli, the hunchbacked flute player, a Southwestern Native American symbol of joy. However, when I entered the keyword southwest, the Web application returned only four images, none of them the image that I was searching for. Figure 1 shows a screen shot of the images the Web application returned. Fortunately, I have access to the underlying database, so I used Query Analyzer to perform a direct keyword search of the image table.

Figure 2 shows the result of a direct keyword search I ran against the asset table, using wildcard characters (%) to find '%southwest%' images. Let's look at the layout of the asset table in Figure 2. Notice that images aren't stored directly in the table. Instead, each table row contains pointers (in the asset_image column) to images on the Web site—in this case, the pointers are .jpg file names and the image files are stored on the Web server. Each table row also contains a column for an identifier (ID), a number or string value that's a common identifier for the image (name), and words that describe each image (keywords). You can see that the database contains more than the four Southwestern images that Figure 1 shows. My keyword search returned seven images.

Because keywords are strung together in a single field, both the Web application and I have to use wildcard characters to improve query accuracy. For a good explanation of how to use wildcard characters in a query, see the "Pattern Matching in Search Conditions" section of SQL Server 2000 Books Online (BOL). The design of Figure 2's table will eventually result in poor database performance because it forces table scans when the query involves a pattern match and the wildcard character is at the beginning of the string expression. If the wildcard character is somewhere in the middle or at the end of the string expression, the optimizer can use an index—assuming that an index is available.

Although table scans have been optimized in SQL Server (as the "SQL Server Architecture: Reading Pages" section of BOL explains), database performance with large tables almost always benefits from using an index rather than a table scan (see the BOL section "Creating and Maintaining Databases: Designing an Index"). Even if the keywords column had been indexed in this database, SQL Server wouldn't have been able to use the index to help with data retrieval for any term that wasn't located at the beginning of the string.

The keywords column, which stores keywords as array strings, makes this table a victim of bad design. Arrays can be useful structures within a programming language; they let you easily and efficiently manipulate data outside the database. However, an array is a poor storage structure within a relational database. One way to describe the problem is to say that the asset table is unnormalized. (For more information about data normalization, see my article "Normalization: Why You Need It," March 1999.) When you use an array to store keywords, you stuff a many-to-many (M:N) relationship into a single table. Each row in the asset table contains a M:N relationship between the image and the keywords. An image is an entity that can be described by one or more keywords; a keyword is an entity that can describe one or more images.

For example, Figure 2 shows five keywords (Southwest, art, bull, native, American) that describe image ID 3677. Because each of those keywords describes many images in addition to image ID 3677, the keyword must be repeated in many images' keywords column. Figure 2 shows the massive duplication of terms this design generates. In addition to data duplication, the keyword arrays you see in Figure 2 give you some idea of how you'd need to construct your SELECT query to extract the data you need. But an ordinary customer of this Web site doesn't have access to the underlying data, and an ordinary customer most likely wouldn't know how to write a SELECT query. The ordinary customer is at the mercy of the Web developers who wrote the queries or the database programmers who wrote the stored procedures that generated the results in Figure 1.

But database performance problems and duplication of terms aren't the worst effects of keyword-array design. The main reason to avoid keyword arrays is that they confound successful query writing. Because data entry isn't controlled or standardized, we find the kind of variations on a term (e.g., Southwest instead of Southwestern) that led to my first failed query. The possibility of misspelling, omissions, or incorrect definition (I wouldn't find any of these images if I query for frontier or Anasazi) means that the Web site could contain even more images of interest to me than the seven I ultimately found. Listing keywords and key phrases as a continuous string with no separators between words and phrases also obscures the intent of the person who entered the data. Instead of the five keywords listed for image 3677, perhaps the creator intended to describe the image with one word (bull) and two key phrases (Southwest art and Native American). We have a problem of lost intent and mixed meaning regarding the keywords.

Figure 3 shows the proper way to design any M:N relationship: by using three tables. In this case, we need an IMAGE table, a KEYWORD table, and an intersection table (IMAGEKEYWORD) that associates keywords with images. You should add a properly designed UI for data entry to restrict keyword entry to terms that are present in the KEYWORD table. The interface will reduce the kind of variations on keywords that led to my problem.

Web Listing 1, which you can download at InstantDoc ID 44273, contains the code that you can use to create the three tables that Figure 3 shows, and you can populate the tables by using the scripts that Web Listing 2 contains. Then, you can test the data and run the JOIN query that Listing 1 shows to see how easy it is to return all seven Southwestern images.

Arrays seem to be a familiar, comfortable, and easy design shortcut. But you don't need to be a trained database designer to see that unnormalized data is a shortsighted shortcut that will lead to poor database performance and unsuccessful query results. Stick with simple principles of good database design, such as using three tables to represent an M:N relationship, and you'll save time and trouble.

Hide comments


  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.