The First Abnormal Form

Get the data you need from less-than-perfect tables


Let's face reality: The world isn't perfect. All the database tables in the fifth normal form, hyper-efficient queries, and unlimited hardware budgets that you read about in articles and books are the way things should be. Now let's talk about how things often are. You'll inevitably come across some VSDBs (Very Strange Databases) in your career. Most of us have opened a database diagram in Enterprise Manager and wondered, "What in the world were they thinking?" I'm not blaming anyone. Database schemas tend to have unusual logic for several reasons, not just because their designers lacked knowledge of good database-design practices. Small companies usually don't anticipate explosive growth when designing their databases and applications. But a system that handled 100 users well can find itself in a disastrous situation when the company grows faster than expected. These companies rarely consider redesigns or upgrades, so a kind of patched-up database evolves. The Internet era has also contributed to the rise of shoddy database design. Companies expect Web sites to be up and running yesterday in the race for e-business, so developers adopt a mentality that a dear friend of mine calls the "shoot now and aim later" school of application design.

Whatever the reasons, quirkiness prevails in some parts of the database world. Let's look at some practical ways of dealing with two common but unfortunate situations that you'll likely encounter in dealing with relational databases: tables with multivalued columns and tables with repeating groups.

Why Be Normal?

As you know, a set of rules known as normal forms governs relational database design. I won't go into detail about these rules here, but the first rule—or first normal form (1NF)—states that each attribute in a table should be atomic (of a single value) and nonrepeating. Hence, you need to model a one-to-many (1:M) relationship as two entities: one to contain the parent and another to contain the children.

Let's look at an example e-commerce situation involving an imaginary company that sells toys and novelties. (For the example to work, you need to download and run the .sql file that accompanies this article—see "More on the Web" for instructions. Then, create a DSN called ToysDSN that points to the database containing the example tables. You also need a copy of the include file in the Microsoft IIS directory that you're executing the Active Server Pages—ASP—examples from.) The company's legacy database uses an item ID number and an item option (e.g., size, color), which an application eventually writes to an Orders table. Figure 1 shows two tables, Items and Options, which contain the items and the options for each item, respectively.

This schema is a normalized example of a 1:M relationship; it's modeled with one entity for the items and another for the available item options. Item 100, a leisure suit, comes in three sizes—small, medium, and large—and the SQL to find the options for item 100 is straightforward:

SELECT ItemOption FROM Options WHERE
  ItemId =  100

On your Web site, when a user clicks the leisure suit link to show the item details, you want to populate a drop-down list with the item options. With the normalized design, this process requires only the preceding simple SELECT statement, as the ASP code in Listing 1 shows.

Additions to the option list are also straightforward. Say that big and tall people began to email your company, requesting your leisure suit for parties and class reunions. After instructing your supplier to begin manufacturing plus sizes, you can just insert the new options like this:

INSERT INTO Options  (ItemId, ItemOption)
  VALUES (100, "Extra Large")
INSERT INTO Options  (ItemId, ItemOption)
  VALUES (100, "Extra Extra Large")

You can now offer the leisure suit in five sizes, and—this is important—your application code has remained unchanged. The original SELECT statement will populate the drop-down list correctly, no matter how many sizes you put in the Options table.

Now let's look at the same scenario using a table in a state I call the first abnormal form (1ANF). Let's say that your legacy system uses an item ID and option code, but the database was modeled with only table AB1, which Figure 2 shows. As you can see, the designer defined all the items and item options in one entity, with one column for each option. You might even have the misfortune of needing to deal with a variant of a 1ANF table, like table AB2 in Figure 3. Table AB2 has three columns; the third one contains a character string of all available options delimited by a comma. You might wonder what's wrong with creating tables like these, but consider the following: How easy is adding an option? Removing one? Can users search for all items that offer a particular option? Looking at AB1 and AB2, you can see that the answers to these questions aren't clear.

So, how do you effectively deal with tables that are in the first abnormal form? Ideally, you split the original table into two tables and create a 1:M relationship. But in some cases, you won't have that option, as the DBA of the legacy system will be happy to tell you. Legacy databases often have many processes and applications that operate against them, making structural changes to the databases impractical. Proceeding on the assumption that you're stuck with a 1ANF table, let's address tables AB1 and AB2.

AB1: Multivalued Columns

Your ability to add and delete options in table AB1 is quite limited. The nature of this table requires at least as many option columns as necessary to hold the item with the most options—and preferably anticipates that you'll need to add some options in the future. For example, if the item with the most options has six, the table might have eight option columns.

This table design also introduces many nulls into the table. Say your leisure suit hasn't been selling well in the small size, so you decide to discontinue that option. To remove it, you can code

SET ItemOption1 = Null WHERE ItemOption1 =
 "Small" AND ItemId = 100

The problem with this update is that it assumes you know that Small is entered in ItemOption1 and not ItemOption2 or ItemOption3. But what if you had 10 or 20 option columns and you didn't know which one contained the entry Small? Then, the SQL code becomes very repetitive:

SET ItemOption11 = NULL WHERE ItemOption1
 ="Small" AND ItemId = 100
SET ItemOption2 = NULL WHERE ItemOption2
 ="Small" AND ItemId = 100
SET ItemOption3 = NULL WHERE ItemOption3
 ="Small" AND ItemId = 100

Note that your application must perform a separate update for each column the option could possibly exist in. So if 20 option columns existed, the application would need to issue 20 separate UPDATE statements to delete the option Small.

Adding a new option to an item also poses a problem. If all the columns already have entries and you need another option, your only choice is to add a column to the table. Even if you have available columns, you have to decide which column you should add the new option to. In the preceding example, let's assume that you need to add the size Extra Large. As you do when you delete an option, you need a separate update for each column. However, to ensure that you don't replace all null columns with Extra Large, you need a way to know the first time an update succeeds. For this purpose, you can use the @@rowcount system function, which returns the number of rows affected by each UPDATE statement.

The T-SQL block in Listing 2 attempts to insert 'Extra Large' into ItemOption1 and uses @@rowcount to determine whether the update was successful. If not (i.e., if @@rowcount = 0), the code moves on to ItemOption2, and so on. Querying ItemID 100 after running this code shows that column ItemOption1 contains the new value 'Extra Large'.

So, how has the SQL for populating the drop-down list changed? With the following SELECT statement, you can no longer loop though the ItemOption column from your normalized tables to populate the select box:

SELECT ItemOption1, ItemOption2, ItemOption3 FROM AB1 WHERE ItemId = 100

The ASP code must now contain ItemOptions 1, 2, and 3 in separate <option> tags. Also, you now get a blank line if one of those columns is null, so you need to make sure that the column value is not null before inserting it into the drop-down list. As Listing 3 shows, this process involves much more work than before. However, you can decrease the amount of work necessary by creating a stored procedure that inserts the option values into a temporary table. The procedure essentially turns the columns into rows, letting you again loop through the ItemOption column in the temporary table. The procedure in Listing 4 creates a temporary table to store the values from AB1's ItemOption1, 2, 3, ...n columns in one column called ItemOption. The procedure then selects only the non-null values from the temporary table into the resultset. Because you're now dealing with rows instead of columns, you can again loop through the resultset and populate the drop-down list, as Listing 5 shows.

Can you search for all items that offer a particular option? Yes, but this process also involves more work than it needs to. Let's say one of your customers is a cheerful person who wants to find all items that come in yellow. You can easily set up a search form, but what does the SQL look like? In the first example with the normalized tables, this search is a fairly simple join:

SELECT I.ItemName FROM Items AS I Inner Join
 Options AS O
On I.ItemId = O.ItemId AND O.ItemOption =

But for table AB1, the SQL mutates into

ItemOption1 = "Yellow" OR
ItemOption2 = "Yellow" OR
ItemOption3 = "Yellow"

and so on, up to the highest ItemOptionn column.

As you can see, dealing with AB1 produces some long, repetitive SQL that takes much more time to execute than the joins and simple INSERT and DELETE operations from the normalized example. Also, table AB1 has only three option columns. If it had 10 or 20, for example, the above data manipulation language (DML) operations would grow even larger.

AB2: Repeating Groups

Now, let's address the same operations that you issued against table AB1, this time with the other 1ANF table, AB2. Take another look at the structure of AB2 in Figure 3. Whereas AB1 has separate columns for each option, AB2 has a single column containing a character string. So how do you deal with a delimited string in T-SQL when the first normal form states that each column must have one value?

The whole structure of AB2 is antirelational, and I urge you again to split this table into two entities if you can. I would suggest that you shouldn't even attempt string functions in T-SQL. SQL Server and the vast majority of relational database engines are optimized for set-based operations, not string operations. You'll probably get much better performance if your application, not your database engine, handles these particular string functions. But for those who are stuck with a table in this format, let's look at how you can use T-SQL statements to manipulate AB2.

First, you need some system functions: the LEN() function to determine the length of a string, SUBSTRING() to return only a portion of a string, and REPLACE() to replace part of a string with a new one. I also recommend using a function to trim the white space from the strings. In this case, you can use LTRIM(RTRIM(string)) to remove the leading and trailing blanks from string.

Let's run through the basic DML operations on this table. First, inserting a new option is easier than you might think. Let's again insert the option Extra Large. The statement

SET Options = Rtrim(Ltrim(Options)) + 
",Extra Large" WHERE ItemId = 100

updates the options column of the item with an ItemId of 100 by concatenating the text, 'Extra Large' to the end of the existing string. The statement also trims the text to remove any trailing or leading blanks from the new column value. When you query AB2 now, ItemID 100 returns 'Small,Medium,Large,Extra Large' in the options column.

Deleting an option is a bit more involved. You might think a simple REPLACE() would work, but consider the scenario of deleting 'Large' from the option string. If the string also contained other options that use the word Large (such as Extra Large and Extra Extra Large), all occurrences of the characters 'Large' in those options would be replaced as well. The stored procedure DeleteOption, which Listing 6 shows, addresses all placement and duplication problems within the string.

This procedure takes two parameters: @option, which is the name of the option to be deleted, and @itemid, the item to remove the option from. DeleteOption has four conditional blocks that check for the placement of the option you specify in @option. In this example, let's use 'Large' again:

EXEC DeleteOption 100, "Large"

The procedure's first IF checks to see whether 'Large' is anywhere in the middle of the string by surrounding the string 'Large' with commas and using LIKE to test the condition. If it exists in mid-string, 'Large' and its surrounding commas are replaced with a single comma. Surrounding @option with commas ensures that you replace only 'Large' and not the other options containing those letters.

By adding a comma to the end of @option, the second IF block determines whether 'Large' is in the first position in the option string. The code also ensures that no other characters precede it in the string; without this check, the update might simply remove the letters 'Large' from the first option. The procedure then performs an update, using SUBSTRING() to remove the length of @option plus one character to account for the comma. The third IF block is similar to the second, but it determines whether 'Large' is the last option.

The fourth and final IF block determines whether 'Large' is the only option in the string. The fourth block is necessary for any table that allows a null value. The logic of the complete stored procedure is also necessary to avoid having a string full of rogue delimiters or misplaced commas that would confuse the other operations on this table. Executing the DeleteOption procedure sets the Options string for ItemID 100 to 'Small,Medium,Extra Large'.

A search of options is relatively straightforward. To serve the customer who wants to find all yellow items, you can execute

SELECT ItemName FROM AB2 WHERE Options LIKE "%yellow%"

This statement returns all items that have an option of yellow anywhere in the string. Surrounding the search word with the wildcard % accounts for commas and the variable placement of 'yellow' within the option string.

Abbey Someone

You might be thinking that working with AB2 has been relatively painless and that AB2 isn't too bad after all. But let's examine the T-SQL code in Listing 7, which you'd need to populate the drop-down list from AB2. First, compare the code in Listing 7 to the following T-SQL code that I used in the first example with two entities:

SELECT ItemOption FROM options WHERE
  ItemId =  100
ORDER BY ItemOption

The difference is significant. The above simple SELECT statement would perform orders of magnitude faster than the GetAllOptions2 stored procedure from Listing 7. Let's break down the new stored procedure to see what makes it so cumbersome. You need several variables: @length, to hold the length of the options string plus 1; @position, to keep track of which character the SUBSTRING() function is currently examining; @add, which holds the contents of the options string one character at a time; @var, which holds the cumulative single characters that make up a word; and @list, which stores the whole string from the Options column.

The code at callout A in Listing 7 creates a temporary table to hold the options in rows as you build the options from the individual characters. At callout B, the code declares and initializes the variables. It sets @position to 1 and @list to the contents of the Options column in the row that has an ItemId that matches the @ItemId you pass to the stored procedure.

Using a WHILE keyword, the main block of the stored procedure loops until it has read every character from the original string. Callout C shows this first part of the main block, which determines whether the character it's currently analyzing is a comma or whether it's at the end of the string. If either condition is true, the code inserts the current contents of @var into the temporary table. The first character isn't a comma, so the code puts the character into the variable @add, then at callout D sets @var to itself plus the new character.

In this way, the procedure builds the individual options one character at a time. The code then increments the position so it can analyze the next character. The loop continues until it finds a comma, then the code inserts @var into the temporary table and clears that variable out so it can hold the next option. The procedure continues until it reaches the end of the string and each option exists as a row in the temporary table. When you have the options stored as rows in the temporary table, you can use the code that Listing 8 shows to loop through the query and populate the drop-down list.

Nobody's Perfect

The examples in this article demonstrate the benefits of using simple SQL statements and a normalized design instead of costly, time-consuming string operations. Like most database engines, SQL Server isn't optimized for the operations we've just done on the abnormal form tables. Performance suffers noticeably when you're using 1ANF tables like AB1 and AB2. Nonetheless, with the examples I've presented in this article, if you have absolutely no choice, you can effectively deal with some of the problems that tables in the first abnormal form present. Always remember that using 1ANF is very bad design practice. I implore you to never implement such a table, no matter how small you think your application will be. You don't want to fall victim to the headaches that unexpected growth causes.

Hide comments


  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.