In the first column I wrote for SQL Server Magazine (SQL by Design, "Why You Need Database Normalization," March 1999), I set out some rules to live by for database design. Recently, I wrote a six-part series about data modeling in which I revisited the topic of data normalization. While developing the sample SQLmag database in these articles, I showed how to normalize data. (See "The SQLmag Database DDL Script File," for information about how to get a copy of the Data Definition Language—DDL—file that created this database.)
When you normalize a database, you end up with a set of tables, each of which describes or is concerned with a limited part of the schema. In normalizing a database, you're trying to do several things. First, you're trying to arrange data into logical groupings or subschemas so that each group describes some small part of the whole story. Second, you're trying to ensure that each column of each table clearly defines and describes its data content and that you leave no ambiguity about the data you're storing. Third, you're trying to minimize the amount of duplicate data that could potentially be stored in a database. Also, you're trying to organize the data so that when you need to modify it, you make the change in only one place. And finally, you're trying to build a database that functions quickly and efficiently without compromising the integrity of the data in storage.
That said, sometimes deviating from good normalization practices seems logical or practical. Well-normalized databases are complex to program against. Application programmers who are charged with creating user front ends usually have to use multitable joins, subqueries, or views as data sources. This method can make the application run so slowly as to be unusable. Also, a well-normalized database is so complex and so fragmented into data groups that data readers often get confused and disoriented while trying to find information.
Under certain circumstances, you can denormalize your database. Keep in mind, however, that every time you do so, you pay a price. The cost might be in lost flexibility, in future scalability, in performance, or in data integrity. You might find that by denormalizing, you have much more data redundancy in the database; when modifying data, you'll have to manage this redundancy through program code, either at the user interface (UI) or by using triggers. Let's look at some examples of these conditions.
Figure 1 is a variation of part of the SQLmag physical model. In this figure, the Person table contains information about each person, and the CommDevice table contains information about how to contact that person. You can associate each Person record with an unlimited number of related CommDevice records. The attribute CommType might hold values such as home phone, work phone, cell phone, home email, work email, or home page. Each corresponding CommValue would be the string you use to implement the CommType (e.g., a phone number, an email address, a URL to the home page). This setup is a typical one-to-many relationship; for each record on the one side, you can create an unlimited number of records on the many side. This design is normalized, and although normalization is the proper way to organize data, your data users might want to be able to get all the information about a person, including all phone numbers and email addresses, from one record. You could decide to denormalize these two tables, combining the attributes into one table, as Figure 2 shows.
Figure 2, a denormalized variant of Figure 1, is the design you'll find in the SQLmag sample database. The figure contains all the attributes from the Person table but only two methods of contacting each person—a phone number and an email address. Every time you denormalize two tables that are in a one-to-many relationship, you have to limit the number of "many" conditions to represent in the denormalized table. The table in this figure provides one phone number and one email address without properly identifying them; you have no idea whether that phone number refers to home phone, work phone, cell phone, or fax line. You can add an attribute called PhoneType that refers to the Phone column to let you designate which kind of phone number you're recording. You have a similar situation with email addresses; you don't know whether the email address you're capturing in a record is a home email, a work email, or a special-use email address. Again, you could add another attribute called EmailType, which would let you define the type of email address as you add the record. But what happens when you need to store a second phone number for a person? You could add two new fields, Phone2 and PhoneType2. If you need to gather and store more contact information, you could add fields Phone3, PhoneType3, Phone4, PhoneType4, Email2, EmailType2, Email3, EmailType3, and so on. These additions create an embedded array of phone numbers and email addresses in the Person table. Unfortunately, this array complicates the querying and data-retrieval process. By denormalizing, you sacrifice the flexibility and extensibility that the first table architecture (Figure 1) contained.
You can handle this situation in another way, depending on whether you'll need to add another phone number or email address rarely or frequently. If rarely, you can add another attribute called Comments and record additional phone numbers or email addresses in this field. If multiple phone numbers or email addresses will be common, you might want to change the Phone and Email attribute names to be more descriptive, as Figure 3 shows. In this version of the Person table, you have the capability to store HomePhone, WorkPhone, CellPhone, FaxPhone, HomeEmail, and WorkEmail. I've even added an attribute called Comments in which you can store additional phone numbers that don't fit anywhere else or add commentary about any of the phone numbers or email addresses in the record.
Will you ever have a problem managing the data in this denormalized table? Let me count the ways.
- Will you need to record four phone numbers and two email addresses for each person in the database? If not, you'll have a lot of wasted storage, and you'll have to deal with NULL when programming and retrieving data.
- How can you tell which is the day phone and which is the evening phone? Some people work at night and don't want calls during the day.
- Is the fax machine at the person's home or at the office? What do you do if a person has both and wants both recorded in the database? Do you record the second in the Comments column?
- What if a person insists on having three email addresses stored in the database? Do you use the excuse that "the database won't let me do it?"
- Where would you store a person's digital pager number? In the Comments column? Storing phone data in a column labeled "Comments" is a misuse of the Comments field. And according to the definition I gave earlier, each column of each table must clearly define and describe its data content, leaving no ambiguity about the data you're storing.
- When you get a request to change someone's phone number to (505) 111-5678, how do you know which number to change? You could easily overwrite the wrong phone number and lose valid data.
Each of the points in this list of problems could result in troublesome storage situations for the table in Figure 3. None of them would be a problem if you were using the table architecture that Figure 1 illustrates. Besides trying to fit square pegs into round holes, you'd have to rely on the column header values (e.g., "WorkPhone") to identify the data in each column. Relying on column header values is always risky and isn't a good design standard because you take the chance of losing these column header values when you export or import data from one system to another. Look at the set of phone number attributes in Figure 3, and imagine real data in the table. Then, imagine losing the column headers. Now, imagine yourself trying to figure out which column contains the fax phone number.
Most DBAs talk about the performance penalties they incur in a normalized database when they have to join tables and create views for reporting. In fact, one of the most popular arguments in favor of denormalizing is to avoid the performance problems that multitable joins cause. But don't forget about the performance penalty caused by the database management system (DBMS) reading fewer records into memory per I/O operation because you've denormalized the data. Let's say that you want to generate a listing of first name, last name, and hire date for everyone in the Person table. The blocking factor for the Figure 1 table architecture (tblPerson only) is approximately 25 records. (See "The Physical Design," August 2000, for instructions on calculating the blocking factor.) This blocking factor value means that each read or write operation will include 25 Person records, which is good because this listing requires a sequential operation with some filtering. The blocking factor for the Figure 3 version of the table is only 15 records per block, or 40 percent fewer records than SQL Server could read with the Figure 1 table architecture. The extra bytes that the Figure 3 table structure is hauling around with each physical I/O could cause a significant decrease in your database server performance. Of course, you could make all the phone fields variable character, which would minimize the blocking factor difference and thus tip the balance in favor of the denormalized design. You'll need to consider all these facts when you're thinking about denormalizing.
Now that I've (again) knocked the idea of denormalization, let me say that sometimes you really can denormalize and encounter only limited negative effects. Figure 4 is from a part of the SQLmag physical model that better illustrates my next point. This figure is a modification of the one-to-many relationship Pub_Category to Publication. The table Pub_Category is a lookup table containing a listing of the valid categories that can identify the various publications listed in the Publication table. In the original data model, the foreign key in the table Publication, which connected the two tables, was the primary key of Pub_Category (PubCatID). In the denormalized version, PubCatDscr (the category name) is the foreign key.
This kind of denormalization doesn't necessarily reduce the number of tables in the database. The lookup table candidate key (PubCatDscr), rather than the primary key, is the foreign key link. The lookup table, which before denormalization contained the only list of publication categories in the database, is now valuable only as a way of enforcing domain integrity for the publication category.
This kind of denormalization carries minor penalties. The byte count for the denormalized table might be higher than that for the normalized table, so you might read one or two fewer records per block. If you modify a PubCatDscr value in the lookup table (tblPub_Category), you'll need to remember to modify the same value in tblPublication (you can code a trigger that cascades the change). The gains usually outweigh the losses in this kind of denormalization. Every Publication record carries in itself a description of its publication category type instead of just an identifier. You no longer need to join the two tables to find out what category a publication belongs to; the Publication record contains that information.
Data redundancy, or duplication of data (with accompanying data-management complications), will always be part of a denormalized design. The examples I've used for this article minimize the data redundancy problem, which is what you want to aim for when you denormalize. "Why You Need Database Normalization" shows just how great the levels of data redundancy can be in a denormalized database.
Nontechnical data users need to see data in more of a flat-file format than a set of well-normalized tables can offer. Specifically, they usually want to see information about a person as a single record that contains everything—name, address, and all phone numbers and email addresses. You can use views and stored procedures to hide some of this normalization complexity. The drawback is that this approach restricts the data users to accessing data only through these views and stored procedures; they generally won't be able to read the data directly from the tables. If that approach will work for your situation, you'll need to create a set of views and stored procedures and teach your data users how to use them for data access instead of directly reading the tables. In past columns ("Views and Stored Procedures," September 1999 and "A Data Access Solution," February 2000), I've talked about schemes to use views and stored procedures for data access. I've found in my client sites that these schemes work well; if they have a downside, it's that my customers keep requesting new ways to view the data.
In summary, denormalize only if you can demonstrate a performance gain (usually by not needing to perform multitable joins for each query) and only if losses in data integrity don't offset the performance gain. Data user convenience alone is never a justification for denormalizing. You can give users what they want by using virtual denormalization techniques, such as creating views and stored procedures.