Lookup tables provide the essential function of helping you maintain data integrity in your database environment. For example, if you have users entering their gender into a data item, the table that contains the Gender item can reference a lookup table to verify that only the value M or F is used.
Most database environments have one lookup table to support each domain—a simple, scalable model. But recently I've read some articles proposing a new approach to lookup tables. Instead of one lookup table to support each domain, some database designers are advancing the idea of a single large lookup table that contains all valid values for every domain. The table uses some sort of indicator column so you know when and where to use the value in a column. Readers have asked me whether the one-table approach is better than the many-table technique. Let's take a look at how lookup tables function and why, in this case, breaking the rules of database design isn't a good idea.
First, let's review how you can use lookup tables throughout the database to restrict data entry while enforcing domain integrity. One of the most common uses for a lookup table is verifying the two-character state code. The entity relationship diagram (ERD) in Figure 1 shows a frequently used two-column version of the State lookup table. The first column in the State table is the two-character StateCode; the second column is StateLongName. While this is a good way to avoid mistakes on data entry ("Let's see, is Alaska AL or AK?"), it's also a way to enforce domain integrity. The domain of StateCode is the set of 50 two-character state abbreviations recognized by the US Postal Service. The domain of StateLongName is the 50 state names, from Alabama to Wyoming. No other combination of letters, numbers, and special characters is valid for state name.
Another name I use for lookup tables is reference tables because their data is referenced by columns in other tables. The traditional one-to-many relationship of the parent reference table, State, to the child table, Address, is also shown in Figure 1. Figure 2 shows the Joins tab of the Reference Properties window; you can see that the Address table's StateCode column is constrained by the set of values in the State table's StateCode column. So, for example, if you want to enter the value AK into the Address table's State-Code column, the value AK must already exist in the State lookup table's StateCode column. That's how a lookup or reference table works. Selecting the Mandatory parent check box on the Integrity tab, as Figure 3, shows, ensures that this relationship is enforced within the database.
Typically, more than one table in the database uses the same lookup table. In addition to the Address table, any other table that has a column containing the two-character state code could reference the State lookup table. For example, say this database also has a ShippingRate table that contains rates by state. If the ShippingRate table has a mandatory relationship with the State table, values in the State column of the ShippingRate table are constrained by the set of valid values for the StateCode column of the State lookup table.
Breaking the Rules
Now back to the original question of whether a one-table approach is better than using separate tables for each domain. While the single-table design might be a clever solution, it's not one that I would use for the simple reason that I like to design scalability into my databases. I much prefer individual lookup tables that enforce integrity in each distinct domain. By dumping all lookup values into one table, then adding a column (or using some other device) to decipher what the individual lookup values refer to, you violate three rules of designing for performance and data integrity:
- Always normalize your database.
- Design for performance and scalability.
- Simplify data management.
The following problems can arise when you break these rules.
You denormalize the lookup table. Remember that by the rule of normalization, each table contains data about one thing (one entity) and one thing only. While you might argue that a lookup table is only about reference data, and thus contains only one type of data, I think this is one level of abstraction too high for the real world. In the traditional approach, the data in a lookup table references only gender, only states, only room types, or whatever might be in a list of related items that you use to maintain the domain integrity of your database.
You create a performance bottleneck and limit scalability. Putting all lookup values, indicators, and codes into the same table requires every application that uses a drop-down or list box or that does a compare against the lookup table to access the same lookup table. This approach creates a "hot spot" on the disk on which the lookup table is located. That hot spot results in higher levels of locking and blocking, increased wait times, decreased performance, and an inability to scale beyond a certain point. "But," you say, "I'm running SQL Server 2005, and I can horizontally partition this large lookup table." However, SQL Server 2005 horizontal partitioning is meant for very large data sets. On smaller data sets (an all-in-one lookup table would be considered a small data set), horizontal partitioning might even be counterproductive. (I can't say for sure, because I haven't run any SQL Server 2005 partitioning benchmark trials yet.) Also partitioning isn't an option unless you're running SQL Server 2005 Enterprise Edition.
If you run SQL Server 2000, the best approach might be to cluster the all-in-one lookup table, but what criteria will you use to determine the optimal clustering index? You can cluster traditional lookup tables by the code name, by the long name, or by some third column of data, such as a "sort by" column. But if all lookup values for all domains are together in one table, how will you choose the optimal cluster key?
You obfuscate the meaning of the data. Combining all domain values into one table makes data management more difficult. Over time, the set of values within most lookup tables tends to change. You also find the occasional odd case in which one code value in the lookup table is used for more than one category in the database. So how do you handle a situation in which the code becomes obsolete for one category of data but not for the others? If you remove the code from the lookup table, it's not available to other active data sets. If you indicate that a code is obsolete for table A, there's confusion in the other data sets in which the code is still valid. But if you don't invalidate the code for data set A, you run the risk of an invalid code creeping into the database. This is what's called a no-win situation.
Designing for Flexibility
Rather than combining all domain categories and codes into one table, I prefer to give each domain its own table defining the valid set of values. I might have lookup tables with names such as State, Region, DocumentType, CustomerType, and ProductCategory. Here is where I deviate from a rule I usually follow, which is to use identity columns as the primary key. Instead, I use alpha codes as the primary key to designate the unique values of the data in each lookup table. The alpha codes are derived from the long names of the domain values, in much the same way as CO is derived from Colorado, KS from Kansas, and AK from Alaska. Lookup tables are good candidates for using a natural or pseudo-natural primary-key value. For each lookup table, I create short, fixed-length alpha codes to represent each record in the table.
Typically, the set of values in a lookup table is highly stable. You might add to the set of values over time, as when the United States added Alaska and Hawaii as states. Some values might become obsolete, but you typically don't want to delete them because they're part of history. If you use the Delete constraint, Restrict option, as Figure 3 shows, to enforce the relationships, you can't delete obsolete values. Instead, you mark them as obsolete so they won't be used anymore.
The Choice Is Yours
Lookup tables visibly enforce domain integrity in a database. When used properly, they help ensure that the data coming into the database is clean. It's not unusual for lookup tables to comprise 30 percent of the tables in your database, so you want to design them in a way that enhances performance and scalability. For that reason, the single-lookup-table scheme isn't one I would willingly use.
Now you know my opinion. Which direction are you leaning?
Michelle A. Poolet ([email protected]) is cofounder of Mount Vernon Data Systems and teaches database design and programming. She has a Masters degree in Computer Information Systems and is an MCP.