The Trouble with Type Tables

If you're a fan of the classic Star Trek series, you might remember the episode "The Trouble with Tribbles." Although type tables aren't nearly as troublesome as were the Tribbles, they can present many problems for developers. So, in an effort to spread the word about the importance of creating type tables and how to set them up correctly, I'll discuss:

What type tables are
Why type tables aren't optional in databases
Typical design patterns of type tables
The logistics behind migrating primary keys

What Type Tables Are

In the relational world, it's typically accepted that most database designs use the third normal form (3NF). Without a long discussion of the practices and principles of data modeling, the long and short of data modeling is to put data only where it belongs. A type table is nothing other than a normalized database table, or artifact, that provides two primary pieces of data: a unique ID and a description attached to a unique ID. The reason you use type tables is to avoid embedding lookup data in the application itself.

In my role as database architect and technical evangelist, I've waged a full campaign with developers to get them to use type tables. This has been an educational process, and through it all, I've heard some pretty remarkable thoughts on the topic. The relational model of database design makes heavy use of type tables. The dimensional model also uses type tables, although they're referred to as dimensions in the OLAP modeling world. A type table is essentially a self-documenting construct. Put simply, type tables hold the domain of values for a foreign key column in a table.

Why Type Tables Aren't Optional

Type tables aren't optional in databases. Why should you clutter your database with dozens, even hundreds, of type tables? Here are the most important reasons.

Type tables provide automatic database documentation. When a column has a hard foreign key constraint back to the referenced table, it's simply a matter of looking in the referenced table to determine the valid domain of values for the column.

Type tables prevent bad data from corrupting your database. I've worked on several large enterprise systems that not only had no data dictionaries to define the tables and columns in the source database but also had no type tables. If a column in a table has no type table attached to it and it isn't clear where this data comes from in the source application, you'll have a mess on your hands. You can't determine what ID "A" means any more than you can determine what ID "0XH0075" means. This is always a recipe for problems.

If you work with reporting and dimensional modeling, you'll fully understand the implications of what this means. For example, if type tables are omitted from an OLTP database and you, as the developer, add new values to the database, the reporting process might break because it has no knowledge of the values you just entered. Therefore, the data has now been corrupted.

The absence of type tables in databases that feed an operational data store or an enterprise data warehouse creates problems for all developers. Type tables make extraction, transformation, and loading (ETL) processes much easier to manage. If you don't add type tables, ETL developers will need to translate your codes into something they can actually use.

Here's how this process typically works: During the ETL development phase, there will be columns with codes in them that need to be translated so that the data can be reportable. There's simply no way around this. If there are no type tables, the ETL developers must either perform the translation in the ETL code or create mock type tables in the data mart or data warehouse to support the raw data. Performing the translation in the ETL code is a very poor design practice, so ETL developers typically create mock type tables. These mock type tables can create further problems. If discrepant data (i.e., data with no matching values in the mock type tables) is brought over as part of the ETL process, it can't be added to the data mart or data warehouse.

The absence of type tables in databases can create problems for customers. Many of the developers I work with are entirely (and in some cases blissfully) unaware that the database with which they're working is replicated, in whole or in part, to the customer's enterprise data warehouse. Missing type tables create problems with customers' ETL processes for the same reasons just stated. Although adding a type table requires some additional coding time and effort, it can save many hours of debugging and rework in the downstream processes that consume the data.

Type tables can prevent other developers from corrupting your data. If you work on a large enterprise team, many developers will work on different aspects of the application and database code -- and they might work on code that you wrote. Because of this, they can inadvertently introduce corruption into the data. For example, suppose another developer puts an invalid value (an empty string) in a column. If there are no type tables, the invalid value will corrupt your data. But if you have a type table, the first time the other developer runs the code, the code will break. At that point, it isn't your problem to resolve. It's the other developer's problem.

Modifying an application's behavior can be as simple as a one-line INSERT statement. No matter whether an application is written in Visual Basic .NET, Active Server Pages (ASP), Java, or any other language, embedding lookups and data lists in an application creates problems because you're putting that data into the application and not the database. When you work in an enterprise design environment, this is even more crucial, because an application might be rolled out across a thousand desktops. Which would you rather do -- add one row to a table or push out an update to a thousand desktops?

Typical Design Patterns of Type Tables

If you're unfamiliar with type tables, you might be wondering how to design them. To begin, there's one school of thought that says the primary key of a type table should contain only numeric values and another that says the primary key can contain character values. After considering the issues on both sides of the fence, I've reached the conclusion that both approaches are valid. What's most important is that the primary key is sufficient to hold all the necessary values (current and future) that a type table might need.

Typically, developers create type tables using character primary keys that are two or three bytes long. Given the size and horsepower of the machines on which most databases run, the overhead of char(2) or char(3) primary keys is irrelevant in the overall design of the database. Here are some other database design standards that developers follow when creating type tables.

Consistent naming convention for type tables. When naming type tables, it's important to pick a naming convention and stay with it. Many developers (including me), follow the standard of using a singular name for the type table (e.g., COLOR_TYPE instead of COLOR_TYPES) because logical modeling follows that convention.

A unique primary key named ID. It doesn't matter if the data consists of AA to ZZ or 111 to 999 -- there are no exceptions to the rule that the primary key must be named ID. This naming practice is not only simple and effective but also helpful when writing re-usable classes. If you always use the generic name of ID, you can write a re-usable class and pass it the name of the type table, without having to worry about what the type table's primary key is named.

A description column called, appropriately enough, Description. The column containing the textual description attached to each primary key must be named Description. Once again, this is a simple and effective naming practice that works well when writing re-usable classes.

If appropriate, a column that lets operators enter a unique, system-specific identifier that isn't part of the primary key. This optional column can help you avoid the situation in which operators enter their own primary key values. They should never be allowed to define primary key values in the database because it opens the door to a host of problems. For example, user-defined key values can get propagated by the application without any hard foreign key constraints. If the user later changes the original value, he or she might inadvertently and unknowingly orphan data elsewhere in the application.

If appropriate, auditing columns for the date and time a record was last updated and the ID of the person who made the change. Including auditing columns might be overkill in some systems, so these columns are optional. However, before you write them off, consider the following situation. Let's say your company offers mortgage loans. Within the mortgage loan system, you maintain a type table of mortgage rates. Although not everyone in your company can modify this data, there are a handful of people who can do so.

Suppose that someone either accidentally or maliciously changes the interest rates from 6.5 percent to 1.5 percent in the type table. This change isn't caught for some period of time, which is disastrous for your company because a lot of mortgages were written for 1.5 percent. If you have auditing columns in the type table, you might be able to determine who changed the data and when the change was made. If you haven't captured this type of auditing data, there are few ways to determine who made this change and when.

Although auditing columns don't speak to the issues of a malicious attack by a rogue DBA (and that does happen) or changes made to the application (and not the type table), they can provide a starting point for your forensic research. In addition, many state and federal regulations, such as Sarbanes-Oxley (SOX), require that you have auditing information like this.

No embedded intelligence in the primary key. I once had a conversation with another architect about specifying a user-defined sort order for some type table data. This architect had proposed that we use a three-byte alphanumeric primary key (e.g., A01, A02, A03), in which the numeric value would be used for ordering and sorting. However, embedding this type of intelligence defeats the purpose of a proper data model, which is to put data where it belongs. In this case, the correct design is to leave the primary key alone and put the sort order in a separate column. This way, there's no ambiguity about what the data means.

The Logistics Behind Migrating Primary Keys

If you're unfamiliar with data modeling, you might be wondering how the primary key in a type table translates to a specific column name in another database table. When you migrate the primary key of a type table to another table, the column assumes what is known as a role name in the table to which it's being migrated. The purpose of a role name is to allow a column in one table to have a different name in another table.

For example, suppose that the primary key named ID in the type table named MY_TYPE_TABLE is migrated to the column named color_type_id in the table named MY_TABLE. In this instance, the role name is color_type_id. Although this column has a different name, it still points back to the original type table's ID column. This relationship is enforced through the defined hard foreign key constraint between the two tables.

In a hard foreign key relationship, the type table is called the referenced table. The table to which the type table's primary key is migrated is called the referencing table. This is the correct way to refer to tables that are related to each other in a foreign key constraint when using data modeling terminology. In other terms, the parent table is the referenced table, and the child table is the referencing table.

It isn't mandatory that a migrated primary key assume a role name in the referencing table. For example, you can have the ID column in MY_TYPE_TABLE migrate to a column named ID in MYTABLE. As long as you don't have another column named ID in MYTABLE, you won't have any problems.

The True Intent of Type Tables

Type tables aren't intended to drive your developers crazy. Instead, they're intended to ensure that your database design is understandable and documented. By correctly implementing a logical design, these crucial tables become quite significant in your database design and architecture.

Comments

Plain text