Multivalued Attributes

Modeling data isn't always straightforward. Novices tend to cobble a model together without regard to data modeling's finer points—starting with a good definition of requirements. If the designer has inadequately defined the requirements, the resulting model often works poorly—or not at all. Untangling a badly designed model takes a lot of work. You have to evaluate tables that already exist in a database, first by determining whether the tables are normalized. A typical normalization anomaly that stops a design cold is a multivalued attribute—a condition in which a table's column has been assigned multiple values. And finding multivalued attributes embedded in tables as lists or collections can be tricky. As I use the terms, a list or a collection is a set of values stored like a series of simple attributes in a table. The members of the list or collection are treated as separate columns in the table, each with a different column name. For example, you can have a list or collection of phone numbers—home phone, work phone, work phone 2, and fax number. Let's look at how to detect a multivalued attribute embedded in a table and how to correct the condition.

Before I discuss multivalued attributes, I need to point out that this column's data models include acronyms that you aren't accustomed to seeing in my columns. In my new (to me) CASE package—Sybase's PowerDesigner—the data types are generic: A stands for character (or alpha) data type, VA for variable character, BL for Boolean, I for integer, and DT for datetime. The <M> indicator in the far right of each entity box means a "mandatory" attribute—NOT NULL in SQL Server terms. Don't worry: PowerDesigner data types really do convert to the correct SQL Server data types when you generate a physical data model.

Figure 1 shows a model of a simple inventory system for computer hardware and software that any company might use. An employee probably designed the model in a hurry to track installed-hardware inventory and software licenses. The model comprises the SERVER table, which contains data about each computer server; the SOFTWARE_PKG table, which contains data about software packages; an associative table (INSTALLED_SOFTWARE), which cross-references SERVER and SOFTWARE_PKG and represents the many-to-many (M:N) relationship between the two tables; and the VENDOR table, which contains a listing of the software-package vendors and is a parent to SOFTWARE_PKG. VENDOR sustains a one-to-many (1:M) relationship with SOFTWARE_PKG.

For the most part, this model is relatively well normalized and will work adequately for a while. VENDOR, SOFTWARE_PKG, and INSTALLED_SOFTWARE are all in third normal form (3NF). You shouldn't encounter any update, deletion, or modification anomalies when you manage data in those three tables. (For more information about database normalization, see Jeffrey Bane, "The First Abnormal Form," August 2001; Solutions by Design, "Responsible Denormalization," October 2000; and SQL by Design, "Why You Need Database Normalization," March 1999.) However, the SERVER table presents a problem. It looks somewhat like a checklist that contains all the attributes that you need to describe a server configuration, but the table is in a prefirst normal form (1NF). You read that correctly—the SERVER table isn't even in 1NF because it contains a list and a collection.

As I use the terms, a list or a collection is a set of values that have been forced to act like attributes. For example, in the SERVER table, Disk1Capacity doesn't truly describe a server's attribute; Disk1Capacity describes a server component's value. Also, as long as each server contains no more than four disks, you can record the disk capacities. But as soon as you upgrade one server to a fifth disk, you face the problem of how to record the new disk's capacity. Do you add the capacities of the fourth and fifth disks together and store the result as Disk4Capacity? Or do you modify the SERVER table structure, then add a new column called Disk5Capacity to the table? If you add a new column and if the database is running with ANSI NULLS turned on, a report that generates total disk capacity by summing Disk1Capacity through Disk5Capacity produces a NULL result for any server that doesn't contain a fifth disk.

Lists and collections might be wonderful for programming against, but they are miserable for data management. For example, if you arrange a server's attributes as Figure 1 shows—in a checklist—you can write a query that simply selects each attribute by name, without joining the SERVER table to another table. This form of "flat-file database" has its proponents. Data retrievals are fast (you don't have to join tables), and data organization is simple. The problem is that data management is very difficult. Extending the table architecture is awkward and might cause negative repercussions in associated objects such as reports. (You might have to rewrite every report that accesses the table that's involved in the re-architecture.)

How do you identify a list or a collection in a table? I define a list as a set of same-type attribute names that have a progressive counter (e.g., 1, 2, 3) appended to the end of the names. Thus, Disk1, Disk2, and Disk3 constitute a list of physical disks. In Figure 1, Disk1Capacity and its fellows are a variation of this naming convention in which the counters reside in the middle of the attribute name, but the progression—1, 2, 3—is still clearly distinguishable. One word of warning: In confirming your suspicions about a list, you need to verify that the progressive counters designate separate objects instead of breaking up an overly long attribute, such as Address1, Address2.

A list is a multivalued attribute that has been flattened so that the list's members can be stored in the same table. To correct that condition, you need to decompose a multivalued attribute into a separate table, then link the two tables in a 1:M parent-to-child relationship. In my example, after you remove the disk list in Figure 1 from the SERVER table (the parent), you convert the list into a table that represents the set of physical disk devices (the child). Then, you have to identify other attributes that describe a hard disk. As Figure 2 shows, SERVER—the parent table—has a 1:M relationship with DISK_DEVICES—its new child table. In DISK_DEVICES, you can capture not only each disk's capacity but also its physical device number, physical device name, capacity units, and the presence (or absence) of a boot drive. By separating device capacity—an integer data type—from capacity units—a char(2) data type—you can easily run a query that calculates total capacity, grouped by ServerName. Figure 2 shows DISK_DEVICES involved in an additional relationship—the VENDOR-to-DISK_DEVICES 1:M relationship. Because DISK_DEVICES is a separate table, you can associate each disk with its disk vendor.

Now that you know how to define, identify, and decompose a list, you can apply a similar process to a collection. I define a collection as a group of similar (not same-type) objects that describe a parent object. You treat a collection the same way that you handle a list, except that you typically need two tables instead of one to fully express the collection. The SERVER table contains a collection of devices—ProcessorType, Memory, CD_ROM, DVD, FloppyDrive, MouseType, Monitor, and NIC. Figure 3 shows that the collection has been removed from the SERVER table and that COMPONENT, a new table, has appeared in the model. The COMPONENT table contains columns in which you store the type of component (e.g., processor, monitor, mouse) and a brief description of each (Pentium III, flat-screen 15", and trackball optical 1.0 PS/2-/USB-compatible, respectively). Because you relate the COMPONENT table to the VENDOR table, you can show which vendor manufactured which component.

Decomposing the SERVER table to remove the collection of components into the COMPONENT table is the first step in the process. To relate the components to the servers they're part of, you have to create a table called INSTALLED_COMPONENT. The relationship between SERVER and COMPONENT is M:N. You know that the only way to express a M:N relationship is to create an associative table—in this case, INSTALLED_COMPONENT—which Figure 3 shows. Now you can write a query that joins the SERVER, INSTALLED_COMPONENT, and COMPONENT tables to find out which servers have optical trackball mouse devices installed. At last, you've removed the multivalued attributes (disguised as a list and a collection) from the SERVER table.

Simplify Your Data Management

Finding multivalued attributes that are embedded in tables as lists or collections can be a challenge. You have to know what the attribute name implies before you can definitely say that a set of attributes constitutes a list or a collection. After you've made that determination, you need to decompose the multivalued attribute into separate tables and relate those tables back to the parent table from which you removed the attribute. Although writing some queries might be easier against the simpler, "flattened" model, you'll be better able to manage your data and control data integrity if you properly decompose these multivalued attributes.

Comments

Plain text