Redesigning a Catalog

Many developers create databases to manage a limited number of products for a limited business need. You might want to keep track of supplies in the office-supply cabinet, for example, or the total value of your office furniture. However, as this body of information grows to support other business needs, the simple inventory design gets more complicated. And managing the data can grow cumbersome and inefficient if you don't handle the added complexity correctly.

A catalog database might be the most complicated database design on the planet. Part of this complexity comes from the diversity of catalog businesses. My personal experience working with office-supply companies, grocery chains, and retail clothing stores has taught me that no one template can fit all purposes. But the easiest way to explain how to normalize a complex model such as a catalog is to work through an example. So let's look at a simplified solution that has room for growth and expansion and that you can adapt to various needs. I won't try to address the user-interface problems that surround catalog order entry. An order-entry application is only as good as its foundation—the database. If the database is poorly designed and can't accommodate the limitations of the client interface, you can count on continual problems managing the data.

A Problematic Design

Todd, a database developer, needs a well-normalized database to support products in a clothing catalog. An article of clothing in the catalog might be priced the same, be made of the same type of cloth, and come in the same colors as other pieces in the catalog. However, a pair of slacks might come in various waist sizes and lengths for men (e.g., 34/36, 36/38, 38/40) and in various numbered sizes (e.g., 4, 6, 8, 10, 12, 14) and generalized lengths (e.g., petite, regular, long) for women. Todd wants to be able to search the catalog database for all items of a particular color or material type and have the query run efficiently, hitting as few tables as possible.

Table 1 shows the kind of data Todd is dealing with. I show only 20 rows in this example table; an actual product table in a catalog database might contain hundreds of thousands or even millions of rows. Notice the way Table 1 is organized—one row for each variation of each product. The amount of redundant data in the table is staggering. Besides having to deal with the redundancy and resulting waste of hard-disk space, Todd can anticipate running into data-integrity problems during updates and deletes—such problems are always associated with a table in second normal form (2NF). (If you're not familiar with normal forms, you can review the basics in "Why You Need Database Normalization," March 1999, InstantDoc ID 4887.)

Table 1 is in first normal form (1NF) because the data is atomic—one value per cell, based on the meanings that the column headings imply—and because no arrays of values (also called multivalued attributes) are embedded in the table. The table is also in 2NF because each attribute is fully functionally dependent on the primary key, which is the product code. In this table, the product code defines each item in the catalog. The table isn't in third normal form (3NF) because it contains a transitive dependency—a combination of the product description, size and color, and possibly unit price determines the reorder point for each item. We don't know enough about Todd's company policy to determine whether unit price has any bearing on the reorder point for each item. This is a problem, because until we find out how unit price and reorder point interact, we can't fully normalize this table.

One important note about catalog inventory: In a typical clothing catalog, a single catalog number describes a set of products. For example, in Table 1, catalog number PP44888 describes a collection of lady's crewneck sweaters that come in a variety of colors and sizes. Each size-and-color combination is identified by a unique value—the product code—for inventory control. It's possible that a garment manufacturer in Chile makes the pink and white sweaters and a factory in Sri Lanka makes the blue and lavender sweaters. The catalog number is what customers see and use to place orders; the product code is what Todd's company uses to manage inventory.

Table 1 is also badly designed. The column headings are too general, often forcing Todd to string together two or three different attributes in a single column. For example, the column Product Description contains the product category, product subcategory, gender, and material type for each item. A better way to represent catalog items is to design a set of tables that lets you represent each attribute clearly and concisely.

A Workable Solution

Figure 1 shows a conceptual model—an entity-relationship diagram (ERD) of a proposed redesign of Table 1. Sets of products make up the CATALOG entity. Each PRODUCT is defined by attributes in the PRODUCT DETAILS table, is produced by a MANUFACTURER, and belongs to a single PRODUCT CATEGORY. The one-to-many (1:M) relationship between PRODUCT CATEGORY and PRODUCT is simplified; it lets Todd classify product PP44888 under only one product category such as sweater. If this restriction is too limiting, Todd can convert the relationship between PRODUCT CATEGORY and PRODUCT to a many-to-many (M:N) relationship. He can then classify product PP44888 as both sweater and outerwear—and as many other categories as he wants.

I used a 1:M relationship between MANUFACTURER and PRODUCT because for inventory control, Todd needs to track which company produces which product. For example, if two different companies manufacture the same sweater, Todd's inventory records it as two different products. Now Todd can extend this basic database design to track pricing and delivery times because he knows which company produced which lot of lady's pale blue crewneck sweaters. Todd's catalog company probably operates on a small profit margin, so tracking such details is crucial to the success of his business.

The PRODUCT DETAIL entity can represent any number of characteristics about a product. Characteristics such as size, measurement, color, material, and gender are part of the PRODUCT ATTRIBUTE entity. The ERD isn't intuitive because we're used to looking at flat-file representations of catalog data like Table 1 shows. To give you a better idea of what table structures resulting from this ERD would look like, I've simulated a set of populated tables, which Figure 2 shows.

The two most important tables in this example are PRODUCT and PRODUCT DETAIL. The PRODUCT table contains the product code, which Todd assigns as an internal tracking number for inventory control; the catalog number, which identifies each product as part of the set of products that the catalog number represents; the product category, which is essential for organizing the product inventory for a catalog layout; the manufacturer code, which lets Todd easily determine which products come from which manufacturers; the manufacturer's product code, which Todd needs so that he can talk to the manufacturer about a product; and the reorder point, which differs for each product (even though all the products in this example table are sweaters).

The PRODUCT DETAIL table will have the greatest number of rows in Todd's database. Each product attribute appears as a separate row in the table. The pink sweater (the first row in Table 1) is now represented by one row in the PRODUCT table and eight associated rows in the PRODUCT DETAIL table because it has eight characteristics that describe it. Despite what seems to be a one-for-nine trade-off, the redesign is a viable alternative because of the huge reduction in data redundancy in the overall table design and a two-thirds reduction in number of bytes of storage space. (In Table 1, you need 4848 bytes to store data for just the lady's crewneck sweater, catalog number PP44888, in four colors and four sizes—not counting SQL Server file system overhead. You need only 1445 bytes to store the same information as Figure 2 shows.) Plus, even though Todd has to join tables when he searches for all items of a particular color or material type, he'll get better performance because the rows in the PRODUCT DETAIL table are shorter than the rows in Table 1, thus maximizing the efficiency of each I/O operation. For example, if Todd wanted to find all products that come in any shade of pink, he could write a simple query such as

SELECT ProductCode, ProductCategory, CatalogNumber
FROM PRODUCT
WHERE ProductCode in 
      (SELECT ProductCode
      FROM ProductDetail
      WHERE ProductDetailValue like '%Pink%')

Todd can optimize the JOIN operation by creating indexed views and striping the data across multiple spindles. Todd now has the core of a solution to his problem. He'll have to make adjustments for requirements that we haven't discussed, but he has a foundation to build on. This solution is well normalized and minimizes the nonkey data redundancy that was prevalent in the old design. Todd can search the tables for any product attribute with a minimum of coding and by joining just two or three tables. And because the width of these tables is limited, Todd might be able to create covering indexes on them, enhancing data retrieval even more. The new table structure is more efficient for managing and updating data and performs better than the old table. Todd has a workable solution that will serve the company even as it grows.

Comments

Plain text