How Many's Too Many?

Downloads
38656.zip

Were the job of the modern database administrator and developer limited to coding SQL and ensuring good backups, we'd probably all sleep better and have more time to pursue such satisfying hobbies as Thursday night karaoke or hacking the OS of our Furbies. But alas, we must also implement efficient databases in the first place. This task is one of the least enviable in the database world because in no place can you do more good or cause more harm than in designing and implementing a database.

Though the task is daunting, time and experience bring a level of comfort. And you reach a point in your database-development learning curve when you're comfortable implementing many-to-many (M:N) relationships between tables. Sometimes you get so comfortable, in fact, that you overuse these relationships.

Although M:N relationships might intimidate you as a novice database developer, once you've worked with them a few times, they're relatively simple to identify, design, and implement. Usually at this point of mastery, the learning curve hits a brick wall. Few database developers and designers move beyond the "Big 3" relationships—one to one (1:1), one to many (1:M), and M:N—to discover the other types of relationships that can exist in a relational schema. People rarely explore, let alone master, the more obscure relationship types such as tertiary or bill of materials. However, these relationships are simply extensions of the three relationship types you've become so comfortable with. For example, a bill-of-materials relationship is nothing more than an entity that has M:N relationships with itself. In this relationship, a parts entity is made up of other parts, which in turn are made up of—you guessed it—yet more parts. But if you understand M:N relationships, understanding the bill-of-materials relationship isn't much of a stretch.

These less-common relationship types needn't be a mystery to your database schemas. To dispel some of the mystery surrounding these relationships, let's look at the underused and often improperly implemented supertype-subtype relationship, also known as a superclass-subclass relationship. If you've done any object-oriented development, you're already familiar with this relationship, in which several entities share some, but not all, attributes.

Note that in this article, I mainly cover the physical implementation of a supertype-subtype relationship, explaining the rationale for implementing this type of relationship and showing the dramatic performance gains that you can realize. For an in-depth discussion of modeling supertype-subtype relationships at the logical level, see Michelle Poolet's classic article "Supertypes and Subtypes," May 1999, InstantDoc ID 5226.

Rev Up Your Schema

You can see an example of the supertype-subtype relationship in various modes of transportation. Cars, trucks, and motorcycles are all types of vehicles—thus, cars, trucks, and motorcycles are all subtypes of the vehicle supertype. All types of vehicles possess certain attributes such as price, color, curb weight, and so on. However, other attributes, such as bed length, apply only to pickup trucks. Similarly, towing capacity is irrelevant to a motorcycle but a necessary attribute of cars and trucks. Using this example, Figure 1 shows a typical implementation of the supertype-subtype relationship.

In the schema that Figure 1 shows, I've represented the supertype and each subtype as a separate table, resulting in four database tables. The supertype table contains all the columns common to the subtypes, and each subtype table contains only columns specific to that type of vehicle. The NumberOfDoors column, for example, exists only in the Cars and Trucks subtype tables, while the Price column, which applies to all vehicle types, is in the Vehicles supertype table. The subtype tables also share a primary key with the supertype table. For example, in the Vehicles table, if a record with a primary key value of 10 is a truck, the associated record in the Trucks table also has a primary key value of 10. While the relationships between a supertype table and subtype tables aren't always 1:1 in a supertype-subtype implementation, in this example, each vehicle will always be of only one type, creating three 1:1 relationships.

Note that the Vehicles supertype table has a column named Type. This column is a subtype discriminator. It eliminates the need to perform an existence check on the subtype tables in order to retrieve detail information. If the Type column didn't exist, you'd need to check all three subtype tables to locate the record that has the same primary key as the supertype.

At first, you might wonder, why bother with this design at all? Why not just put all vehicle types in one table? Figure 2 shows a one-table implementation of the information from Figure 1's schema. Although you might save a table join by designing the database this way, this table isn't the most efficient from a relational standpoint. The NumberOfDoors field has no meaning for a motorcycle, and although motorcycle shoppers are certainly concerned with seat height, this information is irrelevant to a car. Nor have I ever seen a motorcycle with a truck bed. Consequently, you introduce lots of nulls into the table. Just imagine a supertype-subtype relationship that has 30, 40, or even 100 subtypes that hold 1000 not-in-common attributes. The inefficient storage and page splits you'd cause by having a table with 1000 or so columns could easily cancel out the benefit of saving a table join, especially if the not-in-common attributes use long character fields. Although I encourage you to consider all implementation possibilities, for this example, let's work with the four-table schema.

The ManyToMany Database Schema

Now that we've examined the supertype-subtype relationship, let's see an example of the performance benefits it offers over multiple M:N relationships. The examples in this article use two databases, ManyToMany and SuperSub. You can download CreateAndPopulateDBs.sql, which contains the code to create these databases and the article's stored procedures, at InstantDoc ID 38656. The examples work in both SQL Server 2000 and 7.0, and I've indexed the tables in both databases so that all join columns have an index on them. Also, to ensure that the indexing strategy doesn't affect the results, I avoided filtering result sets in the stored procedures other than on the primary key columns.

The example deals with a fictitious company that leases equipment to other businesses. This leasing company stocks various kinds of trucks, office furniture, and computers. In implementing the database schema for the company's main application, the designer used several M:N relationships, as Figure 3 shows. In this ManyToMany database schema, the Leases table is related through three M:N association tables to tables containing each type of equipment that customers can lease. Each association table records the item being leased and the quantity of the item being leased. (I omitted the Customer table for clarity.)

In a sense, nothing's wrong with this schema. It contains no major normal-form violations. (You could argue that Salesperson in the Leases table might be multivalued, that a truck could be painted two colors, or that in the future, furniture might have no weight; but let's stay focused.) Although the ManyToMany database schema serves its purpose, let's look at some common ways users might query this database and the subsequent performance results.

Listing 1 shows the T-SQL code that creates the stored procedure GetInvoiceDetails, which users execute frequently against the ManyToMany database. The GetInvoiceDetails procedure contains four blocks of code that return details about the equipment on a given lease as well as the grand total of the monthly rate billed for that lease. With the M:N schema, you need to use UNION to join three separate queries to return lease- detail information, as the first three blocks of the stored procedure show. Notice that each query adds a column called ITEM to identify what type of item is being leased. The fourth block retrieves the total monthly amount billed on each invoice by summing the unioned line-item total columns to create a super-aggregated amount. Using a leasenumber of 1, let's execute this query against the ManyToMany database:

USE ManyToMany
GO
EXEC GetInvoiceDetails 1

As Figure 4 shows, each item on leasenumber 1 is listed by type with line-item totals. The final row of the result set is the total monthly lease amount. A lease-detail report is a common request, so this would likely be one of the most-executed stored procedures for this database.

Now, let's look at the STATISTICS IO output of this stored procedure. Run the GetInvoiceDetails stored procedure with STATISTICS IO turned on:

SET STATISTICS IO ON
EXEC GetInvoiceDetails 1

You get the results that Figure 5 shows. (To ensure consistent results, I've already loaded the data into cache by running the procedure several times, so you have no physical reads). This fairly common lease-detail stored procedure generates 35 scans and 70 logical reads. As you'll soon see, this number of reads is much higher than necessary.

The SuperSub Database Schema

Next, let's look at the same leasing-company scenario, this time using the supertype-subtype implementation in the SuperSub database, which Figure 6 shows. Notice how I've removed the three M:N relationships and replaced them with a single M:N relationship between the Leases and Items tables. Three subtype tables—Furniture, Trucks, and Computers—reference Items, the new supertype table. Also, the common attributes Manufacturer (Make), Model, and LeaseRate have moved to the Items supertype table. This design makes the schema much more efficient because the lease-detail information now resides in the Items table, not in the subtype tables. Consequently, the stored procedure to generate the lease-detail report, GetInvoiceDetails2, is much smaller, as Listing 2 shows, and less resource-intensive.

To demonstrate the difference in performance, let's execute GetInvoiceDetails2 against the SuperSub database:

USE SuperSub
GO
EXEC GetInvoiceDetails2 1

As the code in GetInvoiceDetails2 shows, you now need to query only three tables to obtain the same lease report that the first procedure returns. The first block returns item information; the second block returns the total monthly amount billed. The following query produces the same report:

SET STATISTICS IO ON
EXEC GetInvoiceDetails2 1

However, the STATISTICS IO numbers have dropped:

Table 'Items'. Scan count 26, logical 
reads 52, physical reads 0, read-ahead reads 0.
Table 'Leases_Items'. Scan count 2, logical
reads 4, physical reads 0, read-ahead reads 0.
Table 'Leases'. Scan count 1, logical reads 2,
physical reads 0, read-ahead reads 0.

GetInvoiceDetails2 produces 29 scans and 58 logical reads on three tables. That's 6 scans, 12 reads, and 4 tables fewer than GetInvoiceDetails—a significant savings, but you can still do better.

The Icing on the Cake

With the previous two stored procedures, you had to use an additional query to return the total monthly amount billed. Using GetInvoiceDetails in the ManyToMany database, you needed a sum of columns from three tables. Running GetInvoiceDetails2 in the SuperSub database, you needed similar logic because of the GROUP BY statement. However, in the SuperSub database schema, you can now obtain the monthly amount by summing the LeaseRate column in one table, Items. Therefore, you can omit the separate query and simply use the GROUP BY clause's ROLLUP operator. You use SQL Server's ROLLUP operator with the GROUP BY clause in situations such as this, to return grand-total information.

Let's look at the final version of the lease-detail stored procedure, which Listing 3 shows. There's a lot less code than in GetInvoiceDetails, but also note how you can now generate the entire report by using a single SELECT statement that uses the ROLLUP operator, instead of by unioning multiple result sets. ROLLUP produces more summary information than you need, so you can use the HAVING clause to eliminate unnecessary rows from the result set.

For performance comparison, let's look at the STATISTICS IO output from the final stored procedure. Run GetInvoiceDetails3 with STATISTICS IO turned on:

SET STATISTICS IO ON
EXEC GetInvoiceDetails3 1

and you get the following statistical output:

Table 'Items'. Scan count 13, logical reads 26,
physical reads 0, read-ahead reads 0.
Table 'Leases_Items'. Scan count 1, logical
reads 2, physical reads 0, read-ahead reads 0.
Table 'Leases'. Scan count 1, logical reads 2,
physical reads 0, read-ahead reads 0.

From the GetInvoiceDetails original numbers of 35 scans and 70 logical reads of 7 tables, the statistics have dropped dramatically to 15 scans and 30 logical reads of 3 tables—less than half the original total. That's a major performance savings for a common operation.

Now that you've seen how implementing a less-common relationship can generate real performance improvements, I hope you'll begin to question whether the typical and seemingly correct way to implement your database design is the only way—or the most efficient way. Changing the schema of an operational database is a major undertaking and sometimes impractical, but your future database implementations can only benefit from some exploration into the realms of less-common relationships. The "Big 3" relationships are just part of the database-implementation picture; I encourage you not to get a false sense of completion once you've mastered them. And next time you're reading up on database relationships, don't be so fast to put down the book when you come across an unfamiliar one. Your processors will thank you for it.

Comments

Plain text