Performance-Tuning the Data Model: Thinner is Better

Normalizing keeps table rows slender

When you're optimizing a data model for good performance, "thinner is better" is a philosophy that you can apply to your database tables. In this context, "thinner" relates to the number of columns in a table and the maximum byte count in one row (aka row size). Keep in mind that one of the more expensive operations in a database environment is disk I/O—transferring data from disk to cache and back again. All computations, all retrievals, even the simplest queries (e.g., SELECT * FROM pubs.dbo.authors) happen only after data is copied into cache memory. Recall also that the fundamental unit of data storage in SQL Server is the page, an 8KB chunk of contiguous space on the hard disk. (This 8KB translates to 8192 bytes, but after you subtract space for the page header and the row offset areas, the amount of space available for user data is about 8060 bytes. For more information about space usage in SQL Server, see the "Pages and Extents" section in SQL Server 2000 Books Online—BOL.) So generally, thinner tables result in better performance because they use fewer I/O operations and require less storage space. To help you improve your database performance, I want to share the "thinner is better" concept, tell you what problems are associated with "fat" tables, and show you how to put tables on a diet.
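
If you want to see this page-level cost for yourself, the sp_spaceused system procedure reports how much space a table and its indexes occupy. This quick check is a minimal sketch that assumes the pubs sample database is installed:

-- Report the space a table occupies (assumes the pubs sample database).
USE pubs
GO
EXEC sp_spaceused 'dbo.authors'
GO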

Why Thinner is Better

To optimize your data model and build thin tables, you need to understand the performance benefits of data normalization and how to normalize data. If it's been a while since you dealt with normalization or you'd like to review normalization concepts, take a moment to read "Why You Need Database Normalization," March 1999.

"Thinner is better" doesn't demand that you convert every relation in your data model into a four- or five-column table. What it means is that, through a good understanding of your data and business requirements, you can thoroughly normalize each relation in the data model so that it reflects third normal form (3NF). Then, by definition, your tables will be relatively thin.

"Thinner is better" has a practical side that relates to the number of rows of a table that can fit on a page. The more rows that fit on a page, the more data that SQL Server can transfer from disk to cache memory and back again with one I/O operation. Generally, the less I/Os SQL Server needs for a computation or retrieval, the better the performance.

A Heavy Example

In the first article in this series, "Performance-Tuning the Data Model: Configuring the Database" (July 2005), I introduced two tables, Store and Sale. In this article, I expand on that model, adding to these two tables while I explain the concept of a fat table and how to thin it down. Make sure you change the filespace paths in all the example code to coordinate with your own environment.

The example data model is for a custom furniture shop, Furniture 4 U, which manufactures furniture on demand. The manufacturing process for this company is crudely represented by the table that Figure 1 shows, BadDesign_ProcessOrder, and by the statement that Listing 1 shows, CREATE TABLE dbo.BadDesign_ProcessOrder. The BadDesign_ProcessOrder table is an example of a fat table.
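
To give you the flavor of the first table Listing 1 creates, here's a condensed sketch of that kind of fat, fixed-length design. The column widths and several column names are representative assumptions rather than the actual listing; the real table has more columns than shown and totals about 2800 bytes per row:

-- Condensed sketch of a fat, fixed-length design; the widths and the
-- columns marked hypothetical are illustrative, not the actual Listing 1.
CREATE TABLE dbo.BadDesign_ProcessOrder (
   ProcessOrderID          int       NOT NULL PRIMARY KEY,
   CustDesignDate          datetime  NULL,
   CustOrderDate           datetime  NULL,
   ProcessStatus           char(10)  NULL,
   ProcessStatusDescr      char(200) NULL,
   BeginProdDate           datetime  NULL,
   BeginFinishDate         datetime  NULL,
   CompleteProdDate        datetime  NULL,
   FurnitureDescr          char(500) NULL,  -- hypothetical column
   RawMaterialSupplier     char(200) NULL,  -- hypothetical column
   RawMaterialShipConfirm  char(50)  NULL,  -- hypothetical column
   ShipOption              char(20)  NULL,
   DeliveryConfirmNbr      char(50)  NULL   -- hypothetical column
)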

When a customer places an order for a piece of custom-made furniture, two things happen. First, Furniture 4 U regards this customer order as a sale and records it in the BadDesign_ProcessOrder table by filling in the CustDesignDate and CustOrderDate columns. Next, the company has to process the order—that is, the company has to make the furniture. Employees record this activity in the BadDesign_ProcessOrder table by filling in the relevant date columns.

It's easy to imagine that when Furniture 4 U first started business, employees used a paper-based system to track the furniture-making production line. In time, they got a computer and used spreadsheets to track raw materials, suppliers, shipping and delivery dates, and production steps. When the company upgraded from spreadsheets to a database, the path of least resistance was to use the spreadsheet layout as a template for the new database table design. This kind of organic growth is how table designs such as BadDesign_ProcessOrder happen.

Fat tables contain more data than necessary for most computations, retrievals, or data management. Typically, data updates involve only a small number of fields from a record, but SQL Server has to move the entire record between disk and cache memory to record the data changes. If the record is fat, the operation is more resource-intensive than if the record is thin. Similarly, most reports retrieve only part of the total number of columns in a table, often joining the retrieved columns to the content from another table. If both tables involved are fat—and especially if they lack appropriate indexes for optimizing the join—the resources the operation consumes can adversely affect the entire database's performance.

In addition to being inefficient and uneconomical, a table such as BadDesign_ProcessOrder also violates 3NF. This table contains attributes that describe the process of an order going through the production line (ProcessStatus, ProcessStatusDescr, BeginProdDate, BeginFinishDate, CompleteProdDate), and it also contains attributes that describe the type and amount of furniture in the order as well as shipping and delivery information, as the code at callouts A and B in Listing 1 shows. A quick calculation tells me that the maximum byte count (row size) of each row in BadDesign_ProcessOrder is about 2800 bytes. Only two rows will fit on each page. Thus, data-transfer rates during record I/O operations will be abysmal.
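
If you want to check that arithmetic yourself, SQL Server 2000's syscolumns system table records each column's declared byte length, so summing those lengths gives a ballpark maximum row size. The sketch below ignores per-row overhead such as the null bitmap, so treat the result as an estimate:

-- Ballpark the maximum row size from declared column lengths
-- (for variable-length columns, length is the declared maximum).
SELECT SUM(length) AS MaxRowBytes,
       8060 / SUM(length) AS RowsPerPage
FROM syscolumns
WHERE id = OBJECT_ID('dbo.BadDesign_ProcessOrder')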

I've rigged this example to emphasize the way fat tables negatively affect data-transfer rates. The BadDesign_ProcessOrder table has mostly fixed-length character columns. The second table that the code in Listing 1 creates is MoreBadDesign_ProcessOrder, in which I replaced most of the fixed-length character data types with variable-length character data types and recalculated an average byte count for each row based on an estimate that each variable-length character column will use 40 percent of the space available to it. The byte count has dropped to 1176 bytes, and now six rows will fit on each page during I/O operations. This data-transfer rate is still not good, but it's better than the previous rate.
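
The 40 percent figure is only an estimate. Once a table holds real data, the DATALENGTH function reports how many bytes each variable-length value actually stores, so you can measure the true average fill. The column names in this sketch are placeholders for whichever varchar columns your table ends up with:

-- Measure the real average bytes stored in variable-length columns
-- (the column names here are placeholders, not the actual Listing 1 names).
SELECT AVG(DATALENGTH(ProcessStatusDescr)) AS AvgStatusDescrBytes,
       AVG(DATALENGTH(FurnitureDescr))     AS AvgFurnitureDescrBytes
FROM dbo.MoreBadDesign_ProcessOrder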

Getting the Skinny

The solution is to put the fat table on a diet. If we normalize this table, we'll see BadDesign_ProcessOrder become two thin tables with two supporting reference, or lookup, tables. I don't have enough space to walk you through the normalizing process, but I'll show you representations of the tables before and after normalizing. (For suggestions about how to use this article's techniques to restructure an existing production table, see the sidebar "What to Do About Fat Albert.")

Figure 1 is the "before" picture, the BadDesign_ProcessOrder table containing all data about the production-line process, materials, and shipping and delivery. Figure 2 is the "after" picture. For clarity, I've included the Store and Sale tables from July's article, "Performance-Tuning the Data Model: Configuring the Database." Listing 2's ProcessOrder and ProcessOrderStep tables contain the production-line information. The script also creates the ProcessStatus and ProcessStep lookup tables. As I advised in July's article, I used declarative referential integrity (DRI) to enforce all foreign key references. I included code to create indexes on the foreign key columns, and I'm using the clustering philosophy that I explained in July's article.

Each step in order processing is identified by a meaningful code (ProcessStepCode) that the ProcessStep lookup table describes. The date each step happened (or is scheduled to happen) is recorded in the ProcessStepDate column of the ProcessOrderStep table. If a confirmation number (e.g., a verification number for a shipment of raw materials from a supplier, a tracking number for finished product delivery) is associated with a production step, the company can record that number in the ProcessStepConfirmationNbr column of the ProcessOrderStep table.

The ProcessOrder table stores a quick overview of each order as it moves through production. There's a ProcessStatusCode for filtering on orders that are ready to ship, that have been shipped but not delivered, or that are waiting on parts from the raw-materials suppliers. The ProcessStatus lookup table contains these codes and descriptions. The columns that contain descriptions about what the furniture will look like are still in this table, as the code at callout A in Listing 2 shows. In the future, I might consider relocating these descriptions to a separate table, but I'll have to know more about the business processes and how the company uses the data before I make that call. This table also includes the ShipOption attribute, which records each order's shipping method.
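
A status report that needs only the orders in a given state, for example, can touch nothing but the thin ProcessOrder table and the small ProcessStatus lookup table. In this sketch, the 'RDY' status value is hypothetical, and the column names I haven't spelled out in the text follow the earlier sketch:

-- Orders that are ready to ship (the 'RDY' code value is hypothetical).
SELECT po.ProcessOrderID, po.ShipOption, ps.ProcessStatusDescr
FROM dbo.ProcessOrder AS po
JOIN dbo.ProcessStatus AS ps
   ON ps.ProcessStatusCode = po.ProcessStatusCode
WHERE po.ProcessStatusCode = 'RDY'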

The ProcessOrder and ProcessOrderStep tables are significantly thinner than the BadDesign_ProcessOrder table. Wherever appropriate, I've used variable-length character data types and the same 40 percent fill rate as the old table used. The byte count for the ProcessOrder table is 72, which means that 111 records will fit on a page. The byte count for the ProcessOrderStep table is 236; 34 records will fit on a page.

Furniture 4 U creates many reports that use only the ProcessOrder table or only the ProcessOrderStep table, especially when the company is reviewing the processes on the production line. For cases in which the ProcessOrderStep table is joined to the ProcessOrder table, the thinner records mean that the join operation will run efficiently and quickly; the query sketch after this paragraph shows the kind of join involved. Thinning down each of the tables vastly improves data-transfer rates. Listing 3 is a cleanup routine that you can use to quickly wipe the example tables off your database when you're through experimenting with them.
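
Here's what such a production-history join might look like; the order number and the column names I haven't spelled out in the text are hypothetical, following the earlier sketch:

-- Production history for one order; a join between the two thin tables.
SELECT po.ProcessOrderID,
       pos.ProcessStepCode,
       pos.ProcessStepDate,
       pos.ProcessStepConfirmationNbr
FROM dbo.ProcessOrder AS po
JOIN dbo.ProcessOrderStep AS pos
   ON pos.ProcessOrderID = po.ProcessOrderID
WHERE po.ProcessOrderID = 1001   -- hypothetical order number
ORDER BY pos.ProcessStepDate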

Tuning your data model for good performance is a complex task that starts with knowing how your organization's management and users are going to use the data. Once you understand your business and user requirements, you can work on maximizing the effectiveness of data transfer by creating thinner tables. In the process of thinning down tables, don't be surprised if you find you're doing a lot of data normalization. The two just seem to go together.
