Database Design for Performance

Building and operating a performant SQL Server-based application requires strong process and engineering discipline. Many approaches to performance engineering—techniques such as bottleneck analysis and wait-state analysis—apply only after the system has been developed, when it's more challenging and expensive to address design problems. But if you create an ongoing performance-engineering process, you can anticipate certain behaviors and characteristics, applying a variety of design techniques and monitoring and maintenance techniques that will ensure proper design as well as smooth running operations.

Of course, the holistic process includes more than just technical steps. You need to include steps for covering formal release processes to ensure that you promote only tested applications to production; recommending that you have a forum with users for discussing upcoming business events that will affect the application; and insisting that you document any installation or operational actions to ensure good performance from the get-go. But when you get down to planning your database-design process, you need to have a laundry list of essential technical items. In this article, I cover the items you need on that list. In an upcoming article, I'll cover the essential pieces of setting up performant indexes, maintenance, and statistics for your environment.

The Life of a Query

If I were to oversimplify the steps to good database performance, I'd look start by looking at the "life of a query" and think about the process of statement execution. Fundamentally, the key to great performance is to help SQL Server minimize the time it takes to save or retrieve your data. In addition, you want to minimize the overall resources used and reduce response time. In simplifying this idea, I focus on four primary performance areas: statistics, maintenance, indexing, and database design. After a system is in production, you can make changes in some of these areas with little trouble. Other areas require schema changes and might be difficult to do at any time other than during initial database design or during a major revision of your database application.

These four performance areas flow into each other naturally; statistics are updated when maintaining a database, having the right indexes ensures that the maintenance will be effective, and an appropriate design will simplify index design and even make specific indexing features available. But database design is really the foundation on which you build. Some performance problems can be helped with indexes, maintenance, or statistics, but a truly scalable system must start with good design.

Application design traditionally consists of two steps: you develop a logical model of the business process you're automating, then you map that model to the database by creating a physical model, which is implemented as a series of tables. (To learn about database modeling, see Related Reading, page 14.) You can choose from many modeling techniques. In my experience, all modeling techniques provide a means to interact with the system's users in terms they understand, but all fall short when it comes to efficiently mapping to the physical database. I think that this gap exists because all the methodologies tend to be database agnostic, but to get true performance, you have to understand how to leverage the specific features provided by the underlying technology. Equally true, you must understand the system and how it's being used.

For example, consider the following scenario. At a conference, someone asked me how I would index a specific entity relationship diagram (ERD). I could look at the tables on that ERD, and if I found a Customer table, I could suggest an index on CustomerName because it's likely that you look up data based on a customer's name. However, if you add the information that it's your demographics analysis database and all sales are analyzed by city-state-ZIP code combinations, my suggested index on CustomerName would be useless and a burden in terms of space and performance (if you were to analyze data loads and insert rates).The moral of this story is that you can't design a highly performant database application in a vacuum: You must know your data, know your users, and know your system.

Database Design

When you set out to design your database, you'll quickly discover that you need to consider a lot of pieces. Table design, row size and page density, correct and consistent data type usage, column nullability, how to deal with LOB data, and data integrity are all key design aspects that you need to deal with in your process.

Let's take a look some of the most important details you'll need to assess in each of these areas. As with all things technical, my recommendations in each of these areas represent the tip of an iceberg. You'll fill in the gaps in your process by making mistakes, monitoring for performance, and deriving new solutions.

Table Design

Tables can be "flat" (too denormalized), similar to a spreadsheet containing lots of redundant data. Likewise, tables can be overly separated (overly normalized), requiring references to dozens of tables to access the information you need. Tables can be tall (in terms of the number of rows).Tables can be wide (in terms of the number of columns and column width).You'll experience problems with all of these types of tables and each type can provide benefits.To resolve some of the most common dilemmas about table design, follow this sequence of questions:

Is your table too denormalized or normalized?
Do you have only one table?
Do you have a lot of redundant data?
Do you find that many attributes have exactly the same values for multiple columns and multiple rows?

Is your table overly denormalized? Having some redundant data can be beneficial. However, to minimize the number of tables joined, the only data that should be redundant should be relatively static in nature and as narrow as possible. Ideally, I would just consider making some relationships redundant and not necessarily making detailed attributes redundant. For example, imagine a LineItems table that describes the items in each customer order. Each LineItem describes a product sold, and each product has a manufacturer. In your design, you've made LineItems, Products, and Manufacturers separate tables.You reference the ProductID in LineItems and the ManufacturerID in Products. However, when you do some routine analysis, you find that you're generally aggregating sales by manufacturer so that you can stock and sell more items from Manufacturers whose products sell more. If you add the ManufacturerID to the LineItem table, you'd be adding an extra column, but you'd completely eliminate the join to Products when you're doing your analysis. Additionally, even when attributes from the Products table (e.g., ProductType) are necessary, the optimizer has more options for how to process the three-table join when all join conditions are stated than when only the two possible join paths are available.

Is your table overly normalized? If every query requires multiple joins to get basic attribute information, you might have gone too far. Purists will argue that using an external code as a data key is a mistake because such keys are outside of the control of the system and might change.This argument is especially true in a large company in which different departments consume each other's data. However, ISO "country codes," for example, and other standard coding systems are safe to use directly because it's more likely that any change will affect the system, regardless of the original design. To learn more about database normalization, see Related Reading.

Is your table too tall? Having a lot of rows isn't necessarily a problem.With appropriate indexes and targeted queries, you can easily evaluate large datasets. (Well, you can if you have decent disks and lots of memory. Definitely target 64-bit machines if you're a VLDB data warehouse.) However, the most important clues for determining whether a table is too large are administration, maintenance, and management—not necessarily query performance. Really, it goes back to how the data is used. If the data follows a consistent pattern in loading or archiving (e.g., every week, month, or quarter, data is loaded from one or more OLTP systems), then you have a fundamental problem in efficiently managing that load and archive. The problem is so common that it's got multiple names: the sliding window scenario or the rolling range scenario. Loading a large amount of data into an existing table can be expensive, mostly in terms of the effect on the secondary non-clustered indexes.

Data loads are slow and result in a tremendous amount of fragmentation.As an alternative in some design systems, the indexes are dropped, the data is loaded, then indexes are added back. However, this technique is unrealistic if the existing table has 87 millions rows (2 years worth of data) and you're bringing in only 3 million rows from the last month's worth of activity. So, how do you make the load faster and keep the table intact and contiguous? Consider horizontally partitioning the data so that you can have more granular control of your data.

Is your table too wide? This important point leads us into the next section about row size and page density. Row size matters: It directly affects query performance and I/O characteristics. A rule of thumb is that having more attributes isn't necessarily better than having fewer attributes. Certain attributes might even reduce the possibility of online index operations. (To learn about table partitioning, see Related Reading.)

Row Size and Page Density

Without spending a tremendous amount of time covering internals, I want to convey a few key things you can learn from knowing a bit about the on-disk structures of your tables. Generally speaking, rows don't span pages. Specifically in SQL Server 2005, rows can span pages but only when the rows include limited variable columns. Don't confuse these columns with the new max types or LOB data; these are the NON-LOB data types such as varchar(n), nvarchar(n) and varbinary(n), in which each column width is less than or equal to 8000 bytes. Having said that, I want to emphasize that although SQL Server 2005 allows a 10K row, for example, you shouldn't necessarily have one. Internally, this row will consist of an "in_row" structure as well as an "overflow" structure. Rows that are less than 8K will be stored entirely as "in_row" structures, and the "in_row" structure of a row can't span pages. For example, if your "in_row" row structure averages 4,500 bytes per row, you can fit only one row per page. This limitation results in roughly 3,596 wasted bytes both on disk and in memory. If the attributes that make up the 4,500-byte aren't required for all requests, consider breaking this data into two or even three tables based on column usage. You might end up with more densely packed pages, less locking and contention, and better overall resource utilization. To learn more about row size and physical database storage, see Related Reading.

Correct and Consistent Data Type Usage

In addition to working hard to isolate your columns and vertically partition for better column grouping, you should also decide what's the best data type for any given job. If you need a date but you don't need a precise time, or if your dates are all relatively recent, consider using smalldatetime instead of datetime to save 4 bytes per row. If you have multiple dates, you could save considerable space over the course of many rows.

Additionally, keep your data type choices consistent throughout your application. If your stored procedures use different (but "compatible") data types, SQL Server might have to do an implicit conversion during processing. This implicit conversion might require a data (or index) scan, whereas SQL Server could have used an index more efficiently (through a seek) if the column didn't need to be converted. SQL Server can certainly handle some of the implicit conversions required by the use of different data types, but explicitly converting or consistently using the same data type in all references will eliminate confusion and result in better performance.

For example, say a member table contains a varchar column that stores last name values. This column is indexed, and the last name Tripp is highly selective (there are only two rows that match the value Tripp).When you execute the two queries in Listing 1, you get the showplan output that Figure 1 shows. First, notice the batch cost of these two statements.The first statement is only 6 percent of the cost of the batch and the second statement is 94 percent. The only difference between the first and the second statement is in the implicit conversion, as the showplan tooltip shows.You might want to consider profiling your server and capturing the showplan XML events to filter for CONVERT_IMPLICIT. This technique might help you narrow your search for poorly written applications, stored procedures, and ad-hoc statements that aren't consistent in terms of search arguments and data types.

Figure 1: Showplan output that reveals batch cost

Column Nullability

Although nullability no longer affects the physical structure of your table (SQL Server versions before 7.0 used to track nulls by using a single space), nullability can still cause additional work in terms of using functions to change, replace, or ignore null values. I'm not recommending that you avoid nulls, I just want to stress that you should use them only when they make sense and when your data truly requires tracking the "third state" in your columns' domain of legal values. Also, the architecture of the null block (which is how SQL Server 7.0 and later tracks nulls) changed between SQL Server 7.0 and SQL Server 2000. In SQL Server 2005 and 2000, the null block has one bit for every column, regardless of whether that column allows nulls.The null block is at least one byte, and for bits that don't represent real columns, the value will be 1. If the bit represents a real column and the column value is null, the corresponding bit is set to 1.

For optimal performance, consistency is crucial. If you're creating scripts, the state of the column's nullability isn't a required table option (for the CREATETABLE statement) and as a result, the nullability is determined by session settings. The two session settings that might determine whether your column allows nulls are ANSI_NULL_DFLT_ON (which can be ON or OFF) and ANSI_NULL_DFLT_OFF (which also can be ON or OFF). Note that these two session settings can't be ON at the same time. However, both can be OFF at the same time. If both are off, then the database option ANSI null default will be used—if this option is ON. If all session settings are off and the database option is off, then the column won't allow nulls.The obvious problem with all of these rules is that they're complex (please don't bother trying to remember this). And more importantly, your script results might not be reproducible between systems, environments, and executions. Leaving the nullability option off also means that you might have data integrity problems with complex scripts and that you haven't clearly decided whether your columns should allow nulls. Column nullability isn't something that you should leave to chance.

LOB Data on the Data Page

SQL Server 2005 and 2000 both give you the ability to store text, ntext, and image data "on page." By storing the data on page with the in_row data structure, SQL Server could avoid an additional lookup into a separate table structure, thus reducing I/O. But you end up with the problem you have when your rows are too wide. Do you really need to use the LOB data on every request? Is it really beneficial to widen rows for LOB data if it's not often used in queries?

In SQL Server 2000, the default is that the data is always off page unless you use sp_tableoption to set the text in row table option; this option requires several bytes that dictate whether the LOB value is stored in the data row. For example, the string

sp_tableoption , "text in row", 2000

says that LOB values of up to 2000 bytes should be stored with the "in_row" structure.This setting might be useful if you have a LOB value that's almost always returned by every query and when the LOB value is typically under 2000. However, if the LOB value isn't frequently used, this option makes the "in_row" structure unnecessarily wide (for most requests) and can reduce the effectiveness of the cache by filling it with the unneeded data. As a result, I rarely recommend this option.

In SQL Server 2005, Microsoft introduced four new data types: varchar(max), nvarchar(max), varbinary(max), and XML. The first three data types replace text, ntext, and image, respectively. XML is a new type. Although the new "max" types are meant to replace the old types, they also introduce some new behaviors. First, by default, SQL Server 2005 automatically stores the new types with the in_row structures if the sum (in bytes) of the LOB value and the in_row structure is less than the in_row maximum of 8060 bytes per row. As in SQL Server 2000, if the LOB value is consistently (and relatively) narrow (in number of bytes) but infrequently queried, you might want to consider setting the new option as

sp_tableoption , 
 'large value types out of row', 1

to force SQL Server to store these values off the page. However, it's important to know your data and know your users in this case because data access patterns are crucial for dictating the best choice.

Finally, although SQL Server 2005's new LOB data types simplify usage—you use them much like a variable column with the same access and function support—they have one additional side effect when you use them in an index. SQL Server 2005 allows any column to be in the leaf level of an index—including LOB types. If a LOB type is in the leaf level of an index, the index won't support online operations.You would very rarely want a LOB column in the leaf level of a non-clustered index, but if you've followed some of Kalen Delaney's internals articles over the years, you know that the leaf level of the clustered index includes all columns of the table. Simply put, if a table has a LOB column, that table's clustered index doesn't support online operations. For this reason, I would once again consider vertical partitioning. For cases in which minimizing downtime is a top priority, consider creating one table that holds critical data and a second table of infrequently accessed columns and LOB columns. This technique will let you keep your more crucial columns online even when you're performing some maintenance operations.

Data Integrity

People store a table's data integrity rules in the database for many reasons, particularly centralization and consistency. However, I've often heard the argument that data integrity rules add overhead and that it's not worth the performance cost to store rules in the database. My response to this argument is that there's no free lunch—you have to pay for data integrity somewhere. The main problem I have with storing data integrity rules outside the database is that often, application integrity costs aren't measured the same way that database inserts are measured; as a result, when business logic is moved to the client or middle tier, the server appears faster. So, although you might see better server-side throughput, you should realize that the work is still being performed, you're just distributing it.

To some people, this distribution is really the point. In the end, you might be able to handle more requests by taking some of your business logic off of the server. However, you might also end up with data-integrity problems when you upgrade the application or access your data through applications that don't include this business logic, such as SQL Server Management Studio.You might end up paying for these gains by spending extra time repairing bad data.

So, although some arguments for storing the business logic elsewhere are good, I strongly suggest storing data integrity rules as close to the data as possible. If data integrity is important in your environment, you must consider constraints. The most costly constraints are foreign key constraints because they must do data integrity checks when rows are inserted or updated in the referencing table as well as when a referenced table is updated or deleted. Verifying a foreign key value on the insert or update of a row in the referencing table is relatively easy because a foreign key can reference only primary or unique keys. (Primary and Unique keys enforce uniqueness through the creation of an index, which makes the verification of the referenced value fast.) However, if a record must be updated or deleted from the referenced table, SQL Server must perform a reverse lookup. SQL Server must verify whether any rows reference the value being updated or removed. In this case, you must check the column that contains the foreign key to see whether any rows contain that value. If the column doesn't have an index, the data values must be scanned. Maintaining this relationship (without indexes) can be very costly. As a tip, consider manually indexing your foreign key columns. This technique will not only minimize the cost of maintaining the primary or foreign key relationship, but it might also help improve your join performance in some cases.

Working from the List

I've given you a first look at a database-design laundry list for taking advantage of features that can offer many performance gains and help you avoid problems you might not have even realized you'd have. Designing your database for performance and performance engineering can provide huge rewards for your business, and using the resources you have—efficiently—is the best way to reap those rewards.You just need to be willing to invest the time to apply these basic principles to your current applications and database operations processes, hone your skills through experience and research, and document and pass on your skills to others in your organization. In an upcoming article, I'll cover the technical items you need to have in your indexing, maintenance, and statistics lists.

Related Reading

SQL SERVER MAGAZINE RESOURCES TO LEARN ABOUT DATA MODELING:

TO LEARN ABOUT DATABASE NORMALIZATION:

MICROSOFT RESOURCES TO LEARN ABOUT TABLE PARTITIONING:

TO LEARN ABOUT ROW SIZE AND PHYSICAL DATABASE STORAGE:

Comments

Plain text