
Performance-Tuning the Data Model: Configuring the Database

The best way to solve database performance problems is to avoid them from the beginning

Editor's Note: This article is the first in a series about performance-tuning the data model.

Performance-tuning a SQL Server database starts with a sound database configuration and data model. Although you can achieve performance gains by retrofitting a database with various types of indexes and more powerful hardware, you can never completely compensate for a bad data model. Poor database configuration or a bad data model can cause slow system response, blocked or deadlocked transactions, weird or inaccurate results on business reports, lack of data synchronization, inconsistent data, and the inability to write a query that returns the data you're looking for. It's true that a bad data model might not be the only cause of such problems. For example, a database server pushed to its limits might cause slow system response, or a bad mix of transaction updates from conflicting applications can cause deadlocking or blocking. You should always use due diligence and check for the cause of such conditions. Then, if you can't find an overwhelmed CPU or two conflicting transactions trying to acquire an X-lock on the same data resource, take a serious look at the database configuration and the data model; they might be the source of your trouble.

To create a database that performs the way you expect, you must first look at the foundation the database sits on. Start by optimizing the OS. Then, configure the database to support a good data model and preserve data integrity. When you proactively implement these basic elements, you can prevent many performance problems before they start and set the stage for better overall database performance.

Preparing the Environment

SQL Server performance tuning begins with the Windows OS. Because SQL Server runs only on Windows, it's imperative that you stay on good terms with your Windows systems administrator. Two essential settings to pay attention to on the database server are the file system and the page file. For SQL Server, you should use the NTFS file system. NTFS is more robust and secure than FAT, even if the FAT file system (arguably) performs slightly faster on writes. When you configure your page file, remember that the rule of thumb for virtual memory is to assign a static size of 1.5 times the size of physical memory. Also, if any server component, such as a network card or hard disk, goes to sleep after a period of inactivity, change its power settings (or have your systems administrator change them) so that it never sleeps; you don't want to risk having to cold-boot to wake up a sleeping component. In a multi-protocol environment, make sure that TCP/IP is at the top of the protocol stack. And if you have a slow network, ensure that the default login timeout is set longer than the time it takes to log in to the applications that use the database.

In addition to preparing your OS for optimal support of SQL Server, be sure you make the environment fault tolerant. For reliability and speed, I recommend using a RAID environment for SQL Server. Although RAID setups can be expensive, you can make the best use of your RAID budget by carefully selecting the best type of RAID for your environment and by deciding what parts of your database are most important to protect. For an explanation of different types of RAID systems and their benefits, see the sidebar, "Running SQL Server on RAID."

Build a Solid Foundation

SQL Server performance tuning relies on a sound database configuration. Design your database for optimum performance by separating into individual filegroups the major types of data that you'll be storing in the database. Separate system tables from user tables. Separate data from indexes. Separate tabular data from image, text, and ntext data. Using this scheme for separating data into multiple filegroups creates a highly scalable database. Scalability is the ability to increase the number of transactions processed without diminishing performance. For a small system, you can put all the filegroups on one drive (with the exception of the transaction log file, which you should always put on a disk separate from all the other data). As the system grows, storing more data and supporting more users, you can move the various filegroups to separate drives, thus dividing the workload across multiple disks. Separating your database into multiple filegroups can also help you manage database backups. You can use filegroups to enable backups of very large databases (VLDBs) within the database backup time window. (For more information about using filegroups for backups, see Kimberly Tripp's article "Before Disaster Strikes," September 2002.) And you can use filegroups to facilitate horizontal partitioning, which I wrote about in "A New Lease on Life," September 2003. When you're designing a database for performance, filegroups are a useful tool that can help you avoid problems before they start.

The code in Listing 1 shows the standard way I create a database. Each filegroup has three names: a filegroup name, a logical file name, and a physical file name. You can see these names if you open any database properties window, then choose the Data Files tab. On that tab, the entries in the Filegroup column correspond to PRIMARY and the filegroup names that you see in the first part of Listing 1. The entries in the File Name column are the logical file names, and the entries in the Location column are the physical file names, which include a full path to the location of the physical files on your hard disk. Put SQL Server system tables in the PRIMARY filegroup. Put user tables and indexes in their respective filegroups, separate from the SQL Server system tables. Put image and text data in its own filegroup, as the code at callout A in Listing 2 shows.
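Listing 1 itself is part of the article's code download, so what follows is only a minimal sketch of a CREATE DATABASE statement along those lines. The MyDatabase_data and MyDatabase_image filegroup names come from the Listing 2 discussion later in this article; the other names, sizes, and drive paths are illustrative assumptions.

-- Sketch only; see the code download for the author's actual Listing 1.
CREATE DATABASE MyDatabase
ON PRIMARY
   (NAME = MyDatabase_sys,
    FILENAME = 'D:\SQLData\MyDatabase_sys.mdf',
    SIZE = 50MB),
FILEGROUP MyDatabase_data
   (NAME = MyDatabase_data,
    FILENAME = 'E:\SQLData\MyDatabase_data.ndf',
    SIZE = 500MB),
FILEGROUP MyDatabase_index
   (NAME = MyDatabase_index,
    FILENAME = 'F:\SQLData\MyDatabase_index.ndf',
    SIZE = 250MB),
FILEGROUP MyDatabase_image
   (NAME = MyDatabase_image,
    FILENAME = 'G:\SQLData\MyDatabase_image.ndf',
    SIZE = 500MB)
LOG ON
   (NAME = MyDatabase_log,
    FILENAME = 'H:\SQLLogs\MyDatabase_log.ldf',
    SIZE = 250MB)

Note that the transaction log file sits on its own drive, separate from all the data filegroups.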

The configuration options that follow the CREATE DATABASE command are set to conform to the ANSI SQL-92 standard. You might have to modify one or more of these configuration options based on your own shop standards. Check your existing configuration options to ensure that these will work in your environment.
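The exact option list isn't reproduced here, but the statements that typically follow a CREATE DATABASE command to enforce ANSI-style behavior look something like this sketch; adjust or omit options to match your own standards.

-- Illustrative only; verify each option against your shop standards.
ALTER DATABASE MyDatabase SET ANSI_NULLS ON
ALTER DATABASE MyDatabase SET ANSI_PADDING ON
ALTER DATABASE MyDatabase SET ANSI_WARNINGS ON
ALTER DATABASE MyDatabase SET QUOTED_IDENTIFIER ON
ALTER DATABASE MyDatabase SET CONCAT_NULL_YIELDS_NULL ON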

Integrity Is Integral

The next step in configuring the database for performance is to let SQL Server enforce referential integrity. For example, the Store and Sale tables in Listing 2, which are adapted from the pubs database, have a dependent relationship: a Sale can't happen without being associated with a Store. You can enforce this business relationship in one of two ways. You can assign the enforcement task to an application outside SQL Server, or you can let SQL Server take care of enforcing the rule. In my opinion, enforcing static rules such as the referential integrity between Store and Sale is best left to the database. You code the rule one time, as the code at callout B in Listing 2 shows; then SQL Server enforces the rule for all the database's users. If you try to enforce a rule within an application, the rule might be left out of subsequent versions of that application, or it might be overlooked by other applications that access the same data, leading to violations of referential integrity and possibly corrupted data.

In addition to enforcing referential integrity within SQL Server, I suggest that you use declarative referential integrity (DRI) rather than triggers or stored procedures. DRI is a constraint, and constraints execute more efficiently than triggers or stored procedures, especially if you'll be using bulk-loading operations to transfer data into the database.

Listing 2 shows how to establish DRI between the Store and Sale tables. As I mentioned earlier, these two tables are adapted from the pubs database. I've modified columns and included parameters that I feel are necessary in a good production environment, plus I've added some columns to better illustrate data-model performance-tuning concepts. The CREATE TABLE statement for the dbo.Store table adds a new column, StorePhoto, which is an image data type. Notice that in the last line of this CREATE TABLE statement I've directed the data to be stored in the MyDatabase_data filegroup, but the associated store images will be stored in the MyDatabase_image filegroup.
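The full Listing 2 is in the code download; a stripped-down sketch of the dbo.Store definition described above might look like the following. The StorePhoto column and the two filegroup names come from the article, and the remaining columns are placeholders.

-- Sketch only: row data goes to MyDatabase_data, image data to MyDatabase_image.
CREATE TABLE dbo.Store
   (StoreCode   char(4)     NOT NULL,
    StoreName   varchar(40) NOT NULL,
    StorePhoto  image       NULL)
ON MyDatabase_data
TEXTIMAGE_ON MyDatabase_image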

Foreign keys. I always create primary key and foreign key constraints separately from the CREATE TABLE statement because I want to be in control of the constraint names. You can designate a column as a primary key or a foreign key when you create the table, but I don't do that. I prefer separate ALTER TABLE statements for the primary and foreign key constraints. When you're designing and developing a data model, change is your constant companion. Therefore, you must be able to quickly and readily identify and reference the various constraints.

The default constraint names that SQL Server assigns aren't too difficult to interpret. For instance, FK__sales__stor_id__0AD2A005, the name of the original foreign key constraint between Stores and Sales in my copy of the pubs database, is obviously a foreign key in the Sales table that uses the stor_id column. However, the name of the constraint doesn't tell me what table it's referencing. My naming convention, FK_Sale2Store, is more succinct, and although my name doesn't contain column information, it clearly tells me that Sale is dependent on Store. In addition, I don't have to execute a query to get the full constraint name. I need only to follow the standard that I've established for foreign key constraint names. Any foreign key constraint will start with FK_, followed by the name of the dependent table, the number 2, and lastly the name of the independent table.
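A hedged sketch of that convention follows. The constraint name FK_Sale2Store and the StoreCode column come from the article; the primary key constraint name is an assumption.

-- Primary key first, then the foreign key that references it.
ALTER TABLE dbo.Store
   ADD CONSTRAINT PK_Store PRIMARY KEY (StoreCode)

ALTER TABLE dbo.Sale
   ADD CONSTRAINT FK_Sale2Store
   FOREIGN KEY (StoreCode) REFERENCES dbo.Store (StoreCode)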

Indexing. SQL Server doesn't automatically create an index for foreign key columns (as it does for the primary key of a table). In production, you can count on using the primary and foreign key columns for join operations. Therefore, when you create a new database, you should create an index for each foreign key column in any dependent table. Typically, these indexes are nonclustered. I'll talk more about clustering in future articles, but suffice it to say that you cluster on the column or columns that will be used most for data retrieval. You might even wait until your database is near or in production before you make decisions regarding clustering.
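For example, a nonclustered index on the Sale table's foreign key column might be created like this; the index name and the MyDatabase_index filegroup are assumptions for illustration.

-- Supports joins from Sale back to Store on the foreign key column.
CREATE NONCLUSTERED INDEX IX_Sale_StoreCode
   ON dbo.Sale (StoreCode)
   ON MyDatabase_index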

For an example of this clustering philosophy, take a look at the clustered index at callout C in Listing 2, which I built on the Sale table using StoreCode plus SaleID. Sales reports will run daily or hourly and will be sorted first by store (represented by StoreCode), then by SaleID. The SaleID values increase throughout the day (SaleID is an identity column), so they serve as a de facto entry-sequence number; both SaleID and SaleDate increase with each new sale. Most of these reports will be summary reports with totals calculated by store. StoreCode plus SaleID is an excellent candidate for clustering. The clustered index will speed up data retrieval because the data is already organized and ordered, in this case first by StoreCode and then, within each StoreCode value, by SaleID. StoreCode by itself joins the Sale table back to the Store table. Having two indexes (one clustered, one nonclustered, both starting with StoreCode) might be a redundant design, but once the database is in production, you'll have a chance to see how SQL Server will use (or not use) these indexes. If SQL Server doesn't use the nonclustered index, you can safely drop it. During the design phase, however, I suggest you create a separate index for each foreign key column, even if it means some initial indexing redundancy.
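A sketch of such a clustered index, with an assumed name, might look like this:

-- Physically orders Sale rows by StoreCode, then SaleID, to match the report sort order.
CREATE CLUSTERED INDEX IXC_Sale_StoreCode_SaleID
   ON dbo.Sale (StoreCode, SaleID)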

If possible, define default and check constraints in the database rather than at the application level, for the same reasons I suggested you let the database manage referential integrity. Because constraints are database objects, they execute quickly and efficiently, more so than code stored in external application logic. If the rules that default and check constraints express are static (that is, they don't change dynamically, or every week, or even every month), you can define these rules as column or table constraints.

In Listing 2's CREATE TABLE dbo.Sale statement, the SaleDate column has a default value of CURRENT_TIMESTAMP. This rule probably won't change for the life of the database, so it's a good candidate for a column constraint. The next rule, on SalePayTerms, is expressed as a check constraint: SalePayTerms can't be NULL, and initially the values defined for SalePayTerms are Net 30, Net 60, and On Invoice. An application can't override this set of values; if you need to add a value to the set, you'll need to modify this column constraint. If you suspect that this set of values will be modified frequently, a more appropriate way to implement the rule would be to create a lookup table of sale payment terms, give each row in the table a unique identifier, and relate the lookup table to dbo.Sale in a one-to-many (1:M) relationship. The relationship between the lookup table and dbo.Sale would be a dependent relationship, enforced in much the same way as the relationship between dbo.Store and dbo.Sale.
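Inside the CREATE TABLE statement, those column constraints might look roughly like this fragment. The default value, the payment-term values, and the SaleDate, SalePayTerms, StoreCode, and SaleID columns come from the article; the constraint names, data types, and overall column list are assumptions.

-- Sketch of dbo.Sale with the default and check rules as column constraints.
CREATE TABLE dbo.Sale
   (SaleID       int IDENTITY(1,1) NOT NULL,
    StoreCode    char(4)           NOT NULL,
    SaleDate     datetime          NOT NULL
       CONSTRAINT DF_Sale_SaleDate DEFAULT CURRENT_TIMESTAMP,
    SalePayTerms varchar(12)       NOT NULL
       CONSTRAINT CK_Sale_SalePayTerms
       CHECK (SalePayTerms IN ('Net 30', 'Net 60', 'On Invoice')))
ON MyDatabase_data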

Database performance tuning is a huge, vital topic that requires a solid understanding of the database environment and plenty of hands-on experience. A good place to start is at the beginning, by understanding your data model and the business rules and requirements behind it. Only then can you optimize the data model for high performance.
