Feng Shui and Database Design

Downloads
23707.zip

Applying principles of an ancient art can help you achieve balance in your database design

The premise behind the ancient practice of feng shui is that the positioning of objects in your home or office can either impede or encourage the flow of energy in and around you. Flowing energy, called chi, can improve your creativity, your energy level, and your overall outlook on life. Impediments to the flow of energy or objects that send that energy in the wrong direction can have negative effects on the world around you. Similarly, after 20 years of programming, building, and designing databases, I've noticed that data has its own energy. When a database design is appropriate for the demands placed on it, data flows easily. When a database design is missing important components such as indexes, the action of one user can disrupt the flow of data for many users. Let's examine some of the design choices that can help you improve the flow of data in your databases.

Find the Proper Orientation

The orientation of the user to the data is a fundamental concern in database design. Online transaction processing (OLTP) applications typically perform many insertions, updates, and deletions, whereas analysis applications (including OLAP applications) perform mostly SELECT queries. In other words, OLTP applications change the data; analysis applications read the data. These two operations have always been fundamentally opposed to each other.

The main way that opposition manifests itself in SQL Server is in lock contention. OLTP applications must use exclusive locks to maintain data integrity; however, those locks disrupt the flow of data to analysis programs. Many of the improvements to SQL Server's engine over the past 5 years have focused on decreasing the time that exclusive locks stay in place, the speed with which they're imposed and removed, and the granularity of what they lock. The idea behind changing from page-level locking to row-level locking is that you can reduce overall contention for data by spreading the locks across more discrete units. Whether this idea is valid depends not only on the database's size relative to the number of concurrent users but also on how applications manipulate the data.

As an example, let's look at what might happen if you used the Northwind database to support Northwind Traders' e-commerce Web site. As you might expect, most of the records will be in the Order Details table. Given the relationship between Orders and Order Details, you'll have several rows in Order Details for each row in Orders. The rows in Order Details are 22 bytes each, so if you use page-level locking, the system could support as many as six (2155 rows ÷ 368 rows per page) users simultaneously making changes that require an exclusive lock. Similarly, while many hundreds of users might be able to read rows out of the table, a single shared lock could prevent just as many people from being able to change their orders. As the number of rows in Order Details increases, the potential number of concurrent users increases proportionately because the probability that two users will want the same page decreases. In this way, growth in a table can actually improve data flow.

Row-level locking is a significant improvement over page-level locking because locking affects smaller sections of the table (i.e., one lock affects only one row instead of all 368 on a page). However, the level of contention will still increase as the number of users grows. Row-level locking just increases the number of concurrent users because the locks are more granular and more people can hold locks on individual rows with less chance of interference. However, as a table's row size grows toward the page size, row-level locking loses some of its benefits because the locking granularity approaches that of page-level locking. Therefore, in systems with row-level locking, heavily used tables should have the most possible rows per page.

Additionally, the server itself can be a potential impediment to data flow. As the number of users increases, so does the demand for space in the data cache. The load on the hard disk subsystem also increases as SQL Server searches for specific records. If Northwind Traders doesn't regularly archive rows out of the Order Details table, the time to search for rows and load them into memory will increase linearly until the hard disk subsystem reaches capacity. When the disk subsystem exceeds capacity, the search time tends to increase logarithmically because of delays in transferring data from the disk to memory. In this case, the constraints of the server's hardware can limit even the best database design.

The fact that the Products table contains a column for storing the current inventory levels is a different sort of problem. The concern is that Products will be a part of nearly every operation the database supports. The SELECT query issued by someone browsing for items to buy will prevent an update of the inventory levels by someone placing an order. Orders being finalized also cause a problem for everyone on the system, and someone running a long sales report could prevent a large number of customers from being able to complete their orders.

Realtime inventory management is an age-old problem with many different solutions. Consider what happens when a customer on the Web site places 100 bottles of rare wine in a shopping cart. If the application reserves those bottles by decreasing the amount in the Products table, other customers who want that wine might find that the inventory on hand is less than they need and go somewhere else to shop. If the application adds the item to the cart but doesn't reserve the inventory, customers might buy more bottles than Northwind Traders has in stock. This business decision is quite difficult.

Thus, business decisions often affect the database design, too. In this case, if Northwind Traders opts to reserve the inventory by decrementing the UnitsInStock column in the Products table, the exclusive lock required to perform the UPDATE statement will block the data flow for everyone attempting to buy that item. If multiple customers attempt to reserve multiple items at the same time, you have a great risk of a deadlock condition, which stops data flow for everyone involved.

This blocking problem arises because the Products table is central to so many operations; the solution is to eliminate the single point of failure. One approach is to permit dirty reads—read operations that ignore the exclusive lock. For the Products table, letting customers see inconsistent data isn't a problem because most of the columns contain static data, such as product name and supplier name. You need to be careful only when users view current inventory levels. The exclusive lock placed during the UPDATE statement that changes the inventory levels serializes access to the row on a first come, first served basis, but allowing dirty reads will let customers see inventory levels that might be greater than what's actually on hand. If too many people attempt to order too much of one item, inventory levels could even drop into negative numbers. Although many applications use negative inventory to indicate how many items are on back order, items such as rare wine can't be reordered. In most cases, you can resolve the situation by using an UPDATE trigger to check for negative inventory numbers. How you handle this type of problem is a business decision, and it will affect your overall database design because it affects the meaning of the values in a table.

Understanding the orientation of the user to your data is important. Knowing how many people will use the database simultaneously as well as what types of operations they'll perform, how often, and on what data will help you maximize data flow in your database design. Typically, bad designs come from either ignorance of or bad guesses about these factors; therefore, they should be the first things you consider when you start the design process.

Find the Right Balance

Although users will have the biggest effect on your database design, the choice of hardware will significantly affect its ultimate success or failure. Slow hardware can exacerbate a bad design, whereas fast hardware can mask many problems. Unfortunately, many programmers depend on fast hardware to save them from having to think too hard about the databases their applications use.

As an example, one of my clients, a Web site hosting service, originally stored all its customer and dynamic Web content in a Joint Engine Technology (JET) database, which its Active Server Pages (ASP) application accessed through ADO. The database was a complicated mess of interrelationships between tables because of a complex commission structure and because all the elements of the users' Web pages were stored in the database itself. When the number of users grew from 1000 to 5000 very quickly, the client moved the database to a single-CPU server that was twice as fast. That server worked until the number of users approached 10,000; then, the client moved the database to a four-CPU Compaq server with a six-disk RAID 5 array. The database was only about 20MB, but its design was so poor that users were interfering with one another. The inefficient locking mechanisms that JET uses were a problem, too. Moving to a faster machine just meant that each operation finished faster, which resulted in each user using less physical time. This approach worked in much the same way that a new lane on the highway helps lessen rush-hour traffic jams.

When the number of users reached 15,000, my client was forced to move the database to SQL Server and build indexes that were appropriate for SQL Server. We left the database design alone until the number of users reached 60,000. At that point, the application could handle approximately 1000 concurrent users updating the client's Web pages—an unacceptably low percentage of the total. Because the database was already on a quad-CPU system that was clearly not overutilized, moving to a faster server wasn't going to help. Even moving to multiple servers wouldn't help because the impediment to data flow was in the database itself, not the hardware. The only choice was to change the database design. The lesson I learned from that experience is that although hardware can mitigate problems that a bad database design creates, fixing even well-established databases ultimately becomes more cost-effective than adding memory, disks, or CPUs.

A common misconception is that full normalization of the data is the ultimate goal of database design. Normalization has the benefit of ensuring data consistency and minimizing redundancy, but if taken to the extreme, it can have the side effect of impeding data flow from the database server.

Going back to the Northwind Traders e-commerce Web site, the query that Listing 1 shows is a likely choice for a Web page that shows a customer all orders he or she has placed. As a rule, any time you can remove a table from a multitable SELECT, you improve the performance of the query. In this case, the ProductID column in the Order Details table lets the query retrieve the product name from the Products table. If you assume that the product name won't change and that the ProductID will never be associated with another product, you could improve the database design by adding a ProductName column to the Order Details table. This addition saves joining the Products table to the query, and as both the Products and Order Details tables grow, that improvement to the design will become even more significant.

The drawback to this improvement is that the Order Details table will become significantly larger than it would under the original design because it now contains a 40-character product name for each item ordered. Given the size of today's hard disks, storage of the additional data isn't a problem; however, the new column will require more data to pass through the disk's I/O buffers, which might lead to a decrease in the total throughput. In addition, each row of the Order Details table will require more RAM in the data cache, which means something else might get squeezed out. This situation is a good example of the balancing act you have to do when you try to optimize performance.

As a general rule, strive for full normalization in your initial design because normalization eliminates opportunities to add redundant or inconsistent data to the database. When you have your design completely worked out, look for ways to balance normalization and performance. Good places to look are queries with multitable joins, aggregates, GROUP BY clauses, correlated subqueries, and foreign key references. Each of these queries can be time-consuming and use a large amount of data cache and create temporary tables in tempdb.

Find the Blockages

The ratio of users to pages or rows and the ratio of the amount of RAM to the size of the database, as we saw earlier, are just two factors that affect data flow. The way applications insert rows into a table can change the flow of data into and out of the table as well. For example, SQL Server will place rows that are inserted into a table that doesn't have a clustered index in the last page allocated to the table. So, locks for INSERT operations will cluster at the end of the table. SQL Server 6.5 has a special mode called Insert Row-level Locking, which essentially gives SQL Server row-level exclusive locks for INSERT operations. SQL Server 2000 and 7.0 have row-level locking for all rows, so both minimize the problems that happen as a result of clustering inserts on the last page. All versions, however, must use a page-level lock when the new row doesn't fit on an existing page. Both the existing page and the newly allocated page will have page locks while SQL Server adjusts their contents. While these exclusive locks remain in place, no operations involving those pages can proceed, including new inserts. This, of course, causes a brief disruption of the data flow.

OLTP applications typically use several tables that are the targets of high levels of INSERT operations, and these tables usually have a large percentage of users attempting to use them simultaneously. If you assume these tables have a small percentage of rows deleted at any time, then a very large percentage of inserted rows will reside on just two pages: the last one in the table and the one allocated to handle the overflow. Add in the fact that the SQL Server storage engine tends to cluster page allocations on one hard disk at a time, and INSERT operations can quickly overwhelm a system. Plus, as the number grows, the time users spend waiting for operations to complete grows as well, usually nonlinearly.

The solution is to make INSERT operations occur as quickly as possible and to spread them out evenly across the table's pages. One benefit of a clustered index is that inserts are scattered throughout the table in index order; therefore, finding the right index order can mean the difference between good and bad performance. For example, if the Orders table used the OrderDate column for the clustered index, new orders would cluster at the end of the table, causing a potential problem. If you use the CustomerID column instead, each customer's orders would cluster in one area, and overlap between customers should be minimal. Furthermore, it's unlikely that a given customer will place two orders concurrently or that two customers with adjacent CustomerIDs will be ordering at the same time.

Of course, the introduction of shared locks by analysis applications further aggravates data-flow problems because the existence of shared locks prevents the imposition of exclusive locks. Fortunately, shared locks tend to be short-lived, and analysis applications don't usually read newly inserted data. OLTP applications that delete rows in heavily used tables, however, will encounter conflicts as analysis applications lock the areas in which new rows are inserted. Where an OLTP application and an OLAP application use the same tables, you need to know which operations your application performs.

Queries that scan large sections of a table, such as those in most analysis applications, are good candidates for improvement. For multitable joins in which searching one of the tables requires a table scan, query performance decreases in proportion to table growth. For multistatement transactions that include a table scan on one of the tables, the length of time SQL Server holds locks increases directly in proportion to table growth. The typical way of eliminating table scans is to create an index, but remember that INSERT and UPDATE operations force changes to every index linked to the table. As the number of indexes grows and as the table grows, performance of change operations decreases. If you have a mixture of transaction processing and reporting applications using the same database, consider copying the data for reports to another table or database because often the queries that generate reports don't benefit from the same indexes that help transactions.

For the implied transactions of simple SELECT queries, shared locks probably won't cause a problem because SQL Server releases each one as soon as it has read the row. User-defined transactions, however, can be a problem because SQL Server doesn't release the locks held by the transaction until it commits the entire transaction. UPDATE statements that affect large sections of a table are a particular problem because SQL Server must hold an exclusive lock until the end of the transaction. SELECT queries that use the HOLDLOCK hint have the same effect. Consequently, user-defined transactions should be as short as possible. This delay is also a good reason not to start a transaction from a client application. If possible, execute all user-defined transactions through a stored procedure; if you must have a client application start a transaction, make sure the application can't delay processing after the transaction starts.

Go with the Flow

As a general rule, applying the concepts of feng shui to database design translates into looking for impediments to data flow. Blockages can arise through congestion in tables that are central to the database's design, through contention for rows, through overuse of the hard disk subsystem, through rotation of pages in the data cache because of queries on large tables, and through inefficient queries that hold locks too long or read through the data too many times—to name a few. Ways to open data flow include copying data for analysis into a separate database from the one used for transactions, denormalizing the design to eliminate excessive joins, splitting tables so that columns that are mostly read are in a different table from those that are mostly changed, increasing the resources available to the server, and placing user-defined transactions in stored procedures.

To put it simply, look for ways to spread the data requests across the widest area possible. For database design, that means being suspicious of diagrams that have a small number of tables linked to all the other tables in the database, especially in OLTP applications. For data storage, it means using a clustered index to spread the data so that you reduce the probability of multiple users wanting adjacent rows or pages. For hardware, it means making sure that available RAM is a large percentage of the database size, that the files are spread across many physical hard disks, and that the hard disk controller has enough capacity to handle the disk I/O. For applications, it means sharing resources such as connections and minimizing use of common resources, such as locks and transactions, as much as possible. For users, it means understanding what they do with the data so that you can adapt the database design to their needs and not to some impractical ideal. By being aware of the natural flows of data through your server and by using common-sense techniques for broadening the paths they take, you'll avoid many common problems long before they exist.

Comments

Plain text