Upgrade Your E-Performance

A little redundancy can boost Web speed and your bottom line

When it comes to database optimization for e-commerce, maximum performance for the Web user takes precedence over all else. Probably in no other type of dynamic application does SELECT statement performance take such priority over inserts and updates. (I say dynamic because an OLAP system is a totally different situation, in which data is usually updated monthly, quarterly, or even annually.) Contrast this type of dynamic Web application with a typical online transaction processing (OLTP) application you might run in-house—for example, an order-processing system. You wouldn't want your order-entry personnel waiting around for extended periods for record inserts and updates, so in most OLTP applications, you need to take all types of Data Manipulation Language (DML) performance into account. But in a Web application, at least 90 percent of customers' activity consists of browsing, not buying, which means your application needs to handle a high volume of SELECT statements (versus other data-manipulation operations). Consider your own online shopping habits. How much browsing and comparing do you perform before you purchase something (if you purchase anything at all)? And many customers browse online for information before finally going to a brick-and-mortar store to make the purchase.

In this read-intensive environment, data-retrieval optimization is essential, and several familiar methods of improving SELECT performance come into play: proper indexing, investigating query plans to ensure efficient index utilization, and using stored procedures for plan caching. Correct indexing will probably yield the biggest performance gain, and I also advise you to evaluate your application's queries thoroughly for efficiency. Your database server would probably benefit from an additional 4GB of RAM, too, but many of us don't have that kind of budget. In addition to indexing, and as an alternative to expensive hardware upgrades, one excellent and widely overlooked method of improving query performance is selective denormalization: introducing a small amount of controlled data redundancy into your database. In certain situations, this method can yield dramatic performance increases.

The average e-commerce site has certain database-linked Web pages that most of its Web users hit—for example, a search results page that displays items, descriptions, and pricing information that match a user's search criteria. Query performance on such a heavily hit page needs to be as fast as possible because customers rarely stick around to buy products from a slow site. And because broadband Internet is becoming more widely available, performance will become even more crucial. You can hide a poorly performing Web page from a user who has a 28.8Kbps modem, but that same user might have a Digital Subscriber Line (DSL) next month.

Let's say you have an overtaxed page that receives 10,000 hits every hour and uses a query that executes a five-table join. You can significantly increase performance by turning that query into a simple single-table SELECT statement through some denormalization techniques, which I demonstrate in the following example. The example, which uses a typical e-commerce database schema, adds a new column and runs a bulk update. Any time you introduce redundant data into a database, you need to ensure that changes you make to the original field will propagate to the duplicate field; therefore, the example creates four triggers.

Two characteristics of e-commerce applications let you implement this selective redundancy. First, the Web user issues only SELECT queries against the product-information tables; any updates or inserts that users make involve other tables related to customer information or shopping carts. Second, inserts and updates to the product-related tables are infrequent and occur during calculated, low-volume periods.

If you tried this solution with a traditional in-house OLTP application, table writes would be much more frequent and performance would suffer. So, before you fire up your design tools and start adding redundant columns to your database, make sure the gain in SELECT performance will outweigh the hit in INSERT and UPDATE performance. Web environments almost always benefit from this performance-tuning technique, so let's proceed.

Setting Up

This example involves a fictitious company that sells computers and equipment through mail order and over the phone but that now wants to sell its products online. To create and populate the tables for this example, download and run the script CreateTablesAndPopulate.sql from the "Download the Code" icon at the top of the page. Figure 1 contains the database schema I use in this example.

This design might seem odd—for example, what's the difference between a Product and an Item? Often, when companies venture into e-commerce, they don't sell everything online that they sell through other methods, or they sell products in different combinations. This company sells individual products such as hard drives and mounting kits, but on the Web, the company sells those items only as a set. The Products table in the example schema represents legacy data. The Items table contains the most atomic unit of merchandise the company will sell online. The Products_Items table lets you associate Items for sale over the Internet with the existing units in the Products table. The schema also includes a table for Packages, which are groupings of Items that the company sells at a 10 percent discount. For example, a computer is a Package made up of Items such as a hard drive, RAM, a processor, and so on. This table structure lets you break down a Package in a customer's cart into its component Products for assembly.
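
Figure 1 isn't reproduced here, but based on the description above, a rough DDL outline of the schema looks like this. The descriptive columns and some names are assumptions for illustration only; the authoritative DDL is in CreateTablesAndPopulate.sql:

CREATE TABLE Products (
   ProductId int PRIMARY KEY,
   ProductName varchar(50),   -- assumed descriptive column
   ProductPrice money)

CREATE TABLE Items (
   ItemId int PRIMARY KEY,
   ItemName varchar(50),      -- assumed descriptive column
   ItemIsAvailable bit)

CREATE TABLE Products_Items (
   ProductId int REFERENCES Products(ProductId),
   ItemId int REFERENCES Items(ItemId),
   Quantity int)              -- assumed; the article mentions quantities

CREATE TABLE Packages (
   PackageId int PRIMARY KEY,
   PackageDescription varchar(255))  -- PackagePrice is added later

CREATE TABLE Items_Packages (
   ItemId int REFERENCES Items(ItemId),
   PackageId int REFERENCES Packages(PackageId),
   Quantity int)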

In this scenario, the two most heavily hit pages contain lists of Items and Packages for sale. The major problem with this table design is the location of the price field. Getting information about an Item and its price requires a three-table join, such as the one in Listing 1's stored procedure. Obtaining Package information and pricing requires the five-table join in Listing 2.
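The listings themselves ship with the downloadable code, but a minimal sketch of the shape GetPackageInfo takes, assuming the schema outline above, looks like this:

CREATE PROC GetPackageInfo @PackageID INT
AS
-- Sketch only: the real Listing 2 presumably also returns
-- descriptive columns; the join path and the 10 percent
-- discount are the important parts
SELECT Packages.PackageId,
   SUM(Products.ProductPrice * Products_Items.Quantity
       * Items_Packages.Quantity) * 0.9 AS PackagePrice
FROM Packages
   JOIN Items_Packages ON Items_Packages.PackageId = Packages.PackageId
   JOIN Items ON Items.ItemId = Items_Packages.ItemId
   JOIN Products_Items ON Products_Items.ItemId = Items.ItemId
   JOIN Products ON Products.ProductId = Products_Items.ProductId
WHERE Packages.PackageId = @PackageID
   AND Items.ItemIsAvailable = 1
GROUP BY Packages.PackageId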

These stored procedures would be simpler and run faster if the pricing information weren't five tables away in the Products table. To improve performance, you could add a price column to either the Items table or the Packages table. The company's research shows that most shoppers concentrate their browsing on Packages, not Items, so the best idea is to include the price column in the Packages table. Look at the GetPackageInfo stored procedure in Listing 2 again and imagine the difference if the Packages table included a PackagePrice column. GetPackageInfo would change to:

SELECT * FROM Packages

That's quite a savings in processor time when you consider how many users hit the Package page every hour: 10,000 single-table SELECTs incur much less overhead than 10,000 mega joins.

To begin modifications, create a nullable column of the money data type in the Packages table:

ALTER TABLE Packages ADD PackagePrice money NULL

With that column in place, you need a way to enter each row's pricing to reflect the sum of its Items, minus a 10 percent package discount. To accomplish this task, you can create a script such as the one in Listing 3 to update every row in the Packages table with the current prices.

Note that this script looks similar to the original GetPackageInfo stored procedure, with the addition of a temporary table. The temporary table #PriceHolder is populated with the PackageId of each record in the Packages table and the summed price of each record (minus 10 percent), which you obtain by joining the remaining tables. The PackagePrice column is then updated through a join on the #PriceHolder table. Because this job will affect every row in the Packages table, I recommend scheduling it to run during off-peak hours (typically in the early morning).
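
Under the same schema assumptions, the general shape of such a script is as follows (the real Listing 3 is in the downloadable code):

CREATE TABLE #PriceHolder (PackageId int, Price money)

-- Sum each Package's component Product prices, minus the
-- 10 percent package discount
INSERT INTO #PriceHolder (PackageId, Price)
SELECT Items_Packages.PackageId,
   SUM(Products.ProductPrice * Products_Items.Quantity
       * Items_Packages.Quantity) * 0.9
FROM Items_Packages
   JOIN Items ON Items.ItemId = Items_Packages.ItemId
   JOIN Products_Items ON Products_Items.ItemId = Items.ItemId
   JOIN Products ON Products.ProductId = Products_Items.ProductId
WHERE Items.ItemIsAvailable = 1
GROUP BY Items_Packages.PackageId

-- Copy the summed prices into the new redundant column
UPDATE Packages
SET PackagePrice = #PriceHolder.Price
FROM Packages
   JOIN #PriceHolder ON #PriceHolder.PackageId = Packages.PackageId

DROP TABLE #PriceHolder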

After you run the script, you have an accurate pricing column for each Package. Now you can rewrite the stored procedure to get Package information as

CREATE PROC GetPackageInfo2 @PackageID INT
AS
SELECT * FROM Packages WHERE PackageID = @PackageID

Running either 'EXEC GetPackageInfo 2' or 'EXEC GetPackageInfo2 2' returns from the Packages table the same information about the 500MHz computer with 64MB of RAM, but the second command puts much less load on the SQL Server machine. (Description fields tend to be long, so I suggest viewing the results in a Query Analyzer grid.)

Planning Triggers

Product information is updated daily, and the price column won't serve its purpose unless you keep it updated as well, so you need to create some triggers. Let's look at the schema to determine which specific data modifications will be involved.

In the Products table, you need to account for updates to the ProductPrice column because changes in individual Product prices affect the total price of a Package. Inserts to this table aren't a problem because a new Product doesn't affect a Package's price until an association is made in the Products_Items table. Nor do you need to worry about deletes, because a Product must first be disassociated in Products_Items before it can be deleted. A trigger on the Products_Items table can handle both of those scenarios.

In the Items table, inserts and deletes are also not a concern because they affect pricing only in the association tables Products_Items and Items_Packages. However, the ItemIsAvailable bit column needs attention. In the example, an Item's availability affects Package pricing; the company charges only for the Package Items that ship. Why would the company ship an incomplete product? In this case it probably wouldn't; however, I have encountered IsAvailable-type bit fields so often that I thought including such a column for training purposes was appropriate. So, Package prices need to be updated when an Item becomes available or unavailable.

The triggers in the Products_Items table and the Items_Packages table will need to encompass more than the update triggers on the Products table and the Items table. The triggers need to account for adding new Items to Packages, deleting Items from Packages, and changing quantities of Items in Packages. You have the same needs for Products that make up Items. So you need to create one trigger on Products_Items and one on Items_Packages to apply to inserts, updates, and deletes.

Creating the Triggers

Let's examine the ProductPriceChange trigger that Listing 4 creates on the Products table. The trigger fires when the ProductPrice column is updated. The first part of the trigger uses the previous joins to sum the total price of the Packages minus 10 percent. Note how this part of the trigger is executed in a subquery and aliased as TableOne. This technique is necessary because an UPDATE statement won't support the use of aggregates in the SET portion of the statement. The TableOne result set is then joined with the Packages table and restricted by another subquery that ensures it updates only the Packages that contain the Product whose price was updated.
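
A minimal sketch of that structure, under the same schema assumptions as before (the complete trigger is in Listing 4), looks like this:

CREATE TRIGGER ProductPriceChange ON Products
FOR UPDATE
AS
IF UPDATE(ProductPrice)
BEGIN
   UPDATE Packages
   SET PackagePrice = TableOne.Price
   FROM Packages
      JOIN
      -- Aggregate in a subquery (TableOne) because UPDATE won't
      -- accept an aggregate directly in the SET clause
      (SELECT Items_Packages.PackageId,
          SUM(Products.ProductPrice * Products_Items.Quantity
              * Items_Packages.Quantity) * 0.9 AS Price
       FROM Items_Packages
          JOIN Items ON Items.ItemId = Items_Packages.ItemId
          JOIN Products_Items ON Products_Items.ItemId = Items.ItemId
          JOIN Products ON Products.ProductId = Products_Items.ProductId
       WHERE Items.ItemIsAvailable = 1
       GROUP BY Items_Packages.PackageId) AS TableOne
      ON TableOne.PackageId = Packages.PackageId
   -- Restrict the update to Packages that contain the changed Product
   WHERE Packages.PackageId IN
      (SELECT Items_Packages.PackageId
       FROM Items_Packages
          JOIN Products_Items
             ON Products_Items.ItemId = Items_Packages.ItemId
          JOIN Inserted ON Inserted.ProductId = Products_Items.ProductId)
END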

Listing 5 shows the Items table trigger ItemAvailabilityChange, which is similar to the ProductPriceChange trigger. This trigger also performs the joins to calculate pricing, and the update is restricted to only the Packages containing Items whose availability has changed.

You have now covered all bases except for a situation in which the Products or Items that make up a Package change. To address modifications to the Products that make up an Item, the script in Listing 6 implements the Products_ItemsChanges trigger on the Products_Items table. In the Products_Items table, new rows can be inserted and existing rows can be updated or deleted. Any of those conditions might affect Package pricing, so Products_ItemsChanges accounts for any type of data modification to the Products_Items table. The pricing calculations in the table aliased as TableOne are now familiar; however, the subquery that restricts which Packages the trigger updates has changed significantly. The trigger joins the special Inserted and Deleted tables to determine which Items have been modified. Then the trigger joins those results to the Items_Packages table to determine which Packages the modified Items belong to, again restricting the price update to only those Packages affected.
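
Again as a sketch under the same assumptions, the interesting part of Listing 6 is that changed restriction, which gathers the affected Items from both the Inserted and Deleted tables:

CREATE TRIGGER Products_ItemsChanges ON Products_Items
FOR INSERT, UPDATE, DELETE
AS
UPDATE Packages
SET PackagePrice = TableOne.Price
FROM Packages
   JOIN (SELECT Items_Packages.PackageId,
            SUM(Products.ProductPrice * Products_Items.Quantity
                * Items_Packages.Quantity) * 0.9 AS Price
         FROM Items_Packages
            JOIN Items ON Items.ItemId = Items_Packages.ItemId
            JOIN Products_Items ON Products_Items.ItemId = Items.ItemId
            JOIN Products
               ON Products.ProductId = Products_Items.ProductId
         WHERE Items.ItemIsAvailable = 1
         GROUP BY Items_Packages.PackageId) AS TableOne
      ON TableOne.PackageId = Packages.PackageId
-- Inserted covers new rows and the new side of updates; Deleted
-- covers removed rows and the old side of updates
WHERE Packages.PackageId IN
   (SELECT Items_Packages.PackageId
    FROM Items_Packages
    WHERE Items_Packages.ItemId IN
       (SELECT ItemId FROM Inserted
        UNION
        SELECT ItemId FROM Deleted))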

The Items_PackagesChanges trigger, which Listing 7 creates on Items_Packages, is almost identical to the trigger in Listing 6 and covers all remaining relationship changes that can occur to the member Items of Packages (e.g., updates, deletes, or inserts to the Products_Items table). Now that all the parts are in place, let's test the triggers. First, use the following code to get the current price of PackageID 1, a 500MHz computer with 128MB of RAM:

EXEC GetPackageInfo2 1

This command returns a price of $1080. To test the first trigger, change the price of the Pentium III processor from $500 to $600 in the Products table:

UPDATE Products
SET ProductPrice = 600 WHERE ProductId = 6
GO
EXEC GetPackageInfo2 1

Package 1 now returns a price of $1170, exactly $90 ($100 minus 10 percent) more than before. Next, modify the Items table so that RAM is unavailable:

UPDATE Items
SET ItemIsAvailable = 0 WHERE ItemId = 3
GO
EXEC GetPackageInfo2 1

The computer's price now displays $180 ($200 minus 10 percent) less, at $990. Now modify an Item's member Products to test the third trigger. If hard drives start shipping with integrated mounting kits, you'll need to remove the mounting kit Product from the 6GB hard drive Item:

DELETE Products_Items
WHERE ItemId = 1 AND ProductId = 2
GO
EXEC GetPackageInfo2 1

The Package price has dropped by the price of the mounting kit, $13.50 ($15 minus 10 percent), to $976.50. To test the final trigger, suppose the computer now ships with dual processors. Add another Pentium III processor Item to the computer Package:

UPDATE Items_Packages
SET Quantity = 2 WHERE PackageID = 1 AND ItemId = 4
GO
EXEC GetPackageInfo2 1

The final price on the computer is $1516.50, or $540 ($600 minus 10 percent) more than before. You can be confident that the PackagePrice column will remain consistent through any pertinent data changes.

Mission Accomplished

Now let's look at the results of all the modifications and see what kind of performance gain we've achieved. Figure 2 shows the STATISTICS IO output from the original stored procedure, which used a five-table join. The physical reads will drop off after the data pages are loaded, but 14 scans and 33 logical reads is significant, especially considering that this query will be executed thousands of times an hour. Contrast those statistics with the output from the new stored procedure, which Figure 3 shows. What a difference! The load on the server has dropped from 14 scans and 33 logical reads to 1 scan and 2 logical reads. Clearly, our performance-enhancing mission has been accomplished.
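
If you want to reproduce this kind of measurement yourself, SET STATISTICS IO is all it takes in Query Analyzer:

SET STATISTICS IO ON
GO
EXEC GetPackageInfo 1
GO
EXEC GetPackageInfo2 1
GO
SET STATISTICS IO OFF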

But couldn't you realize another performance gain by following the same procedures and implementing an ItemPrice column in the Items table? Absolutely, and I encourage you to use the preceding example as a template to do just that.
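
The first step mirrors the Packages change (the column name ItemPrice is only a suggestion):

ALTER TABLE Items ADD ItemPrice money NULL

From there, a Listing 3-style script can populate the column by summing each Item's member Products, and the triggers on the Products and Products_Items tables can be extended to keep it current.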

As you can see, the introduction of redundant data into a database demands thorough planning for maintaining that data. But in a high-traffic Web site, the benefit of eliminating a five-table join from one of the most frequently accessed pages far outweighs the overhead added during data modification. If any of your applications fit a similar profile with infrequent updates and heavy query traffic, you might want to consider this underused technique to boost your users' speed. And happy users just might boost your bottom line.
