Divide and Conquer

Architectural redesign + transactional replication = smashing performance

DBAs know that the first step in determining the long-term viability and capacity of any system is to implement performance-monitoring metrics. Mindful of that imperative, several years ago we devised a program to keep tabs on performance changes in a call center's processing system that runs Windows NT and SQL Server 7.0. Initially, we found that the Performance Monitor showed us current performance metrics but didn't give us insight into long-term trends and capacity changes. So to collect data over time, we set up long-term monitoring by taking snapshots every half hour of performance metrics such as CPU and I/O utilization, numbers of users and processes, order processes, and transactions. Then we stored the metrics in a SQL Server database. The year we began our monitoring program, we collected data for only a few consecutive months. This exercise showed us that data gleaned from such a short time period produced only short-term information—too few data points to determine a trend. However, the following year, when we monitored for longer time frames, the data we gathered revealed significant trends. Over a year's time, we observed an alarming elevation in CPU utilization that we attributed to an increase in the center's custom application features and the addition of many system users.
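
For illustration, a minimal version of such a collection setup might look like the following T-SQL. The table and column names are hypothetical; the CPU and I/O figures would come from a Performance Monitor log, while counters such as the user-connection count can be read from master..sysperfinfo inside SQL Server.

  -- Hypothetical table for the half-hourly snapshots (names are illustrative).
  CREATE TABLE PerfSnapshot (
    SampleTime      datetime NOT NULL DEFAULT (GETDATE()),
    CpuPercent      decimal(5,2) NULL,  -- captured from Performance Monitor
    DiskIOPerSec    decimal(9,2) NULL,  -- captured from Performance Monitor
    UserConnections int NULL,
    TransPerSec     int NULL
  )

  -- One value SQL Server itself can supply: the current user-connection count.
  INSERT INTO PerfSnapshot (UserConnections)
  SELECT cntr_value
  FROM master..sysperfinfo
  WHERE object_name = 'SQLServer:General Statistics'
    AND counter_name = 'User Connections'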

First Attempts


To stem the rise in CPU utilization, we repeatedly upgraded the hardware until SQL Server 7.0 was running on the largest and fastest CPU available. Next, we turned to a comprehensive program of tuning the database, hardware, and SQL statements. Although performance tuning became an ongoing process, demands on the CPU ultimately outpaced our optimization efforts. In July of 2000, we determined that the current database system would likely reach full performance capacity by year's end. Figure 1 shows actual CPU utilization from July through December 2000. Rewriting the applications would take much longer than a year, and we weren't certain that modified applications would solve the problem even then. We had to take more drastic measures.

Off-Loading Reporting Functions


Transactional replication seemed to be the answer because our goal was to reduce the load on the publisher and because the reporting queries didn't perform database updates. However, we decided that replicating the entire database wasn't feasible. Because the database exceeded 40GB, the amount of time that the Snapshot Agent would need to create a snapshot was prohibitive and would require more hardware on the distributor and subscribers. In addition, because all tables participating in transactional replication must contain a primary key, we'd need to modify the database.
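
A quick way to gauge how much schema work that requirement would have meant is to list the user tables that lack a primary key; the following query is a sketch of that check.

  -- List user tables that have no primary key and therefore can't be
  -- published for transactional replication without modification.
  SELECT name
  FROM sysobjects
  WHERE type = 'U'
    AND OBJECTPROPERTY(id, 'TableHasPrimaryKey') = 0
  ORDER BY name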

The final solution hinged on the application's custom stored procedures. During our performance tuning, we painstakingly captured and analyzed SQL traces and metric data to identify stored procedures and ad hoc SQL code in the call center's application that were taking the longest to run and using the most CPU capacity. We assigned the stored procedures and SQL statements to two categories that each required a different optimization technique: short-running online transaction processing (OLTP) queries and long-running, read-only reporting queries. The read-only queries were perfect for off-loading (and therefore reducing the drain on the CPU). We considered diverting the read-only queries to a data warehouse or data mart but rejected that option because they were an integral part of the application and needed to run in near realtime. However, we realized that we could replicate the portions of the database that these queries used and divert them to secondary servers. This solution not only increased the available system capacity in the short term by letting us off-load work to subscribers, but it also proved to be an innovative and scalable solution for the future. In the latter half of 2000, we began to off-load some of the reporting functions to replicated servers.
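
For example, if a Profiler trace is saved to a table (here called TraceData, a hypothetical name), a query along these lines ranks the captured statements by total CPU consumption to surface off-load candidates.

  -- Rank traced statements by total CPU to find the most expensive queries.
  SELECT TOP 20
         CONVERT(varchar(255), TextData) AS SqlText,
         COUNT(*)      AS Executions,
         SUM(CPU)      AS TotalCPU,
         SUM(Duration) AS TotalDuration
  FROM TraceData
  GROUP BY CONVERT(varchar(255), TextData)
  ORDER BY SUM(CPU) DESC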

Project Steps


Once we determined the solution, designing the project became our first task. The project unfolded in the following five stages.

Redesigning the architecture. During the initial system redesign, we identified three important criteria: the modified architecture should be dynamic, fault-tolerant, and scalable. To make the call center's custom application more dynamic, we at first programmed it to determine at startup which server would run a particular reporting function. However, in that scenario, if a reporting server failed after application startup, the reporting queries still needed to be rerouted so that they would continue to run, and because the server location was set only at startup, the only way to pick up the new routing was to restart the application. That approach wasn't dynamic enough, so we placed a lookup table on the main server. This table contains a list of reporting queries and the name of the server on which each query runs. Now, when a user submits a report, the application queries the lookup table and runs the report on the server that the lookup entry names. Although accessing a lookup table incurs some overhead, the performance gains more than offset it.
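
The lookup table itself is simple. The following sketch shows a hypothetical version of the table and the query the application issues when a user submits a report; all table, report, and server names are illustrative.

  -- Routing table: which server runs which reporting query.
  CREATE TABLE ReportServerLookup (
    ReportName varchar(50) NOT NULL PRIMARY KEY,
    ServerName varchar(50) NOT NULL
  )

  INSERT INTO ReportServerLookup VALUES ('DailyCallVolume', 'REPTSRV1')
  INSERT INTO ReportServerLookup VALUES ('AgentStatus',     'MAINSRV')

  -- At report time, the application reads the routing entry and then
  -- connects to that server to run the reporting query.
  SELECT ServerName
  FROM ReportServerLookup
  WHERE ReportName = 'DailyCallVolume'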

Scalability was the next criterion we addressed. We divided the reporting functions into three categories according to their business purposes, then assigned each category to a corresponding server. The categories specify how current the data must be: realtime, near realtime, or day-old. Realtime reporting queries must run on the main server because they depend on the instantaneous state of the database. Near realtime functions can tolerate the 8-second delay that replication requires, so we assigned this group to a second server, which holds about 20 percent of the main database's data. Day-old reporting functions involve long-term data, so we decided to keep an image of yesterday's data on a third server. This architectural redesign reduced the amount of data that we needed to replicate and provided substantial scalability by letting us spread functions among multiple systems.

One advantage of the redesign is that all reports that access the near realtime server use the same data. Therefore, we can spread multiple reports across multiple subscribers. Although the main server is a Microsoft Cluster Server (MSCS), we haven't needed to cluster the reporting servers. If a reporting server fails or a replication error occurs, we can quickly and easily modify the application lookup table to switch subscribers or point back to the main server for a short time. Before the redesign, we had to change subscribers or send the reports to the main server manually. Using multiple subscribers for the same reports gives us the added benefit of fault tolerance. Because transactional replication incurs the same overhead on the publisher regardless of the number of subscribers, we can implement two reporting servers for much less overhead and cost than one clustered reporting server. The application can continue to run, even when one reporting server fails. Figure 2 shows the architectural redesign.
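
Repointing reports when a subscriber fails is just an update to that routing table; a hypothetical example (again with illustrative names):

  -- If reporting server REPTSRV1 fails, send its reports back to the
  -- main server until the subscriber is available again.
  UPDATE ReportServerLookup
  SET ServerName = 'MAINSRV'
  WHERE ServerName = 'REPTSRV1'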

Modifying the application. Because we developed and maintained the application in-house, modifying it was feasible. However, modification was still an enormous task because we had to analyze 180 reports to determine which reporting category each belonged in: realtime, near realtime, or day-old. The application had to be able to determine, by reading the lookup table, the server on which each replicated function should run, connect to the appropriate server, run the reporting query there, then disconnect when the query had finished. As we continue to enhance the application, we modify most reporting queries to use this lookup table, even though they might initially still point to the main server. Using this lookup table gives us greater flexibility in routing the queries, the ability to move them later, and scalability for future reporting requirements.

Implementing transactional replication. We chose transactional replication for this project for several reasons. Because the reporting queries were read-only, we didn't need to implement bidirectional replication. In addition, transactional replication creates much less overhead on the publisher than merge replication (because merge replication is trigger-based, and triggers claim significant overhead). Also, we didn't want the added DBA workload that merge replication generates. In implementing transactional replication, we took great care to coordinate the replication with the application modifications. We especially needed to make sure that all the necessary tables and stored procedures were replicated. This effort required a lot of coordination and good communication between the development and DBA groups.
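
Although we configured replication through the standard SQL Server tools, the following heavily trimmed sketch shows roughly what publishing a single table for transactional replication looks like in T-SQL. Server, database, and object names are illustrative, and many parameters are omitted.

  -- Create a continuous transactional publication in the published database.
  EXEC sp_addpublication
       @publication = 'OrdersPub',
       @repl_freq   = 'continuous',
       @status      = 'active'

  -- Add the table as an article.
  EXEC sp_addarticle
       @publication   = 'OrdersPub',
       @article       = 'Orders',
       @source_object = 'Orders'

  -- Push the publication to a reporting subscriber.
  EXEC sp_addsubscription
       @publication       = 'OrdersPub',
       @subscriber        = 'REPTSRV1',
       @destination_db    = 'CallCenterRepl',
       @subscription_type = 'push'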

Initially, we implemented the replication before making the application modifications so that we could test the replication's performance and configuration. During testing, we determined that the replication's performance was unacceptable. The default value of the polling interval for both the Log Reader Agent and the Distribution Agent was 10 seconds, so a transaction could take as long as 20 seconds to move from publisher to subscriber. That replication delay was unacceptable because the users were accustomed to realtime or near realtime reports. By modifying the polling intervals for both the Log Reader Agent and the Distribution Agent to 2 seconds, we shortened that time to a maximum of 4 seconds, an acceptable performance level.
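
The polling interval is a parameter on the agents' job-step command lines. After the change, a Distribution Agent step looks something like the following (server and database names are illustrative); the Log Reader Agent (logread.exe) takes the same -PollingInterval switch.

  distrib.exe -Publisher MAINSRV -PublisherDB CallCenter
      -Subscriber REPTSRV1 -SubscriberDB CallCenterRepl
      -Distributor DISTSRV -Continuous -PollingInterval 2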

We discovered one important architectural design problem while we were incorporating transactional replication into the change-control process. Initially, we put all the replicated objects into one publication. Although this approach made administration easier, it also made change control difficult. Call-center business rules (and hence the reports) often change, so we had to alter the corresponding stored procedures and schemas. Each time we made a change that affected a replicated object, we had to break and recreate the entire publication. The solution was to make each object a separate publication, so that now when we change an object, all we have to do is modify a specific publication and approximately 100 subscriptions. We don't have to create a snapshot for the entire publication set, only for the affected objects. Granted, we have more publications to manage, but administration is much easier in the long run.

Testing and tuning the solution. Every time we moved a new reporting query to the replicated servers, we tested it thoroughly. Because a report might access 10 to 20 tables and call several stored procedures, it's easy to move a report that's missing an underlying table or stored procedure. To avoid this possibility, we completed a full application regression test (by testing all available reporting options) every time we made a change to the replication model. In addition, because we were running only reporting queries on these servers, we tuned the indexes only for reporting queries, not for OLTP. Using this process, we could remove some indexes and modify others for better performance. We still use this process. To keep the data and indexes in an optimum state, we also run DBCC CHECKDB and DBCC DBREINDEX every week.
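
The weekly maintenance itself is short; a sketch of the commands, using an illustrative database and table name, follows.

  -- Weekly integrity check and index rebuild on a reporting subscriber.
  DBCC CHECKDB ('CallCenterRepl')
  DBCC DBREINDEX ('Orders')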

Maintaining the distributed system. Maintaining the system became slightly more difficult as we added servers. We needed to administer a distributor and two or more subscribers in addition to the main server. To properly and efficiently handle change requests and monitor the overall health and availability of the replication publications, we found we needed to allocate at least 30 percent of two senior DBAs' time to administering the replication architecture. Part of their time is devoted to monitoring discrepancy thresholds. At the time we implemented the redesign, we established discrepancy thresholds for each table publication. Now, SQL Server compares subscriber table row counts against the row counts in the main server, and when the established discrepancy threshold is exceeded, SQL Server sends an urgent notification to the database team.
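
The row-count comparison can be implemented several ways; the following simplified sketch assumes the subscriber is reachable as a linked server and uses xp_sendmail for the notification. All object names, the address, and the threshold value are illustrative.

  DECLARE @pubRows int, @subRows int

  -- Count rows on the publisher and on the subscriber (linked server).
  SELECT @pubRows = COUNT(*) FROM CallCenter.dbo.Orders
  SELECT @subRows = COUNT(*) FROM REPTSRV1.CallCenterRepl.dbo.Orders

  -- Alert the database team when the discrepancy threshold is exceeded.
  IF ABS(@pubRows - @subRows) > 500
     EXEC master..xp_sendmail
          @recipients = 'dbateam@example.com',
          @subject    = 'Replication row-count discrepancy: Orders'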

During the testing process, we decided that because of business rules, we needed to replicate user accounts and passwords to the subscribers. Transactional replication can't replicate this user information, so we had to develop our own method of replicating it by using Data Transformation Services (DTS) packages. Through the SQL Server Agent, SQL Server executes the DTS packages to update the login and user information every 5 minutes. The most typical changes to user information are changing passwords, changing roles (an application-security feature to group users for different types of data access), and adding or deactivating users.
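
Assuming the DTS package already runs from a SQL Server Agent job, the 5-minute schedule amounts to a call such as the following; the job and schedule names are illustrative.

  -- Run the login-synchronization job every 5 minutes, all day.
  EXEC msdb..sp_add_jobschedule
       @job_name             = 'Sync logins to subscribers',
       @name                 = 'Every 5 minutes',
       @freq_type            = 4,   -- daily
       @freq_interval        = 1,
       @freq_subday_type     = 4,   -- units of minutes
       @freq_subday_interval = 5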

Early on, we decided that snapshot replication wasn't feasible for the day-old server: the reports that run there require the entire 40GB of the call center's data, and snapshot replication would hold a table lock on each replicated table for too long. Because of the primary-key problems mentioned earlier, transactional replication wasn't feasible either. Instead, we decided to maintain the day-old server through a process of backup and restore. This process offers several advantages: it performs better than snapshot replication, and we would have performed the backups anyway. It also lets us validate the backups immediately after they're created by restoring them.

Maintaining the day-old server is a delicate balance of automated and manual tasks. The most crucial objective is to ensure that the day-old server is current up to midnight of the previous day. The secondary objective is to complete the backup-and-restore process in time to put the system back online before the next business day. The process, which requires night operators and on-call DBAs to ensure successful completion, follows (a sketch of the restore commands appears after the list):

  1. The system operators reload the databases on the day-old server each night from the most current production backup.
  2. The system operators restore production transaction-log backups on the day-old server after the database restores are completed to bring the database as up-to-date as possible.
  3. If the day-old server isn't online before the start of the next business day, the DBAs point the day-old reports back to the main server until the day-old server is again online.
  4. If the day-old server can't be brought online within a reasonable amount of time (approximately 2 hours) after business hours begin, the day-old reports remain pointed to the main server until the day-old server can be refreshed during the normal schedule that evening.
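
A simplified sketch of the nightly restore sequence follows; the database, logical file, and path names are illustrative.

  -- 1. Reload the day-old database from the latest full production backup,
  --    leaving it ready to accept transaction-log restores.
  RESTORE DATABASE CallCenterDayOld
  FROM DISK = 'D:\Backup\CallCenter_full.bak'
  WITH MOVE 'CallCenter_Data' TO 'E:\Data\CallCenterDayOld.mdf',
       MOVE 'CallCenter_Log'  TO 'F:\Log\CallCenterDayOld.ldf',
       REPLACE, NORECOVERY

  -- 2. Apply the production transaction-log backups taken up to midnight.
  RESTORE LOG CallCenterDayOld
  FROM DISK = 'D:\Backup\CallCenter_log_2330.trn'
  WITH NORECOVERY

  -- 3. Bring the database online for the day's reporting.
  RESTORE DATABASE CallCenterDayOld WITH RECOVERY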

Results


After implementing the replicated systems and slowly migrating more reporting queries off the main server, we've seen a dramatic decrease in CPU utilization as well as an increase in performance. Figure 3 shows the drop in CPU utilization compared to the spike of the 6 previous months. (We implemented the replication in the latter part of 2000.) Although SQL Server replicates a large amount of data very quickly, the CPU resources used on the distributor are fairly low because transactional replication is highly efficient. The reduction in CPU utilization occurs even as we continually add users to the server. Figure 4 displays user counts.

The resources that the reporting servers use are also within acceptable limits. We constantly monitor those resources and have an emergency plan ready if they begin to rise to unacceptable levels: we simply add more subscribers and update the lookup table to point reports to those servers, distributing the load further without users ever knowing the change occurred and without making additional changes to the applications. Another way of measuring our success is in the response time of certain queries. Figure 5 shows an example of one such query. As you can see, this query posted a 500 percent performance gain after we implemented replication.

Although, in general, this solution is successful, it carries a few limitations. Some problems with replication require rebuilding a publication and resyncing the data using a fresh snapshot. Also, because of locking problems, we can't always create new snapshots during normal business hours while the system is busy.

Innovative and Scalable


A successful project is always the result of several contributing factors. We can't emphasize enough the importance of good coordination between the development and DBA teams. You also need excellent system-performance metrics, long-term system-resource monitoring, and an application equipped for response-time measurements. Defining metrics early in the process is crucial so that you can assess your degree of success. Most importantly for our project, redirecting the data flow solved our major problem and opened extra avenues of improvement. Implementing transactional replication to off-load reporting queries achieved two goals. It reduced the CPU utilization on our main server, thereby increasing CPU performance, and it left us free to add more reporting servers as we added new users. This scalable, robust solution should last for years to come.
