Copying a huge amount of data to several servers poses a formidable challenge. SQL Server's copy utilities—among them bulk copy program (bcp), Data Transformation Services (DTS), and snapshot replication—can take more time to copy the data than is available in most production environments. After experimenting with these utilities, I developed a way to optimize the performance of large snapshot replication to levels acceptable in a real-world scenario. In "Bottleneck Blues," May 2002, I analyzed the performance of snapshot replication, identified where the performance bottlenecks occur, and suggested a way to clear the bottlenecks by customizing the replication steps.
In this article, I detail the full implementation of these customized steps. You configure the steps manually through Enterprise Manager and use T-SQL, a remote stored procedure, and SQL Distributed Management Objects (SQL-DMO) in a custom COM object to customize the steps. You can apply this article's listings and executable scripts to push and pull snapshot replication subscriptions in SQL Server 2000 and 7.0 without modification. (To download the listings and executables, go to the "Download the Code" link at the top of the page.) To confirm the mathematical analysis that I presented in the first article, I include test results confirming that you can improve snapshot replication performance significantly when you optimize the replication by customizing its steps.
Configuring the Agent Jobs
To configure snapshot replication, you typically create a publication first, add articles, then add subscriptions to the publication. The configuration process creates a Snapshot Agent job and one or more Distribution Agent jobs. (For details about the process, see "Bottleneck Blues" and review "Implementing Replication" in the Enterprise Manager section of SQL Server Books Online—BOL.) The agents perform the replication by executing steps in their jobs. By default, a Snapshot Agent's job consists of three steps—Log agent startup message, Run agent, and Detect nonlogged agent shutdown. This job's Run agent step executes the snapshot.exe command-line utility. Also by default, a Distribution Agent job consists of the same three steps as the Snapshot Agent job. However, its Run agent step executes the distrib.exe command-line utility, which consists of three substeps. The schema (sch) substep deletes the target records, the bulk copy (bcp) substep bulk-copies the snapshot to the target tables, and the index (idx) substep creates indexes, if any, on the target tables. Because indexes and foreign key constraints are often defined on the target tables, the sch and bcp substeps become the bottlenecks. But if you could drop the indexes and constraints before these two substeps and recreate them afterward, the much faster TRUNCATE operation (or the default of dropping, then recreating the table) could replace the DELETE operation in the sch substep, and you wouldn't have to maintain the indexes during the BULK INSERT operation in the bcp substep. Fortunately, you can modify the Distribution Agent's job to incorporate the additional steps. You do so by inserting the Drop Indexes and Keys (drop) step before the Run agent step and inserting the Re-create Indexes and Keys (recreate) step afterward.
Adding steps to the Distribution Agent job creates impediments to the job's smooth operation, so first you have to determine the best way to configure both agent jobs to run most effectively. In Enterprise Manager, you can set the customized Distribution Agent job to disable, to start running at a scheduled time, or to start when the SQLServerAgent service starts. (The SQLServerAgent service is a Windows 2000 or Windows NT system service.) When this service starts, it also starts all the jobs that you set to Start automatically when SQL Server Agent starts. Setting a Distribution Agent job to start automatically doesn't work for this customization project. Once a job starts automatically and reaches Run agent, by default the job executes distrib.exe with a Continuous argument to poll new snapshots at a certain polling interval until the SQLServerAgent service stops or until you stop the job manually. Consequently, the Distribution Agent job's recreate step never executes, and the drop step never executes for the next snapshot polled.
If you set a customized Distribution Agent job to start running at a scheduled time, the job proceeds through each step without hanging on Run agent and stops until the next schedule begins. However, scheduling a Distribution Agent job doesn't work well for the customization project either. Predefining such a schedule is difficult because the schedule must ensure that the Distribution Agent job doesn't start before the Snapshot Agent job is finished and must not overlap the schedules of other Distribution Agent jobs.
To start and stop each Distribution Agent job at the right time, you have to serialize the Snapshot Agent job and the Distribution Agent jobs. To accomplish the serialization, you first need to disable each Distribution Agent job schedule so that the job starts only manually or on call. Next, customize the Snapshot Agent job so that when its Run agent step is finished, the job starts one Distribution Agent job at a time. You customize the Snapshot Agent job by inserting the new Start Distribution Agent Jobs (start distributors) step immediately after the Snapshot Agent job's Run agent step, as Figure 1 shows. During the start distributors step, the sp_start_job stored procedure executes against each Distribution Agent job one at a time. Figure 2 shows two Distribution Agent jobs started, one for each remote server (subscriber).
Adding the Start Distributors Step
Now, let's look at how to add the start distributors step to a Snapshot Agent job. A typical and straightforward way to customize an agent job is to use Enterprise Manager. First, to customize a snapshot replication, open the Table Article Properties dialog box for each article, then click the Snapshot tab. Select the Delete all data in the existing table option, then clear the referential integrity and indexes options. (Note that choosing the DROP the existing table and re-create it default option instead of Delete all data in the existing table would work as well if you have no problems dropping tables on a production server.) The .sch file that the Snapshot Agent generates for each article will then contain a TRUNCATE TABLE statement for the corresponding target table, and the .idx file will be empty. You can find these .idx and .sch files in the snapshot folder, which by default is in the ReplData subdirectory under the SQL Server home directory (e.g., D:\Microsoft SQL Server\mssql\repldata). To view your snapshot folder's path, select the name of your publisher server from the server tree in Enterprise Manager, pull down the Tools menu, select Replication, then Configure Publishing, Subscribers, and Distribution. The Publisher and Distributor Properties dialog box will appear. On the Publishers tab, click the (...) button in the third column of the selected publisher row to bring up the Publisher Properties dialog box. In that dialog box, the Snapshot folder text box shows the path of the snapshot folder.
Next, to navigate to the agent jobs that your snapshot replication has created, again in Enterprise Manager, expand the server that functions as the publication's distributor. (Note that you can choose the same server for both distributor and publisher, but if this server has to run both agents, it will expend more resources than two separate servers.) Expand Replication Monitor, Publishers, and a specific publisher's name, then select your publication. Because at this time Enterprise Manager's right pane lists one Snapshot Agent job and one or more Distribution Agent jobs for that publication, this window, which I call the agent jobs window, is the best place to customize the agent jobs. Right-click any agent job and select Agent Profiles. The Agent Profiles dialog box contains read-only, system-type profiles that you can't modify. However, you can create a modifiable user-type profile by copying from an existing system profile. To issue the maximum number of concurrent bulk copies, the snapshot.exe and distrib.exe commands will use the profile's parameters (e.g., MaxBcpThreads) at the Run agent steps.
To customize the Snapshot Agent job, perform the following steps. Right-click the job named Snapshot in the agent jobs window, then select Agent Properties. In the Properties dialog box, click the Schedules tab, then set the starting time for the Snapshot Agent job. Click the Steps tab, select Detect nonlogged agent shutdown, then click Insert. In the New Job Step dialog box, which Figure 2 shows, enter the name of the step (e.g., Start Distribution Agent Jobs), then accept TSQL as the default command type. Next, select msdb as the database, then enter in the Command window an sp_start_job stored procedure call for each Distribution Agent job.
The two calls to sp_start_job start two Distribution Agent jobs, one for each subscriber. To find the name of a Distribution Agent job for the sp_start_job stored procedure's @job_name parameter, right-click a Distribution Agent job in the agent jobs window, then select Agent Properties. In the Properties dialog box, copy the agent name. This custom step starts each Distribution Agent job right after the Snapshot Agent job's Run agent step. To modify the flow of each step upon success or failure, double-click a step. In the Edit Job Step dialog box that appears, click the Advanced tab, then select your On success action or On failure action. Typically, you want to proceed to the next step on success so that the job runs serially through the steps until it finishes.
Adding the Drop Step
After you customize the Snapshot Agent job, you customize the Distribution Agent jobs by adding steps to drop and recreate all indexes and key constraints. To find the Distribution Agent jobs you want to modify, follow the path in Enterprise Manager that you took to find the Snapshot Agent job (in the agent jobs window). Right-click each Distribution Agent job, then select Agent Properties. In the Properties dialog box, click the Schedules tab, double-click each enabled schedule, select any schedule-type option except Start automatically when SQL Server Agent starts, then disable the schedule. Disabling the schedule causes the job to start only when the start distributors step calls it in the Snapshot Agent job. Switch to the Steps tab. Insert the drop step before the Run agent step, enter the step name, then accept the default command type and database, as Figure 3 shows. In the Command window, enter a stored procedure that drops all indexes and key constraints on all target tables in the remote subscription database. This stored procedure (hub_remote_drop_save_subscribed_article_indexes) accesses system tables in the publication database and therefore must be stored in the publisher. However, the Distribution Agent job runs Figure 3's EXEC statement at the distributor. This statement assumes that the publisher and the distributor are the same server. Otherwise, the statement would have included the qualified name of the HUB1DB linked publisher, as follows:
EXEC HUB1DB.master.dbo.hub_remote_drop_save_ subscribed_article_indexes 'hub2_publish_db', 'CGIS1', 'hub4_publish_db', 'addressing_truncate'
Figure 3's EXEC statement drops all the indexes and key constraints of target tables at the subscriber by calling the stored procedure from the distributor.
Now that you know how to create the drop step in the Distribution Agent job, you can implement the hub_remote_drop_save_subscribed_article_indexes stored procedure. Implementation occurs in four steps.
Step 1. After you've provided the @publication_db, @subscriber, @destination_db, and @publication variable names, the stored procedure collects a string of destination table names (in @article_list) for subscribed articles in the publication by using a cursor to loop through all the names in the replication system tables. The code in Listing 1 creates the cursor. (Listing 1 is a snippet of the complete stored procedure—hub_remote_drop_save_subscribed_article_indexes.sql—which you can download from the Web.) To collect destination table names from the sysarticles system table, the stored procedure first has to find the subscribed article IDs and the publication ID. Note that this code takes into account the possibility that subscribers might not always subscribe to all articles in a publication. The subscribed article IDs in the syssubscriptions system table are identified by the combination of the subscriber's server ID (srvid) plus @destination_db. You can access srvid from the sysservers system table when you provide the subscriber's name, and you can retrieve the publication ID from the syspublications system table when you provide the publication's name. (For more information about using system tables, see Kalen Delaney's series of Inside SQL Server columns: "Accessing System Tables," March 2000; "Property Functions, Schema Views," April 2000; and "Direct Access," May 2000.
Step 2. Using the target-table names string, from the distributor, the stored procedure calls the remote procedure hub_drop_save_ article_indexes (stored in the subscriber) to drop and save indexes and constraints for target tables at the subscriber, as Listing 1 shows. Note that to allow a remote stored-procedure call, you have to join the distributor and the subscriber to each other as linked servers by giving them data-access and remote-login permissions. To fulfill those requirements, in Query Analyzer you can execute the stored procedures that Listing 2 shows. For details about using these system stored procedures to add linked servers, see sp_addlinkedserver under "Transact-SQL Reference" in BOL.
Step 3. At the subscriber, the remote stored procedure uses a COM object, which calls the SQL-DMO API to drop indexes and key constraints on the destination tables and save their recreation scripts. I recommend using SQL-DMO objects for several reasons. SQL scripting—including scripting for index or key constraint creation—can become complex, so using SQL Server's script-generation functionality is better than manually creating the scripts through T-SQL code. Enterprise Manager lets you generate DROP, CREATE INDEX, and key constraint scripts for the set of destination tables by right-clicking a subscriber's table, selecting All Tasks, then selecting Generate SQL Script. However, any change to indexes, key constraints, articles, and the subscription makes the scripts obsolete.
SQL-DMO provides one Script method for almost every object. The Script method returns a SQL statement that you can use to recreate the referenced object. After the drop step has saved all the scripts of existing indexes and key constraints for destination tables in a stored procedure at the subscriber, the recreate step of the corresponding Distribution Agent job can execute remotely.
Step 4. To facilitate using SQL-DMO objects and their Script methods in the remote stored procedure, you can encapsulate all the SQL-DMO code in a custom COM component. This component exposes a simple interface method that the stored procedure can call. The method drives all the dropping and scripting. I named this component Hubindex, the interface TableIndex, and the method Drop_Save_Table_Indexes. I also implemented this component as an in-process server that runs within the same memory space as that of the stored procedure. In Visual Basic (VB), you create this component by using an ActiveX DLL project, and you must check the reference to the SQL-DMO Object Library.
SQL Server provides OLE Automation stored procedures such as sp_OACreate to create a COM object and sp_OAMethod to call its method. Listing 3 shows how I used those OLE Automation stored procedures in the remote stored procedure. (Listing 3 is a snippet of the complete remote stored procedure—hub_drop_save_article_indexes.sql—which you can download from the Web.) Given HubIndex.TableIndex, the programmatic identifier (ProgID) of an OLE object, sp_OACreate creates the object and outputs an object token that identifies this object in T-SQL. You must pass this object token when you use sp_OAMethod to call the Drop_Save_Table_Indexes method of this object. For a script that shows you how to use OLE Automation stored procedures, see the "OLE Automation Sample Script" topic in BOL.
The Drop_Save_Table_Indexes method in turn calls the Script methods of SQL-DMO objects and saves the creation scripts in a programmatically created stored procedure (part of whose name is @publication), which stores them in @destination_db. For example, given a publication named addressing_truncate, the Drop_Save_Table_Indexes method would programmatically create a stored procedure named create_addressing_truncate_indexes. By naming the programmatically created stored procedure this way, you can distinguish among multiple stored procedures within the same subscription (destination) database. A safeguard for using this drop (destructive) method is the boolean flag @is_drop, which saves the scripts without dropping (destructing) the corresponding indexes and key constraints if it's set to 0 (false). This method is general enough for dropping and saving the scripts of indexes and key constraints for any list of tables.
Note that the custom COM component, the remote stored procedure, and the target tables are all local at the subscriber. The actual dropping and saving of the indexes and keys occur at the subscriber without a round-trip back to the distributor until the remote procedure finishes. For a detailed description of the drop-and-save process and a VB code snippet of the VB COM object's Drop_Save_Table_Indexes method, see the Web sidebar "Stop, Drop, and Save," which includes Web Listing A. Web Listing A is a snippet of the Drop_Save_Table_Indexes method, part of the complete executable DMO_drop_save_indexes_listing.zip VB project that implements this method.
Adding the Recreate Step
The Distribution Agent remotely calls the create_
Putting Optimization to the Test
How does a snapshot replication perform after you've optimized it by customizing the Snapshot Agent and Distribution Agent jobs? Let's look at the optimized replication's performance during my tests. Figure 5 shows the elements of the test environment and Figure 6, describes the characteristics of each test. Note that each test result is a typical representative of five to six samplings of the same test. I performed all tests during the 10:00 pm to 6:00 am off-peak hours to make sure that other applications didn't introduce an irrelevant load.
Because the sp_start_job stored procedure starts a job and returns immediately without waiting for the job to finish, it's essentially an asynchronous stored procedure (i.e., the completion and return of this stored procedure doesn't signify the completion of the job that this stored procedure starts). The start distributors step in a Snapshot Agent job (see the Command window in Figure 2) therefore essentially starts all the Distribution Agent jobs concurrently. To achieve serial distributions (in Tests 2 and 4), first select any schedule-type option for all Distribution Agent jobs except Start automatically when SQL Server Agent starts, then disable the schedules. Choosing a schedule-type option prevents the Run agent step (i.e., the distrib.exe command) from running continuously. Second, start only one Distribution Agent job at the Snapshot Agent job's start distributors step, then insert another start distributors step before the last step of the first started Distribution Agent job. This sequence starts another Distribution Agent job, and so on, serially. Table 1 summarizes the results of these tests.
The most significant performance difference between concurrent and serial optimized distributions (i.e., between Test 1 and Test 2) occurs at the drop and recreate steps. The reasons are clear: The drop and recreate steps take place on each subscriber independently, without the intervention of the distributor. For concurrent distributions, only one drop and one recreate step appear to take place—a complexity of O(c) (constant-time complexity), regardless of the number of subscribers—whereas for serial distributions to n subscribers, n drop and recreate steps will occur—a complexity of O(n) (linear-time complexity). For more information about quantifying the execution time mathematically, see "Bottleneck Blues." The bcp step takes place between one distributor and n subscribers. Beyond some threshold t, the number of concurrent bcp processes start overloading the distributor and lose the advantage of concurrency. To further optimize, a more sophisticated customization would run t sp_start_job commands against t Distribution Agent jobs concurrently at the Snapshot Agent job's start distributors step. Assuming these jobs will finish roughly at the same time, pick an arbitrary Distribution Agent job and insert another start distributors step before the recreate step to start another batch of Distribution Agent jobs. The effect is serial batches of concurrent t distributions. Because the start distributors step is asynchronous, it essentially overlaps with its next step, the recreate step. In effect, your job would have n/t batches of drop and bcp steps, but only one recreate step (that of the last batch).
The following comparisons further highlight the advantage of optimized concurrent distribution and high-end hardware based on the test results:
- o versus u: The optimized concurrent distribution (Test 1) is 63 minutes faster than its unoptimized counterpart (Test 3). The savings, as mentioned before, is largely from the sch step, and to some extent from the bcp step.
- concurrent versus serial: Two concurrent distributions didn't overload the distributor; therefore, the threshold t apparently is more than two (t > 2).
- high end versus low end: The I/O is much slower on subscriber CGISTEST1 than on CGIS1 because the latter is equipped with a RAID 5 of 6 fibre channel SCSI drives. However, in optimized distributions (Tests 1 and 2), the low-end subscriber performs no worse than the high-end one. The optimized-distribution result doesn't hold for unoptimized distributions (Tests 3 and 4), in which the low-end subscriber falls slightly behind. This slippage occurs because the sch and bcp substeps are I/O-intensive operations that can flood a low-end disk drive with I/O requests much faster than a high-end drive.
Manual Customization Speeds Performance
Customizing the Snapshot Agent job and the Distribution Agent jobs is crucial for optimizing the performance of a snapshot replication. My test results show a significant performance gain in sch steps and better scalability in bcp steps for an optimized snapshot distribution. Enterprise Manager provides a friendly interface for manually customizing these replication agent jobs. However, in a large-scale environment that has many servers and might be reconfigured frequently, manual customization becomes impractical. You need a faster, more reliable, and reproducible way of performing the customization. In a follow-up article, I'll show you how to automate some of these customization tasks by implementing them through SQL-DMO.