Setting up and maintaining SQL Server replication can be a frustrating experience. Here are some strategies I've developed that you can use to deal with one common replication gotcha: verifying the subscriber's data. I'll show you how to use the SQL Server 7.0 inline replication validation capability to ensure that the subscriber is in sync with the publisher, and I'll show you how to discover problems before they cause serious trouble in your production environment.
SQL Server 7.0's replication validation capability is one of the coolest new features for administering a replication environment, but few administrators know that it exists. Also, SQL Server Books Online (BOL) doesn't fully explain how this validation feature works. The full replication validation process calculates a rowcount and checksum at the publisher, then compares the result to the rowcount and checksum at the subscriber to verify that the data is in sync. The process calculates one checksum value for the whole table, ignoring data in text or image columns. You can perform a rowcount-only validation, which is slightly faster than a validation that uses both rowcount and checksum. SQL Server 7.0 validation is an inline process, which means that validation doesn't interrupt transactional activity on the publisher or stop the ordinary flow of replication. (The exception is validating bidirectional transactional replication, which I'll discuss later in the article.) The validation process is somewhat different for transactional replication and merge replication, so I'll discuss each separately.
Validating a Transactional Publication
The validation process for a transactional publication starts by calling sp_publication_validation. You can run this procedure manually or set up a job to run it on a schedule. The sp_publication_validation procedure calls sp_article_validation for every article in the publication. If you want to validate only one article in a large publication, you call sp_article_validation directly rather than calling sp_publication_validation.
To see the results of the validation check, you need to run the logreader and the distribution agents, then look in the distribution agent's history log. Screen 1 shows an out-of-synchronization message in the distribution agent history log. Note that if you run sp_publication_validation in a query window, you will see a message such as Generated expected rowcount value of 3 for tab1. Ignore this message. It's a byproduct of the validation mechanism that is generated when the sp_table_validation procedure is executed at the publisher. But this message doesn't indicate whether the subscriber succeeded or failed the validation check.
Here's how the process works: The sp_article_validation procedure calls sp_table_validation, which calculates the rowcount and checksum for the published table. The sp_article_validation procedure then inserts an sp_table_validation call directly into the publisher's transaction log. The process passes the rowcount and checksum values as parameters to the sp_ table_validation call that is inserted into the transaction log. The ordinary transactional replication mechanism replicates the sp_ table_validation call to subscribers and validates the subscriber's data at the point when the subscriber should have the same data as the publisher. This step renders the validation procedure an inline process—a process that works within the usual flow of replication.
The sp_table_validation procedure raises a system message that reports the success or failure of the validation check at the subscriber. The distribution agent checks for the system message that sp_table_validation raises. Then the distribution agent raises the 20574 system message if the validation fails, or the 20575 system message if the validation passes. The distribution agent records the success or failure message in the distribution agent history log. Note that the distribution agent continues replicating transactions to a subscriber even if the validation check reports that the subscriber is out of sync.
Validating a Merge Publication
The validation process for a merge publication starts when the merge agent is run with the -Validate parameter set to 1 (for rowcount-only validation) or 2 (for rowcount and checksum validation). You can set these parameters by right-clicking the merge agent in Enterprise Manager and selecting Agent Properties, then selecting the Steps tab and editing the command line for the Run agent step, as Screen 2 shows. When the merge agent runs, it executes sp_table_validation against the publisher and subscriber and compares the values.
The merge agent raises the 20574 system message if the validation fails, or 20575 if it passes, then records the success or failure message in the merge agent history log, as Screen 3 shows. Note that the merge agent continues replicating changes to a subscriber even if the validation check finds that the subscriber is out of sync.
Automating Replication Validation and Setup Alerts
The mechanism you use to automate the validation process for a transactional publication is different from the mechanism you use for a merge publication. Also, in the case of a merge publication, you use a different mechanism based on whether the merge agent is configured to run continuously or only at scheduled times. For a transactional publication, you simply create a job that runs sp_validate_publication on a scheduled basis; for example, every night at 1:00 a.m. For a merge publication in which you configure the merge agent to run continuously, you set the ValidateInterval parameter to determine how often, in minutes, the process should validate the subscriber (the default is 60 minutes). For example, to validate the subscriber every 24 hours, set validate to 1 or 2 and ValidateInterval to 1440 minutes.
For a merge publication in which the merge agent is configured to run on a scheduled basis, such as every 5 minutes, you need to create a job that will change the validate parameter to 1 or 2 at the time when you want the validation to occur. For example, you can create a job that runs every midnight and uses the sp_change_agent_parameter procedure to change the validate parameter to 1 or 2. Then you can create another job to set the validate parameter back to 0 after the validation is complete. You can set up the second job to be triggered by the 20574 or 20575 alert, which fires when the validation is complete.
To automate a response to the validation check, you need to enable the preconfigured alert for message ID 20574 (failure) and message ID 20575 (success). You can configure these alerts through the Replication Monitor in Enterprise Manager. The alerts can take an action such as notifying an operator or recording a message in a log.
Validating Bidirectional Transactional Replication
Validating bidirectional transactional replication scenarios, including scenarios that use immediate-updating subscribers, is more complicated than validating ordinary transactional publications because users or processes are potentially updating the publisher and subscriber at the same time. To get an accurate validation check, you need to stop all users from making modifications directly against the subscriber while the validation check is performed. Users can still make modifications against the publisher and those modifications will be replicated to the subscriber while the validation is performed. You can use the 20574 and 20575 alerts to notify you when the validation check is complete.
Execution Times and Blocking
The validation process takes out a shared lock while it calculates the rowcount and checksum for a given table, but the calculation doesn't take long to run even on a fairly large table. I measured the execution time to run sp_table_validation on tables of different sizes. I performed the tests on a system with two Intel 333MHz Pentium II processors, 500MB of RAM, and one 9GB SCSI hard drive. Table 1 shows the test results. Note that it took only 30 seconds to calculate a rowcount and checksum against a 5-million-row (250MB) table in which the data was not preloaded into the data cache.
Limitations to Replication Validation
Here are some limitations to keep in mind when you use replication validation.
- The validation process excludes text and image columns when calculating checksum values. You can still run the validation on tables with text and image columns, but the checksum will be calculated only for the data in nontext and nonimage columns.
- So that the validation process can calculate a valid checksum, the table needs to have the identical structure at the publisher and the subscriber—the same columns in the same order, the same data types and lengths, and the same NULL/NOT NULL property.
- You can't use checksum validation with vertically filtered articles because the subscriber has only a subset of the publisher's columns, which results in different checksum values.
- Floating-point values can cause checksum differences if you use character-mode bulk copy program (bcp) to synchronize the subscriber because small, unavoidable differences in precision occur in the conversion between a floating-point number and the number's character-string representation. You can use the native-mode bcp to avoid discrepancies, or you can use a numeric or decimal data type instead of floating-point values. You use character-mode bcp whenever the publication has heterogeneous subscribers.
- To ensure that checksum values are accurate, you need to create the table at the publisher with one CREATE TABLE command. The reason is that the internal structure of the table might be slightly different depending on whether a user created the table with one CREATE TABLE statement or a CREATE TABLE command and a series of ALTER TABLE statements. The algorithm the validation process uses to calculate the checksum is sensitive to these internal differences, so it might return different checksum values for the publisher and subscriber even when the data is identical. Because the automatic synchronization process always uses one CREATE TABLE statement to create the table on the subscriber, the simplest solution is to also create the table at the publisher with one CREATE TABLE statement. (If you want to verify that the internal table structures at the publisher and subscriber are identical, you can compare the offset column values for each table in the syscolumns system table.)
Most serious SQL Server replication problems occur when the subscriber gets out of sync with the publisher. The users might see inaccurate data, or the replication process might fail because the data is out of sync. For example, the distribution agent can fail, causing a duplicate key error. By becoming familiar with SQL Server 7.0's inline replication validation capability, you can detect synchronization problems before they wreak havoc in a production environment.