
Recovering from Isolated Corruption

What's the fastest way to get your system running again?

Downloads
39657.zip

Database disasters are relatively few and far between, but when they occur, they can be devastating. Kalen Delaney discusses how you can prevent the loss of data and productivity after a disaster in "Disaster Prevention: Preparing for the Worst," page 24. But in this article, I talk about recovering from isolated disasters—specifically, failures that affect only part of your database. For example, maybe you lost a subset of data such as a table or part of a table. Although restoring the full database is always an option for recovering data (and sometimes your only option), certain backup strategies and third-party products might help you handle this type of disaster more efficiently and effectively.

Before discussing those strategies, however, let's look at what might cause an isolated disaster. One possible source is hardware failure. But a more likely cause of data loss and, ultimately, downtime is human error. Perhaps a DBA accidentally dropped a table or mistakenly deleted or updated rows because of an incorrect WHERE clause. Or maybe a developer didn't properly test an application's code, so users were able to access and manipulate the wrong data. Recovering from isolated corruption caused by human error can be more difficult than recovering from a hardware failure because you might not know exactly what was damaged, when the damage occurred, or the cause or extent of the damage. When you don't know that information, your recovery can be cumbersome, time-consuming, and prone to additional errors and further data loss. The cause of the disaster determines how you recover from it, so let's look at the ways you approach recovery after hardware failure and after failure caused by human error.

Recovering After Hardware Failure


As soon as your system sustains an isolated hardware failure, such as a disk failure, the database becomes inaccessible and is marked suspect. Damage from a media failure is comparatively easy to repair because you have only one recovery option: find another suitable place for the failed section of the database to reside. This spot can be a new permanent location or a temporary location until replacement hardware is installed. Note that recovery after a media failure is much faster when you keep additional hard disks on hand or set aside supplementary disk space to which you can immediately restore the data by using the WITH MOVE option of the RESTORE command. Otherwise, you might have to order and wait for the replacement disks. After deciding on a location to which you can restore your data, you need to determine how you can best recover the data and to what point in time you can recover it. Your choices depend on your backup strategy, your recovery model, and your knowledge about what has failed.

When you lose only a data file, you might be able to recover the data up to the point of the failure with no data loss. This up-to-the-minute recovery is possible because SQL Server stores within the transaction log detailed information about the data that has changed. To achieve up-to-the-minute recovery, you must be able to access this detailed information in the transaction log of the damaged database and perform a transaction log backup. However, up-to-the-minute recovery might not be possible, and data might be lost in certain scenarios such as the following:

  • You've lost the transaction log file or a part of it.
  • You're using the Simple recovery model in the damaged database, and you perform only periodic full database backups.
  • You're using the Bulk_Logged recovery model in the damaged database, and you've performed a bulk operation since your last transaction log backup.

In these situations, the tail of the transaction log (i.e., the log records for the changes made up to the point of corruption) won't be available to back up, so you can recover data only up to the last backup. As a result, you need to manually rerun or reenter all the changes that occurred after that backup.

You can minimize data loss by running frequent transaction log backups (e.g., every 5 minutes). You can also impose restrictions on the amount of time a database can remain in a bulk-logged state by implementing the changes through batch processes, which programmatically change the recovery models. However, you should take time to determine your best log backup plan as well as your recovery model. (You can read my Web-exclusive article "The Best Place for Bulk_Logged," http://www.sqlmag.com, InstantDoc ID 39782, for details about some of the problems related specifically to the Bulk_Logged recovery model.)
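For example, a batch process might temporarily switch to the Bulk_Logged model only for the duration of a bulk operation, then switch back and immediately back up the log. The following sketch illustrates the idea; the database name and backup path are placeholders:

```sql
-- Narrow the bulk-logged window to the bulk operation itself
ALTER DATABASE SalesDB SET RECOVERY BULK_LOGGED

-- ... perform the bulk operation (e.g., BULK INSERT or SELECT INTO) here ...

ALTER DATABASE SalesDB SET RECOVERY FULL

-- Back up the log immediately so that up-to-the-minute
-- recovery is possible again
BACKUP LOG SalesDB TO DISK = 'D:\Backups\SalesDB_log.trn'
```

Because the bulk-logged window is confined to the batch process, a media failure at any other time still permits a tail-of-the-log backup.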

When a database becomes suspect, you need to decide how to recover your data. Figure 1 shows a flow chart that details the first steps you should take after a database becomes suspect. If you determine that the tail of the log is accessible and that you can back up the transaction log, the first step is to back it up by using the NO_TRUNCATE option on the transaction log backup. This option is designed specifically for a situation in which the data portion of the database is inaccessible but the log can still be backed up. Your next step is to determine the proper recovery path. Depending on the types of backups you've performed, you might be able to reduce the total amount of recovery time by choosing the proper order and type of backups. SQL Server provides two main recovery strategies: the full database backup strategy and the file- or filegroup-based backup strategy. Both strategies let you recover only the damaged file or filegroup. In the full database backup strategy, you perform full backups and transaction log backups regularly (e.g., full database backups weekly and log backups hourly). Differential database backups are optional; however, they can often minimize recovery time. Remember, a differential backup is similar to a full backup but includes only the extents that have changed since the last full database backup. During recovery, you need to restore only the last differential backup instead of restoring the entire sequence of log backups that cover the same time frame.
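A tail-of-the-log backup with NO_TRUNCATE looks something like the following; the database name and backup location are placeholders:

```sql
-- Back up the tail of the log even though the
-- data portion of the database is inaccessible
BACKUP LOG SalesDB
TO DISK = 'D:\Backups\SalesDB_tail.trn'
WITH NO_TRUNCATE
```

This backup captures the changes made between your last log backup and the moment of the failure, which is what makes up-to-the-minute recovery possible.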

In the file- or filegroup-based backup strategy, you don't perform full database backups directly. Instead, you perform them indirectly through more granular file and filegroup backups. Using the file or filegroup backup strategy is more complex than using the full database backup strategy because you perform more backups overall, and each backup is only a subset of the database. However, you can recover an inaccessible subset of the database more quickly by using the file-based backup strategy. For more information about file and filegroup backup strategies, you can review my September 2002 article "Before Disaster Strikes," InstantDoc ID 25915, and its sidebars "Restoring After Isolated Disk Failure," InstantDoc ID 26067, and "Log Backups Paused for Good Reason," InstantDoc ID 26032. In "Restoring After Isolated Disk Failure," be sure to note the problem with setting the read-only attribute for filegroups. This problem still applies in SQL Server 2000 Service Pack 3 (SP3) and SP3a.
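As a rough sketch, a file- or filegroup-based strategy might rotate through the filegroups like this (the database and filegroup names are illustrative):

```sql
-- Back up one filegroup at a time instead of the whole database
BACKUP DATABASE SalesDB FILEGROUP = 'PRIMARY'
  TO DISK = 'D:\Backups\SalesDB_primary.bak'

BACKUP DATABASE SalesDB FILEGROUP = 'RWFG'
  TO DISK = 'D:\Backups\SalesDB_rwfg.bak'

-- Transaction log backups are still required; they tie the
-- individual filegroup backups together at restore time
BACKUP LOG SalesDB TO DISK = 'D:\Backups\SalesDB_log.trn'
```

Spreading the backups out this way is what lets you later restore a single damaged filegroup without touching the rest of the database.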

So how does recovery after hardware failure proceed? The following steps are important to ensure the fastest recovery based on the type of backup strategy you've chosen:

  1. Determine whether the tail of the log is accessible. If it is accessible, back up the tail of the log.
  2. Determine how many files have been damaged.
  3. Recover the damaged portion of the database. If you're using the full database backup strategy and only a few files have been damaged, use the FILE = or FILEGROUP = syntax on the RESTORE command to restore only the damaged files or filegroups. You don't need to restore the entire database. If you're using the file or filegroup backup strategy, you need to restore only the appropriate files or filegroups.
  4. Optionally, use differential backups to advance the damaged files or filegroups to a more recent point in time.
  5. Roll forward the database by using transaction log backups.
  6. Roll forward and recover the database by using the tail of the transaction log. Be sure to use RESTORE LOG...WITH RECOVERY. If you use RESTORE WITH NORECOVERY, you can bring the database online by using RESTORE DATABASE dbname WITH RECOVERY without restoring any backups.
  7. Optionally, and at any point later in time, if you restored the database or part of it to a temporary location, you can move the database to a new location. To move the database later, you need to detach the database, move the files, then reattach the files. For more information about moving databases, see the Microsoft article "Moving SQL Server Databases to a New Location with Detach/Attach" at http://support.microsoft.com/default.aspx?scid=kb;en-us;224071.
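Put together, the sequence for the full database backup strategy might look like the following sketch; the database, file, and device names are placeholders:

```sql
-- 1. Back up the tail of the log
BACKUP LOG SalesDB TO DISK = 'D:\Backups\SalesDB_tail.trn'
  WITH NO_TRUNCATE

-- 3. Restore only the damaged file, moving it to a new disk
RESTORE DATABASE SalesDB FILE = 'SalesData2'
  FROM DISK = 'D:\Backups\SalesDB_full.bak'
  WITH NORECOVERY, MOVE 'SalesData2' TO 'E:\Data\SalesData2.ndf'

-- 4. Optionally apply the most recent differential backup
RESTORE DATABASE SalesDB FROM DISK = 'D:\Backups\SalesDB_diff.bak'
  WITH NORECOVERY

-- 5. Roll forward with the transaction log backups
RESTORE LOG SalesDB FROM DISK = 'D:\Backups\SalesDB_log1.trn'
  WITH NORECOVERY

-- 6. Apply the tail of the log and recover
RESTORE LOG SalesDB FROM DISK = 'D:\Backups\SalesDB_tail.trn'
  WITH RECOVERY
```

Note that every restore except the last uses NORECOVERY; only the final restore of the tail brings the database online.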

Recovering After Human Error


If human error caused the isolated corruption, the recovery problems are more difficult than in the case of hardware failure. As I mentioned earlier, you might not know when or how the damage occurred or the extent of the damage. (The sidebar "An Ounce of Prevention" describes how you can help minimize or prevent human error.) When you become aware of the corruption, the first thing you need to find out is whether the damage has been propagated to a secondary server so that you can decide how to proceed. If you use high-availability technologies such as SQL Server log shipping, you can configure your secondary server to apply the transaction logs after a configurable delay to protect against corrupting the data on the secondary server. And if you use two standby servers, you can configure one server to receive data in real time for use after a site failure or other catastrophic failure and configure the other standby server to receive data after a defined latency for use when isolated damage has occurred. However, if your only standby server applies transaction logs as soon as it receives them from the primary server, both your primary and standby servers will suffer from the same damage.

Once you know whether the damage has affected your secondary server, you can better assess your options. Here are the most important and time-critical steps to take:

  1. Decide whether to take your database offline and whether to switch users to a secondary database (or server).
  2. Decide whether users should have access to the production database (if you didn't switch the users to a secondary server).
  3. Determine the scope of the problem.
  4. Consider your restore options.
  5. Recover the data.

For each of these steps, you can choose from multiple options, each with its own pros and cons. Take time to review all the following options and choose your best disaster-recovery plan.

Controlling user access to the database. This step is important because it lets you prevent further corruption and possibly keep your database available for users so that your business doesn't experience downtime. In the case of human error, SQL Server doesn't mark the database suspect and prevent users from accessing it as it does when the database has suffered a hardware failure such as a damaged disk. You become aware of the database damage only when someone (or something, such as an error message) alerts you to incorrect, missing, or corrupt data.

If you don't have a secondary database that you can switch users to, you can begin your recovery process by either restricting access to the database or making the database read-only. You can restrict access by running the following ALTER DATABASE command:

ALTER DATABASE dbname SET RESTRICTED_USER

Databases marked RESTRICTED_USER allow access only to members of the db_owner role or to system administrators. To make the database read-only, which lets users see but not modify the data, you issue the following command:

ALTER DATABASE dbname SET READ_ONLY

Choosing whether users should have access after the database has been damaged depends on the data, the application, and the level of damage. After you've determined that users can't further damage the data, you can begin to investigate and recover from the disaster.
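Note that both ALTER DATABASE commands wait for current users to disconnect before they take effect. If you need to break existing connections right away, you can add a termination clause (the database name here is a placeholder):

```sql
-- Disconnect current users immediately, rolling back their
-- open transactions, then restrict access to the database
ALTER DATABASE SalesDB SET RESTRICTED_USER
  WITH ROLLBACK IMMEDIATE
```

The same WITH ROLLBACK IMMEDIATE clause works with SET READ_ONLY. Keep in mind that any in-flight transactions are rolled back, so use it deliberately.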

Scoping out the problem. Probably the most difficult part of the recovery process is determining the full scope of the problem. If you know how the information was damaged, you probably also know the scope of the damage. However, if you don't know who caused the damage or how the damage occurred, you need to investigate the damaged database to determine the breadth and depth of the problem. You can investigate the problem by using the SQL Server RESTORE WITH STANDBY command or by using third-party tools.

Typically, to investigate the problem, you restore an earlier version of the database to an alternate location (on the same server or elsewhere) and methodically verify whether the version restored includes the damaged data. The trick to investigating a corruption problem is to use a variety of restore options to restore the database in a read-only state, then continue to restore additional transaction logs or parts of logs. In this standby state, you can review the data between each log application. However, another trick, which you should perform before you apply each log, is to copy an unadulterated version of the data to a third (yes, a third) database. This step is crucial because you save a good version of the data before you apply the next log—and possibly the damaging statements. After you've applied the damaging log, you can't roll back to the earlier version of the data without starting over and re-restoring the entire sequence up to the last good version of the log. You can use the following steps to investigate the source of the problem:

  1. Restore the latest full database backup that you believe includes only undamaged data. If you're using the full database backup strategy, you can restore the entire database or only the damaged files or filegroups by using RESTORE WITH STANDBY. If you're using a file- or filegroup-based backup, you need to restore the entire database because SQL Server doesn't support RESTORE WITH PARTIAL as part of file or filegroup backups.
  2. If you're using differential-based backups, restore the latest database differential backup that you believe includes only undamaged data. Again, if you're using a full database differential backup, make sure to RESTORE WITH STANDBY so that you can continue restoring transaction logs. Read from the restored version of the data and determine whether you've restored a version that's undamaged. Use INSERT/SELECT to copy the restored and undamaged data into the third database (the temporary holding area) so that you can recover from this database if necessary.
  3. Use WITH STANDBY to restore each transaction log, either entirely or by using WITH STOPAT to restore a subset of the transaction log, and slowly review the data from the damaged table. After you've determined that the data is undamaged, TRUNCATE the holding area's data, then INSERT/SELECT the uncorrupted data into the holding area. Your objective in this process is to save the latest version of clean data so that you can copy it into the production database if you so choose. (You work with multiple options in recovery, but a clean version of the data might come in handy.) Repeat this process for each transaction log to get as close as possible to the time the damage occurred. Use STOPAT to specify a time within the log to which the restore will stop instead of restoring the entire contents of the transaction log. By taking graduated steps through the transaction log and reviewing the data, you can verify the state of the data at various incremented points in time. I'll use scripts and multiple restore commands to walk you through a detailed investigation example in a moment.

Considering the restore options. After finding out the scope of the problem, you're ready to choose the most appropriate way to recover the database. Fundamentally, you have two options: recover the entire database to an earlier point in time or recover the damaged data to another location so that you can manually manipulate an earlier and undamaged version of the data back into the production database. Both options carry advantages and disadvantages. If you roll back the entire database, you don't risk data-integrity problems because your database is restored to a consistent point in time. However, depending on how much time has passed since the data became corrupted, you might suffer a significant data loss by restoring only to an earlier point in time. When you restore the database to an earlier point in time, you eliminate all the changes that happened after that point in time.

The second option, manually merging the data, can be error-prone and time-consuming, depending on the type of data that's damaged. When you manually merge data into a database, many other dependencies might also need to change. For example, if a table has foreign keys with cascading constraints, data outside the damaged table is also likely corrupted. That's why learning as much as you can about the scope of the problem is so important. If you choose to recover the damaged data to another location and manually merge undamaged data into the production data, make certain that all database developers, DBAs, and other stakeholders take time to manually verify the data before you reinstate user access.

Recovering the database. The guiding rule of recovery is to use the most efficient path to recover your database. Before you start, here are a few rules to follow. Don't use the RECOVERY clause for the RESTORE command until the final log is ready to restore. If you recover a database too soon, you will need to start the recovery process again, beginning from the first restore in the recovery sequence. Using NORECOVERY on every restore, even the last restore, is safer because you can recover a database without executing a restore. If all backups have been restored and if you used NORECOVERY on every restore, you can use the command RESTORE DATABASE dbname WITH RECOVERY to "recover" a database (roll back any incomplete transactions) without restoring anything. If no options are specified on the RESTORE command, the default option is RECOVERY.

Another rule of thumb is to restore only the files or filegroups that are damaged. Don't restore the entire database unless you have to. Finally, if you choose to restore the production database to an earlier point in time, use WITH STOPAT to control where the recovery stops. If you decide to manually merge data between the holding databases and the production database, you can use INSERT/SELECT to copy the data to the production database most effectively.
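For example, if the damage was confined to rows modified after a known time, a merge from the holding database might look like this sketch (the table, column, and database names are hypothetical, and in practice you'd verify exactly which rows are affected first):

```sql
-- Remove the damaged rows from the production table
DELETE FROM SalesDB.dbo.Orders
WHERE ModifiedDate >= '2003-06-20 13:41:00'

-- Copy the clean versions back in from the holding database
INSERT INTO SalesDB.dbo.Orders
SELECT * FROM HoldingDB.dbo.Orders
WHERE ModifiedDate >= '2003-06-20 13:41:00'
```

Remember the caveat above about dependencies: related tables, triggers, and cascading constraints may require corresponding corrections before users regain access.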

Investigating the Damage


Now, let's walk through an investigation that uses the RESTORE WITH STANDBY option. In the sample scenario that Figure 2 shows, a database has seven files: one in the primary filegroup, three in the read/write filegroup (RWFG), two in the read-only filegroup (ROFG), and one for the transaction log. Let's say you performed a full database backup at time-point 1, a transaction log backup at time-point 2, a differential database backup at time-point 3, and a transaction log backup at time-point 4. The associated SetupTestCaseforPartial&Stopat.sql executable script (which you can download at http://www.sqlmag.com, InstantDoc ID 39657) creates backups that apply to this case study. Between each of the backups, the script modifies data so that each backup has something interesting to back up.

At time-point 4, you realize that corruption occurred in an RWFG filegroup table. You're not sure exactly when the corruption occurred, and you want to work your way through the restores to determine this information. You decide that there's enough room on the same server to restore the RWFG (and the primary filegroup and log), but you want to restore them to a different database and different files. This process requires you to use many of the parameters on the RESTORE command.

To restore this database to a new location, use the WITH MOVE option on the RESTORE command. To restore only the RWFG filegroup, you specify only the filegroup name in the RESTORE clause, then specify PARTIAL in the WITH clause. To make certain that you can continue to apply additional backups to this partially restored database, you also use the STANDBY clause during RESTORE and specify an UNDO file. The UNDO file is where SQL Server places transactional information for the open—and therefore pending—transactions. You place pending transactions in the UNDO file at the end of the restore so that SQL Server can apply (redo) the transactional information as you apply the next transaction log. Listing 1 shows the complete syntax for this restore.
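Listing 1 isn't reproduced here, but the restore it shows combines these options along the following lines; the logical file names and paths are hypothetical:

```sql
-- Restore only the primary filegroup and RWFG to a new database,
-- leaving it in standby so that more backups can be applied
RESTORE DATABASE NewPubsTest FILEGROUP = 'RWFG'
  FROM DISK = 'D:\Backups\PubsTest_full.bak'
  WITH PARTIAL,
    STANDBY = 'D:\Backups\PubsTest_undo.dat',
    MOVE 'PubsTestPrimary' TO 'E:\Restore\PubsTestPrimary.mdf',
    MOVE 'RWFG1' TO 'E:\Restore\RWFG1.ndf',
    MOVE 'RWFG2' TO 'E:\Restore\RWFG2.ndf',
    MOVE 'RWFG3' TO 'E:\Restore\RWFG3.ndf',
    MOVE 'PubsTestLog' TO 'E:\Restore\PubsTestLog.ldf'
```

The primary filegroup is restored automatically along with the named filegroup because SQL Server requires it for any partial restore.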

You can restore the differential backup after you restore the database if you're sure that the damage occurred after you performed the differential backup. However, if you're not sure when the damage occurred, using the individual log backups will give your restores better granularity so that you can work more slowly through smaller restore time frames. In this example, I chose to restore each minute individually. Table 1 shows when the backups were performed and consists of modified output from the RESTORE HEADERONLY command. This command is a helpful way to list the number, time, and type of backups that exist on a backup device or media set.
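To produce a listing like Table 1, you run RESTORE HEADERONLY against the backup device; the device name here is a placeholder:

```sql
-- List the position, date, and type of every backup on the device
RESTORE HEADERONLY FROM DISK = 'D:\Backups\PubsTest.bak'
```

The BackupType and position columns in the output tell you which FILE = number to specify when you restore a particular backup from the set.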

The full database backup in the example finished quickly because the database had just been created and didn't contain much data. The first transaction log backup occurred roughly 3 minutes later, and the differential database backup happened 3 minutes after that. The final transaction log backup took place 3 minutes after the differential backup.

The best way to work through the restores is to pick a time and move forward from that point. Because you might not be sure when the damage occurred, you move through the transaction log backups by using both the STANDBY and STOPAT options on the RESTORE command. To restore the first minute of the first log, you specify a time. The example log was backed up at 13:40:30 and includes all the changes since the last log backup. However, because this is the first log backup after the database was created (and backed up), it really includes only information since the last full backup, which occurred at 13:37:09. The next transaction log backup contains the changes that occurred between the first log backup and this one. When you restore, you can choose any time between 13:40:30.000 and 13:47:12.000, and if you prefer, you can restore some or all of this transaction log. If you need to restore only part of the transaction log, you can restore each minute's or every 30 seconds' worth of information. Although restoring at this level of granularity can become time-consuming, you might find the detail invaluable in some situations.

The Recover&InvestigatewithPartial&Stopat.sql executable script file, which you can also download at InstantDoc ID 39657, shows the syntax for each of the restores. Note that you need to modify this script with current times if you plan to execute it. For each restore, you specify STOPAT and STANDBY and an UNDO file to work your way through the data, investigating the state of the data each time. The following code snippet shows the first transaction log restore, including the STOPAT option and the STANDBY = syntax for specifying an UNDO file:

RESTORE LOG NewPubsTest
FROM <location of backup device(s)>
WITH FILE = 2, -- the backup number, by position
  STOPAT = '2003-06-20 13:38:00',
  STANDBY = <location for UNDO file>

Although the process for stepping through the log backups is a bit tedious, you can determine exactly how far to go with each log application. Between each application, you can select the data, review it, and determine when the problem occurred. Saving the data into another database between loads can also help set you up for a faster recovery if you choose to manually merge the data into production rather than bring the entire production database back to an earlier point in time.

As long as you're performing transaction log backups, you can use RESTORE WITH STANDBY to work through restores at a more granular level—no matter which type of backup strategy you use. However, if your situation meets all three of the following criteria, you can restore only the damaged portion of the database in an alternate location and investigate the problem by using the WITH PARTIAL option on the RESTORE command:

  • You've implemented the full database backup strategy—that is, you performed full database backups and transaction log backups. Differential backups are optional.
  • Your database is separated into multiple files and filegroups.
  • You want to restore only the desired file or filegroup.

Note that if you choose to restore a subset of a database by using RESTORE WITH PARTIAL, SQL Server also requires you to restore the primary filegroup.

Using Third-Party Tools


You can also use third-party tools to investigate a corrupted database. Some of these tools offer options that let you reverse-engineer the transaction log and regenerate the T-SQL statements that performed the operations. These tools often can tell you which user performed which operation at what time. You can then use the reverse-engineered scripts to "change history," using the regenerated scripts as your form of recovery instead of restoring the transaction logs to the database. However, using these scripts could compromise data integrity—especially when you change history. If you remove some commands from the script but continue to run commands that followed the now-removed statements, the statements that follow might generate different results and different data. Additionally, other users might have based their work on information that will be reversed by these modified scripts. These tools can be invaluable, however, for finding and helping resolve the problem quickly and for possibly providing you with information that shows how you can avoid this isolated corruption in the future. For example, you might see an application vulnerability that you can close after the problem is solved.

However, the biggest advantage of these tools is that you can use them to read transaction logs after the changes have been committed. In contrast, SQL Server provides no mechanism for reading the transaction log directly. With SQL Server's native tools, you can implement some form of real-time auditing to determine who caused a problem and when it occurred, but only through triggers, a custom application, or tools such as SQL Server Profiler. With SQL Server's recovery capabilities, you must already know something about what was damaged and what to look for, but with third-party tools, you can go straight to the transaction log—which is more efficient.

The two primary third-party tools for investigating the transaction log are Lumigent Technologies' Log Explorer (http://www.lumigent.com), which Ron Talmage reviewed in "Log Explorer 3.03," January 2003, InstantDoc ID 27272; and BMC Software's SQL-BackTrack, which you can read about at http://www.bmc.com/products/proddocview/0,2832,19052_19429_23365_1058,00.html. Numerous links on the SQL Server Magazine Web site take you to product reviews, recommendations, and gotchas for many different products. You can download and test the free versions of available tools to decide whether they meet your requirements.

Evaluating Your Recovery


After you've recovered the data damaged in an isolated corruption, make sure you take stock of the event. Is there a place where security or application design was violated? In addition, ask yourself whether the recovery went as planned. What went wrong? What went well? What can you do to streamline the next recovery? And most importantly, evaluate what steps you can take in the future to prevent the same sort of event from happening.

Implement strategies that will prevent hardware failure in the future, and consider adding some form of hardware redundancy to minimize problems that might occur. And although you want to prevent human error as much as possible, recognize that no system is truly safe from every accident or mistake. By knowing your options for recovery, however, you can minimize downtime and return the system to a usable and stable state. Develop a few disaster-recovery plans that fit your network, test and document them, then update them as data, recovery models, or other related factors change.
