Advanced BACKUP and RESTORE Options

Infrequently used options can improve disaster recovery

A lot of information is available about planning backup and restore strategies, as well as using the BACKUP and RESTORE commands. Although most DBAs are familiar with these commands’ standard options, many don’t realize that some of the lesser-known options can be really useful. With that in mind, I’ll introduce you to some of these infrequently used options and explain how they can benefit you.

I’m going to assume that you’re familiar with concepts such as backups, restores, data and log files, and transactions. If you want a refresher, take a look at the training videos on backup internals and restore internals that I recorded for Microsoft last year.

Verifying Backup Integrity

One of my wife’s favorite sayings is, “You don’t have a backup until you’ve restored it.” I have to agree. Verifying backup integrity is something that’s woefully neglected by the DBA community. When helping out with corruption cases on various forums, I’m often told that the backups are also corrupt.

It’s a little-known fact that you don’t have to actually restore a backup to make sure it isn’t corrupt in some way. SQL Server 2005 introduced page checksums as a way of allowing SQL Server to detect whether data file pages have been corrupted while outside of SQL Server’s control (most commonly by the hardware or software comprising the I/O subsystem). Page checksums are enabled by default for databases created on SQL Server 2005 and later, but databases upgraded from an earlier version don’t get them automatically—you need to enable them yourself.
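Here’s a minimal sketch of enabling page checksums on an upgraded database (the database name is a placeholder), along with a query to check which verification option each database is currently using:

ALTER DATABASE AdventureWorks SET PAGE_VERIFY CHECKSUM;

-- Check the page verification setting for all databases
SELECT name, page_verify_option_desc FROM sys.databases;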

The page checksum is checked whenever a data file page is read into memory for processing. This doesn’t mean that all corruptions will be detected, however. Many workloads don’t regularly read all allocated pages in the database; typically the only operations that do are consistency checks and full backups. Consistency checks cause page checksums to be checked, but by default, full and differential backups don’t.

You need to use the BACKUP command with the CHECKSUM option to make sure page checksums are checked. This option is extremely useful, because it notifies you when your backup has corruption in it. Furthermore, if a page corruption is found, the backup will fail (again by default) with a message similar to that in Figure 1. When you use this option, the backup process will also calculate a checksum over the entire backup and store it in the backup header.
Figure 1: Error message noting a failed backup due to page corruption
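A minimal example looks like the following sketch; the database name and backup path are placeholders (INIT simply overwrites any existing backup sets on the device):

BACKUP DATABASE AdventureWorks
TO DISK = N'D:\SQLBackups\AdventureWorks_Full.bak'
WITH CHECKSUM, INIT;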

Alternatively, some DBAs use the RESTORE command with the VERIFYONLY option to check the integrity of a backup. Unfortunately, this approach only looks at the backup header and not the data in the backup. You need to use the RESTORE command with both the CHECKSUM and VERIFYONLY options. Only then will the restore process recheck all the page checksums and recheck the checksum for the entire backup. Note that you can’t verify the page checksums of a backup that didn’t use the CHECKSUM option when the backup was performed.
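For example, the following sketch (again with a placeholder path) rechecks all the page checksums and the whole-backup checksum without restoring any data:

RESTORE VERIFYONLY
FROM DISK = N'D:\SQLBackups\AdventureWorks_Full.bak'
WITH CHECKSUM;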

Using the CHECKSUM option during backups and performing backup verification should be a regular part of testing your backup strategy, unless you actually restore your backups to test them. The extra CPU usage from generating the page checksums is minuscule and shouldn’t noticeably increase the backup operation’s duration. You can quickly ascertain whether backup checksums are being used by looking at the has_backup_checksums column in the backupset table in msdb or at the HasBackupChecksums field in the output of the RESTORE HEADERONLY command for any single backup. Check out my blog post for more information and examples.
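As a quick sketch, a query like this shows whether recent backups used checksums:

SELECT database_name, type, backup_finish_date, has_backup_checksums
FROM msdb.dbo.backupset
ORDER BY backup_finish_date DESC;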

Minimizing Data Loss When a Disaster Occurs

When a disaster occurs, the first reaction of a DBA is often to fail over to a redundant system or start restoring from backups. As soon as an in-place restore operation is started, the possibility of backing up the very last portion of the transaction log disappears forever. This kind of backup is known as a tail-of-the-log or tail-log backup because the portion of the log that hasn’t been backed up is known as the log tail.

Unless you’re using synchronous database mirroring or some form of synchronous SAN replication, restoring a tail-of-the-log backup is the only way to achieve the absolute minimum amount of data loss following a disaster. Tail-of-the-log backups are always possible in the full or bulk-logged recovery model, with one exception: when a minimally logged operation has occurred since the last log backup and the data files are damaged or missing. In that case, because the log backup would also require reading the data file extents changed by the minimally logged operation, a tail-of-the-log backup isn’t possible. This means that any time there are user transactions that can’t be lost, you should avoid the bulk-logged recovery model; you might encounter a situation in which the tail-of-the-log backup isn’t possible, and data is lost.

If you try performing a tail-of-the-log backup and the data files are missing or damaged, you’ll get an error message similar to the one that Figure 2 shows. In that case, you can use the NO_TRUNCATE option with the BACKUP statement, which will allow the log backup to proceed. This works because the database metadata that tracks backups is mirrored in the file header page of the log file. Even if the entire SQL Server instance is unavailable, you can take the surviving log file to another SQL Server instance, attach the log file to that instance, then perform a tail-of-the-log backup. I explain how to perform normal and instance-unavailable tail-of-the-log backups in my blog post "Disaster 101: Backing up the Tail of the Log".

Figure 2: Error message noting a failed backup due to missing or damaged data files
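A minimal tail-of-the-log backup for a database with damaged or missing data files might look like this sketch (the database name and path are placeholders):

BACKUP LOG AdventureWorks
TO DISK = N'D:\SQLBackups\AdventureWorks_Tail.trn'
WITH NO_TRUNCATE, INIT;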

Note that in SQL Server 2005 and later, a database that has a log tail that hasn’t been backed up can’t be overwritten by a restore operation unless you use the REPLACE option. This is a great feature for preventing accidental data loss.
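For completeness, a restore that deliberately discards the log tail looks like the following sketch; only use REPLACE when you’re certain the tail isn’t needed:

RESTORE DATABASE AdventureWorks
FROM DISK = N'D:\SQLBackups\AdventureWorks_Full.bak'
WITH REPLACE;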

Improving Backup Performance

Most DBAs don’t realize that they can substantially improve the performance of backup operations. This is beneficial for multiple reasons, including:

  • It reduces the time that the extra I/O workload (and potentially CPU workload for compressed backups) is present on the system.
  • It reduces the potential for the transaction log file to have to grow to accommodate the transaction log generated while a backup operation is running, especially for long-running full database backups.

The easiest way to improve backup performance is to allow the backup operation to parallelize, which is known as backup striping. By default, there’s a single data reader thread for each drive letter or mount point being read from and a single data writer thread for each backup device being written to. In general, having more reader and writer threads should improve performance up to the point where the I/O subsystem becomes a bottleneck for either reads or writes. This saturation point will vary, depending on your system. The SQL Server Customer Advisory Team (SQLCAT) article “Tuning the Performance of Backup Compression in SQL Server 2008” contains a comprehensive discussion of backup striping, along with a wealth of test data and graphs showing the possible gains.
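For example, this sketch stripes a backup across three devices; the drive letters and paths are placeholders:

BACKUP DATABASE AdventureWorks
TO DISK = N'D:\SQLBackups\AW_Stripe1.bak',
   DISK = N'E:\SQLBackups\AW_Stripe2.bak',
   DISK = N'F:\SQLBackups\AW_Stripe3.bak'
WITH INIT;

Bear in mind that all three stripe files are then required to restore the backup.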

A more advanced way to improve backup performance is to manually specify the number of backup I/O buffers to use (with the BUFFERCOUNT option) and the size of each buffer (with the MAXTRANSFERSIZE option). Again, the best combination of these options is going to vary by system, but significant gains can be made. Care must be taken when modifying these values, because the amount of virtual address space required (in addition to that used by the buffer pool) will be BUFFERCOUNT × MAXTRANSFERSIZE, which can lead to out-of-memory errors on 32-bit systems.
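As a sketch (the values are illustrative, not recommendations), the following uses 50 buffers of 4MB each, which consumes roughly 200MB of virtual address space beyond the buffer pool:

BACKUP DATABASE AdventureWorks
TO DISK = N'D:\SQLBackups\AdventureWorks_Full.bak'
WITH BUFFERCOUNT = 50, MAXTRANSFERSIZE = 4194304;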

There is obviously a trade-off here, which Jonathan Kehayias covers in depth in his blog posts "An XEvent a Day (17 of 31)" and "An XEvent a Day (18 of 31)". He includes links to even more material from Microsoft Customer Service and Support (CSS).

Minimizing Downtime When a Disaster Occurs

One of the usual goals of a restore operation is to restore as little as possible to reduce the amount of downtime required. With the advent of piecemeal restores, this became much easier. Online piecemeal restores help even more. The phrase piecemeal restore means restoring only a subset of the database, such as a single-page restore (using the PAGE option) or a single-file restore (using the FILE option). Online piecemeal restores allow restoring a portion of a database while the rest of the database is online. Partial database availability makes online piecemeal restores possible.

One feature I see few people using is single-page restore. With the exception of database-wide allocation bitmaps and unique pages such as file header and boot pages (the same restrictions that limit automatic page repair with database mirroring), all pages can be single-page restored as long as there’s a way to bring the restored page up to the current time. This means you must have transaction log backups available. Some pages (like those storing table metadata) might fail with an online restore but succeed with an offline restore.

If your consistency checks tell you that you have a single corrupt page (or even a small number of corrupt pages), it will be far faster to restore them using page restores than to restore the file or file group that they’re part of. You can read more about page restores in the SQL Server Books Online (BOL) topic “Performing Page Restores” and in my MSDN blog “Fixing damaged pages using page restore or manual inserts”.
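A minimal single-page restore sequence might look like the following sketch, where the database name, backup path, and page ID (file 1, page 143) are placeholders:

RESTORE DATABASE AdventureWorks
PAGE = '1:143'
FROM DISK = N'D:\SQLBackups\AdventureWorks_Full.bak'
WITH NORECOVERY;
-- Restore any subsequent log backups WITH NORECOVERY, take and
-- restore a fresh log backup to bring the page up to the current
-- time, then complete the restore:
RESTORE DATABASE AdventureWorks WITH RECOVERY;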

Sometimes it might be necessary to restore an entire very large database (VLDB), but even then you can reduce the amount of time it takes to bring that database online by architecting for partial database availability and using online piecemeal restores. In this case, you’re restoring just a portion of the database and bringing it online so that you don’t have to wait for the entire database to be restored.

Introduced in SQL Server 2005, partial database availability and online piecemeal restores are available only in the Enterprise Edition. Partial database availability is the technology that lets you have parts of the database offline and continue processing using the online portions. Online piecemeal restores let you restore the offline parts while keeping the database online.

For example, suppose a VLDB storing current and historical sales data is destroyed. If the VLDB has a single data file, restoring the database to a point that the sales application can work properly requires restoring the entire VLDB. However, if the VLDB has been architected to store different portions of data in different file groups (e.g., current sales in one file group, 2010 sales in another, 2009 in another, and so on), the database can be partially restored and brought online without requiring all the data to be restored. This involves restoring the PRIMARY file group using the PARTIAL option, then restoring the current sales file group, after which the database can be brought online. Later, online piecemeal restores can be performed to restore the other file groups containing historical data. As you can imagine, this can lead to some huge savings in application downtime.
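As a sketch of that scenario, with hypothetical database, filegroup, and path names:

-- Restore the PRIMARY and current-sales filegroups and bring
-- the database online:
RESTORE DATABASE Sales
FILEGROUP = N'PRIMARY', FILEGROUP = N'CurrentSales'
FROM DISK = N'D:\SQLBackups\Sales_Full.bak'
WITH PARTIAL, NORECOVERY;
-- Restore log backups as needed WITH NORECOVERY, then:
RESTORE DATABASE Sales WITH RECOVERY;

-- Later, restore a historical filegroup while the database
-- stays online, followed by the log backups needed to bring
-- that filegroup online:
RESTORE DATABASE Sales
FILEGROUP = N'Sales2010'
FROM DISK = N'D:\SQLBackups\Sales_Full.bak'
WITH NORECOVERY;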

You can read more about partial database availability in the SQL Server BOL topic “Performing Piecemeal Restores”. I provide a sample script for this type of restore in the SQL Server Questions Answered blog “Using Partial Database Availability for Targeted Restores”.

Restoring to a Point in Time

Occasionally you might need to restore a database to a particular point in time that isn’t the most recent time possible. A sample scenario is when someone has deleted some data and you need to restore the database from a log backup created right before that deletion. If you know the time the deletion occurred, you can simply use the STOPAT option to specify that the log restore shouldn’t proceed past the time specified.
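For example, if you know the deletion happened at 12:30 P.M., a sketch like this (with placeholder names, paths, and times) stops the restore just before it:

RESTORE DATABASE AdventureWorks
FROM DISK = N'D:\SQLBackups\AdventureWorks_Full.bak'
WITH NORECOVERY;

RESTORE LOG AdventureWorks
FROM DISK = N'D:\SQLBackups\AdventureWorks_Log.trn'
WITH STOPAT = N'2011-06-14 12:29:00', RECOVERY;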

If you don’t know when the problem occurred, you can perform an operation I call “inching through the log.” This restore operation uses both the STOPAT and STANDBY options to progress through the log by a small amount each time to see whether you’ve reached the correct point. The webcast “Part 10: Recovering from Human Error (Level 200)” walks through this scenario (as well as a few more).
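Each step of the inching process looks something like the following sketch; the STANDBY undo file path is a placeholder:

RESTORE LOG AdventureWorks
FROM DISK = N'D:\SQLBackups\AdventureWorks_Log.trn'
WITH STOPAT = N'2011-06-14 12:15:00',
   STANDBY = N'D:\SQLBackups\AdventureWorks_Undo.bak';
-- The database is now readable; check whether the deletion has
-- occurred yet. If not, repeat with a slightly later STOPAT time.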

An alternative is to use marked transactions. A marked transaction creates a log mark in the transaction log. An example is

BEGIN TRANSACTION PaulsTran WITH MARK;

You can then use this known point when restoring a database. If you want to restore the database up to and including the marked transaction, you use the STOPATMARK option. If you want to restore the database up to but not including the marked transaction, you use the STOPBEFOREMARK option.
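For example, to restore up to but not including the PaulsTran marked transaction (the paths are placeholders):

RESTORE LOG AdventureWorks
FROM DISK = N'D:\SQLBackups\AdventureWorks_Log.trn'
WITH STOPBEFOREMARK = N'PaulsTran', RECOVERY;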

To use either option, you need to know the name of the log mark. If you haven’t manually recorded them, they’re stored in the logmarkhistory table in msdb—which is one more reason to make sure you have backups of the msdb database. If you lose the information in the logmarkhistory table, it’s difficult to find the log mark names manually. You’d have to use a third-party tool or use undocumented log and log backup commands.
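If msdb is intact, a quick sketch like this lists the recorded log marks:

SELECT database_name, mark_name, description, mark_time
FROM msdb.dbo.logmarkhistory
ORDER BY mark_time DESC;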

If you used the same log mark name multiple times, you need to tell SQL Server which one you want to stop at by also specifying a datetime with the AFTER clause; recovery then stops at the first mark with that name at or after the specified time.
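For example, this sketch stops at the first PaulsTran mark at or after noon (the date, time, and paths are placeholders):

RESTORE LOG AdventureWorks
FROM DISK = N'D:\SQLBackups\AdventureWorks_Log.trn'
WITH STOPATMARK = N'PaulsTran' AFTER N'2011-06-14 12:00:00',
   RECOVERY;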

Typically, marked transactions aren’t used when only a single database is involved. However, they’re the only reliable way to restore multiple databases to the same transactionally consistent point, whether on one instance or across multiple instances. If all the databases are on a single instance, the log mark will be automatically inserted into the transaction log for each database. For databases on remote instances, a distributed transaction must be used to force the log mark into each transaction log. Details of this procedure are in the SQL Server BOL topic “Using Marked Transactions (Full Recovery Model)”.

Restoring all necessary databases to the same point can then be done by specifying the same log mark name (and AFTER datetime, if one was used) in each database’s RESTORE LOG command.

Restarting an Interrupted Restore Operation

One of the RESTORE command’s least-known options is WITH RESTART, which lets you restart an interrupted restore operation. Periodically, a restore operation writes a checkpoint file that describes the point to which the restore has progressed. Checkpoint files (which have nothing to do with regular database checkpoints) are written to the \InstanceName\MSSQL\Backup folder.

A checkpoint file is updated when:

  • Database file creation and zeroing is complete
  • Each backup set has been processed
  • The redo part of recovery is finished

If the checkpoint file exists and you use the WITH RESTART option, whatever steps have already been completed will be skipped. If you use the WITH RESTART option and there isn’t a checkpoint file, you’ll receive a message similar to the one in Figure 3.
Figure 3: Error message noting that a checkpoint file wasn’t found

This option is really meant for large backups that span several tapes; it lets you restart a restore operation without having to go back to the first tape. However, it can also be useful for restoring disk-based backups. For instance, if your restore operation successfully created and zeroed the data and log files but was interrupted before copying could be completed, the WITH RESTART option will prevent the restore from having to go through the earlier operations again, saving precious time.
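Restarting is simply a matter of reissuing the same RESTORE command with the extra option, as in this sketch (the names and paths are placeholders):

RESTORE DATABASE AdventureWorks
FROM DISK = N'D:\SQLBackups\AdventureWorks_Full.bak'
WITH RESTART;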

The Most Beneficial Option

I’m often asked which of the lesser-known BACKUP and RESTORE options is the most beneficial. You can get tremendous performance gains by using the BUFFERCOUNT and MAXTRANSFERSIZE options to tailor the I/O buffers. Using the PAGE and PARTIAL options to perform single-page and piecemeal restores can dramatically reduce downtime when disaster strikes. However, if I had to select only one option, I’d single out using the NO_TRUNCATE option to perform tail-of-the-log backups. It’s the only way to ensure that you can recover right up to the point of the disaster if the data files are unavailable.

I hope that you’re now curious about some of these lesser-known BACKUP and RESTORE options. Investigating them further could be advantageous to you and your company.
