How to Restart Failed Log Shipping Quickly

Primer on Log Shipping

Log shipping: it's a process for maintaining a copy of an existing database on a different server by applying any new transaction log backup taken on the original database. The setup for log shipping is quite simple:

The original database (let's call it the primary copy) is in a logging state. You can't perform log shipping without transaction log backups after all.
You need a "seed" backup in order to create the target copy of the database. That target copy will need to be in a recovery state, meaning that it can accept additional backups (applied in an order that maintains the continuity of transactions intact - see this article on transaction logs, log sequence numbers, and recovery for more information.)
Four SQL Server Agent jobs that can be created automatically through a wizard or manually using T-SQL that will backup the transaction log on the host for the primary copy of the database, copy the backups to a network location of your choosing, and finally the restore job for applying the transaction logs against the target copy of the database. The additional job to watch for success/failure of any of the steps in the process.

A variety of issues may cause the log shipping process to fail. Some of these failures - such as backups not being taken on the primary database due to lack of disk space perhaps - are more critical than others. Regardless, when log shipping stops and you receive that dreaded alert from the SQL Server Agent if you created some form of notification scheme for the alert job, you need to be able to get log shipping up and running quickly. Odds are that in a log shipping situation you're taking frequent transaction log backups (every five minutes or even more frequently depending on transactional load). This means that even in a situation when you're oniline during work hours you may find yourself trying to resolve the underlying failure all while transaction log backups keep coming with no way of knowing which transaction log backup to apply next if simply restarting the restore SQL Agent job doesn't solve the issue.

This solution I'm about to show you specifically addresses the concern of not knowing which transaction log to apply after you have resolved the underlying issue(s) associated with a failed log shipping process and that restart of the SQL Agent target database restore job does not pick up where your log shipping left off.

It's All About the LSN

The LSN, or Log Sequence Number is the trail of bread crumbs that allows any recovery process in SQL Server to know the order of transactions to be applied. All recovery processes need to occur in this specific order to ensure that transactions are read from the transaction log and log backup files in such a manner that they were applied to the primary database originally.

You typically start at the beginning of the the book when you begin reading. (Unless you're so pessimistic about whether you'll live to get to the end of a book that you read the last page first. I know people like this.) The transaction log is read in the same fashion. It's not uncommon that occasionally log shipping will get hung up and forget where it was in the book - it loses its bookmark. This may be because the backups have stopped as mentioned earlier, or perhaps because the copy process went sideways for some reason and the restore process can't find the next log to apply to maintain backup chain of events. Once resolved it should be just the matter of kicking off the restore job in the target database's SQL Agent Job Activity Monitor but even that can not work. If so you need to find the next transaction log backup in the chain of events and jump start the process by either applying just that one log backup or the full backlog in secession then start up the log shipping restore job. Finding the next backup to apply that has the next LSN in the backup chain is the key to all of this and what this article provides insight towards.

Manually Restarting the Log Shipping Process

If you find that you need to jump start the next log file backup in the sequence to get log shipping re-initiated then you just need to identify the next backup in the chain to apply and ensure it's in the expected restore directory. Using the msdb.dbo.log_shipping_monitor_secondary table you can identify the last copied transaction log backup, the last restored transaction log backup, and then do a check to make sure the next transaction log backup that was taken on the primary instance is copied over to the target restore directory along with any others that follow should the copy job have stopped. Once that's done you should be able to manually restart the SQL Server Agent job for the pertinent log shipping restore process and be on your way:

SELECT last_copied_file
	, last_restored_file 
	, last_copied_date
	, last_restored_date
	, last_restored_latency
FROM msdb.dbo.log_shipping_monitor_secondary
WHERE secondary_database = 'sm_FOO';

You can then switch over to the primary database's SQL Server instance and query for the transaction log backup history for the database to determine the next transaction log that follows the last restored file:

WITH full_backups AS 
(
SELECT 
	ROW_NUMBER() OVER(PARTITION BY database_name ORDER BY database_name ASC, backup_finish_date DESC) AS [Row Number],
	server_name,
	database_name, 
	backup_set_id,
	backup_finish_date,
	last_lsn,
	media_set_id
FROM msdb.dbo.[backupset] 
WHERE [type] = 'L'
	AND [database_name] = 'sm_Panel'
)

SELECT TOP 10
	FB.server_name, 
	FB.database_name, 
	FB.backup_finish_date,
	BMF.physical_device_name AS primary_database_physical_device_name,
	'RESTORE LOG [' + FB.database_name + '] FROM  DISK = N''' + PD.backup_share + RIGHT(BMF.physical_device_name, LEN(BMF.physical_device_name) - LEN(PD.backup_directory))  + ''' WITH  FILE = 1,  NORECOVERY,  NOUNLOAD,  STATS = 10' AS restore_command_for_log_shipped_copy
FROM full_backups FB 
 INNER JOIN msdb.dbo.backupmediafamily BMF ON FB.media_set_id = BMF.media_set_id
 INNER JOIN msdb.dbo.log_shipping_primary_databases PD ON FB.database_name = PD.primary_database
WHERE FB.backup_finish_date > '2016-09-07 13:00:00.000'
ORDER BY FB.[Row Number] DESC;

That provides the following (excerpt of) results in order of backup_finish_date up to the most recent backup for the primary copy of the database:

Looking at the first record in that far right column (truncated due to page width constraints here) you'll get an idea of the dynamic SQL output:

RESTORE LOG [sm_FOO] 
FROM  DISK = N'\\SQL-DR-123\Backup\TxLogs\sm_FOO\sm_FOO_20160907200000.trn' 
WITH  FILE = 1,  NORECOVERY,  NOUNLOAD,  STATS = 10

So what exactly did we do in this process?

On the target database copy's instance we queried msdb.dbo.log_shipping_monitor_secondary to identify the last backup file restored.
Using that information we then moved to the primary database's instance to get a history of transaction log backup taken against that source database for all activity after that last applied backup. The script produces a dynamic restore statement that can then be run against the target instance to kick-start the log shipping process once again.
At this point you should be able to attempt to restart the log shipping restore SQL Agent job.

I do want to take a minute to break down that second query since there are quite a few moving parts in it. The objects involved are:

msdb.dbo.backupset: provides backup start and finish times for each backup file in the metadata stored in msdb. This system catalog view also provides us with the server name, database name, and media_set_id column which is necessary for a subsequent join to msdb.dbo.backupmediafamily.
msdb.dbo.backupmediafamily: speaking of dbo.backupmediafamily, this catalog view provides us with a critical piece of the puzzle: the physical file for the transaction log backups on the primary database.
msdb.dbo.log_shipping_primary_databases: this catalog view provides us with the log shipping information for the backup share (where the transaction logs are "shipped" to in order to be restored on the secondary (target) instance.

Conclusion

Log shipping can stop for a variety of reasons: changes in permissions in one of the many directories used in the log shipping process, space issues, server failure, and manual backups taken out of band are but a few. Instead of taking guesses on what backup file needs to be run to re-initiate log shipping rely on the metadata available in both the primary and secondary instances to make the process easily recoverable.

Comments

Plain text