SharePoint 2010 Disaster Recovery, Part 2

In Part 1 of this article (“SharePoint 2010 Disaster Recovery, Part 1”), I discussed the various types of disasters that can befall your Microsoft SharePoint Server installation, as well as techniques to protect against those disasters. Part 1 focused on how to plan for and recover from content deletion disasters; in that discussion, I made the assumption that the infrastructure was functioning. In this article, I cover what happens when the infrastructure itself fails. This type of disaster includes machine outages, such as a SharePoint server crash or a Microsoft SQL Server machine crash, as well as facility failures. I also explain how to make your SharePoint farm highly available, including measures you can take to prevent the types of disasters I discuss.

Background

Any good technical article provides some background to help frame the discussion. In Part 1, I discussed the need to have a service level agreement (SLA) in place to define your disaster recovery expectations.

You also need a well-defined recovery time objective (RTO), which is a guideline for how quickly you must get SharePoint back online after a disaster strikes. For example, your RTO might state that when SharePoint goes offline, your objective is to get it back online in 4 hours. Having a defined RTO helps you shape your disaster recovery strategy and sets your customers’ expectations. A good rule of thumb is to round up. It’s better to underpromise and overdeliver than to overpromise and underdeliver. Sometimes the key to success hinges on lowered expectations.

Another important factor in ensuring successful disaster recovery is your recovery point objective (RPO), which defines how current the data must be when it comes back online. In Part 1, I discussed RPO in the context of the point from which documents can be restored. For example, is the RPO midnight, when the backups ran? Is the RPO “no more than 2 hours old”? The RPO specifies the latest point in time to which we can recover.

To illustrate these points, it’s helpful to use an example. Suppose your RTO is 2 hours and your RPO is midnight of the previous day. If someone calls you at 1:00 p.m. to report missing content, you have until 3:00 p.m. (i.e., your RTO of 2 hours) to restore the data, which will be no older than midnight of the previous night (i.e., your RPO).

Having a well-defined RTO and RPO is imperative to planning an appropriate disaster recovery strategy. If you’re performing backups only at midnight and your RPO states that your customers will never lose more than 4 hours of work, then you have a conflict. To meet your RPO, you must increase your backup frequency, which will cost money at the very least and might also decrease performance. However, these tradeoffs are necessary to meet your objectives. In most cases, the shorter the RTO or RPO, the more money and management time it takes to achieve.
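The arithmetic behind these two objectives can be sketched in a few lines of Python. This is only an illustration of the example above: the function names and the calendar date are made up for the sketch, and it assumes the worst-case data loss equals the gap between consecutive backups.

```python
from datetime import datetime, timedelta


def restore_deadline(reported_at: datetime, rto: timedelta) -> datetime:
    """Latest time by which service must be restored to meet the RTO."""
    return reported_at + rto


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Assume worst-case data loss equals the gap between two backups."""
    return backup_interval <= rpo


# Example from the text: the call comes in at 1:00 p.m. and the RTO is 2 hours.
# The date itself is arbitrary.
reported = datetime(2011, 6, 1, 13, 0)
deadline = restore_deadline(reported, timedelta(hours=2))
print(deadline.strftime("%H:%M"))  # 15:00

# Nightly backups against a 4-hour RPO: the conflict described above.
print(meets_rpo(timedelta(hours=24), timedelta(hours=4)))  # False
```

Running the same check against each proposed backup schedule is a quick way to see whether the schedule can honor the RPO you have promised.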

Machine Failure Outages

Now that we’ve covered the basics, let’s get down to the technical aspects of disaster recovery. The very least you can do in order to recover from a machine failure is to back up your databases. As the old saying goes, “Content is king”—and if you have all your databases, you have all your content. If you’re not already performing backups, rest assured that getting started is easy. (Remote Blob Storage—RBS—presents a database backup problem; for more information, see the sidebar “Remote Blob Storage Affects Database Backup.”)

For a crash course in performing backups of all your SharePoint databases, see my blog post “Scheduling SQL backups for SharePoint.” While you’re at it, go ahead and back up your SQL Server system databases (master, model, and msdb) as well. These backups won’t take up much space, but they’ll really come in handy if you have to rebuild your SQL Server instance from scratch. Not only can you use these database backups to recover individual items, as I discussed in Part 1, but you can also use them to recover site collections, web applications, service applications, or an entire farm. Let’s walk through a few scenarios to see how.
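As a rough sketch of what a scheduled backup might run, the following Python helper generates one T-SQL BACKUP DATABASE statement per database. It is not the script from the blog post mentioned above. The database names and the backup folder are assumptions chosen for illustration, and the helper only builds the statements; you would still run them through your own SQL Server tooling.

```python
from typing import List


def backup_statement(database: str, backup_dir: str) -> str:
    """Build a full-backup statement; WITH INIT overwrites the target file."""
    safe_name = database.replace("]", "]]")  # escape for bracketed identifiers
    path = f"{backup_dir}\\{database}.bak".replace("'", "''")
    return f"BACKUP DATABASE [{safe_name}] TO DISK = N'{path}' WITH INIT;"


def backup_script(databases: List[str], backup_dir: str) -> str:
    """Join the individual statements into one script."""
    return "\n".join(backup_statement(db, backup_dir) for db in databases)


# Hypothetical names: substitute the databases your farm actually uses.
databases = ["SharePoint_Config", "WSS_Content", "master", "msdb"]
print(backup_script(databases, r"D:\Backups"))
```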

SharePoint crashes. Suppose you have a typical small farm that consists of one SQL Server machine and one SharePoint server. As a good SharePoint administrator, you make backups of all your databases each night. You come in one rainy Wednesday morning to the cries of your users that “SharePoint is down!” After getting your morning coffee, you try to browse to SharePoint and realize, lo and behold, that it’s actually down. Not only is SharePoint down, it appears that the entire server is down. You can’t connect to it via RDP, and it won’t respond to pings—it’s just plain dead. You rush into the server room and you see your SharePoint server sitting at the boot screen, unable to find a hard drive to boot from. Whichever drive subsystem you had, whether a single drive, RAID 1, or RAID 5, it’s no longer working. The server and all its contents are gone. What do you do, besides verify that your resume is on a thumb drive in your pocket?

In reality, this kind of disaster isn’t too difficult to recover from, because only your SharePoint server has crashed. Although the SharePoint server is an important cog in the SharePoint system, SQL Server is equally important, and you can take advantage of your functioning SQL Server system to get things back online quickly. You need to get a server working, either by getting a new server or fixing whatever’s broken in your existing server, then reinstall Windows and get it patched, configured, and joined back to your domain. Next, you must reinstall SharePoint. After the prerequisites are installed and the SharePoint bits are installed, you need to run the SharePoint Products Configuration Wizard.

Here’s where the real magic happens. Instead of building a new SharePoint farm, you can simply connect to an existing farm. When you’re asked which farm to connect to, point the Configuration Wizard at your existing SQL Server machine and the SharePoint configuration database that it contains. Armed with the information held in your farm’s configuration database, the newly built SharePoint server can access your existing web applications and start serving them up almost immediately. SharePoint uses timer jobs to create the environment necessary to serve up your content. Your web applications will be created in Microsoft IIS with these timer jobs. Solutions that were installed in your farm will be installed on your new server via timer jobs. After the configuration is complete, you might have a few small odds and ends to clean up, but these tasks are minimal considering that you’ve recovered from a complete server failure.

You’ll need to manually restore the following, preferably from machine-level backups and notes:

Any host headers in IIS; SharePoint creates only the header that was designated when the web application was created
Any SSL certificates used if your SharePoint site uses HTTPS
Any files that weren’t added to the SharePoint root (the 14 hive) with a feature or solution (e.g., document icons)
Any changes that were manually made to web.config files, such as when configuring Forms-based Authentication

You can probably see a pattern here. For the most part, if you didn’t make a change inside SharePoint, then SharePoint doesn’t know about the change and you’ll need to manually redo it. Again, it’s a small price to pay when recovering from a total server meltdown.

SQL Server crashes. Although a crashed SharePoint server might be the low spot of your day, it’s an easy disaster to recover from because most of what makes SharePoint so valuable—the contents—is actually stored in SQL Server. But what happens if SQL Server crashes? This type of disaster isn’t terribly difficult to recover from either, as long as you have backups. If SQL Server does crash, of course SharePoint won’t be able to serve any content to your users, nor will it be able to provide any sort of administrative interface for you either. But that’s OK, because you won’t have to focus any attention on SharePoint—your recovery efforts will be focused solely on SQL Server.

Servers crash for many reasons, so recovery methods vary. Let’s start with a storage failure. Imagine that your SQL Server system itself is fine, but the disks it stores your SharePoint databases on have failed. This could be a big expensive SAN or NAS device, or it could be a couple of internal Serial ATA (SATA) drives. SharePoint won’t work until you get those databases back online. After you’ve fixed the storage issue, restore your SharePoint databases to SQL Server. As long as you restore the databases with the same names, SharePoint should just reconnect to them. You might need to reboot your SharePoint boxes to get everything working, depending on what state the connections are in. A lot of SharePoint processes talk to SQL Server, and abruptly severing those connections can make SharePoint unhappy. A good cleansing reboot typically removes any hard feelings and gets everyone talking nicely again.

What if the storage is fine, but the SQL Server system itself crashed? Maybe a power supply failed, or even worse a motherboard went out. Again, SharePoint will be pretty understanding. If you can get the server repaired and brought back to its former state, then SharePoint will have no idea anything happened. Like before, after a cleansing reboot, all is forgotten.

But what if you can’t restore your SQL Server system to its original state? What if the OS drive is destroyed, the power supply is smoking, the flux capacitor is no longer fluxing, and you need to rebuild the server? Or what if the machine that failed was running Windows Server 2003 and SQL Server 2005? It seems counterintuitive to reinstall those programs in the year 2011. Fortunately, none of that matters to SharePoint. If your SQL Server system fails, you can replace it with a system that’s running a shiny new OS such as Windows Server 2008 R2 SP1 and the latest version of SQL Server. As long as the new SQL Server instance has the same name and the databases have the same names, SharePoint won’t care at all. Once again, a reboot of the SharePoint servers and everything is back to normal.

What if the SQL Server instance can’t have the same name? Maybe in the time since your SharePoint farm’s SQL Server system was deployed, your company invested a lot of time and money into a powerful new centralized SQL Server system. If your current SQL Server system has failed, now would be a great time to migrate everything to the new system. However, the SQL Server instance names are different, and you might be going from a default SQL Server instance (just the name of the SQL Server system) to a named instance (SQL Server system name plus instance name, such as sql01\shpt)—which further complicates things. This scenario might sound scary and impossible, but fortunately it’s not. Hidden in the SQL Server client in Windows is the ability to set SQL Server aliases. This is done at a low enough level that the applications themselves, in our case SharePoint, don’t know about it at all. They continue to think they’re talking to the same SQL Server system they always have, but in reality the SQL Server alias is sending the traffic to a different SQL Server system.

Are you confused yet? Don’t worry; it’s easier than it sounds. And once you get the hang of it, you can use this technique in situations that are less hectic than complete meltdowns. I don’t have enough space in this article to cover SQL Server aliases from start to finish, but I’ll cover the basics. If you want a more detailed description of the process, you can follow a step-by-step explanation in my blog post “Moving SharePoint to a different SQL server.”

The key point about SQL Server aliases is that they’re a client-side operation; you don’t need access to the SQL Server system at all. On each of your SharePoint servers, you set up a SQL Server alias that points to your new SQL Server system. To do so, click Start, Run and enter cliconfg. Click Add to create a new alias. SharePoint uses TCP/IP to communicate with SQL Server, so you need to select TCP/IP, as Figure 1 shows. In the Server alias text box, enter the old name of your SQL Server system. This is the SQL Server system that SharePoint is configured to communicate with. Under Connection parameters, in the Server name text box, enter the new SQL Server system’s name. Click OK to finish creating the alias. If you’re feeling generous, reboot your SharePoint servers to recreate all your SQL Server connections. If you have multiple SharePoint servers in your farm, you need to create the SQL Server alias on all of them.

Figure 1: Creating a new SQL Server alias
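If you have many SharePoint servers, clicking through the dialog box on each one gets tedious. The sketch below assumes that cliconfg records a TCP/IP alias as a string value under the ConnectTo registry key, in the form DBMSSOCN,server (optionally followed by a port), and it builds a matching reg add command. Verify that assumption on a test server before relying on it. The alias name oldsql is hypothetical, and the code only prints the command rather than changing the registry.

```python
from typing import Optional

# Assumed location of client-side aliases written by cliconfg.
CONNECT_TO_KEY = r"HKLM\SOFTWARE\Microsoft\MSSQLServer\Client\ConnectTo"


def alias_value(target: str, port: Optional[int] = None) -> str:
    """Registry data for a TCP/IP alias; DBMSSOCN names the TCP/IP library."""
    if port is None:
        return f"DBMSSOCN,{target}"
    return f"DBMSSOCN,{target},{port}"


def reg_add_command(alias: str, target: str, port: Optional[int] = None) -> str:
    """Build (but do not run) a reg.exe command that creates the alias."""
    data = alias_value(target, port)
    return f'reg add "{CONNECT_TO_KEY}" /v {alias} /t REG_SZ /d "{data}" /f'


# "oldsql" stands in for the server name SharePoint was configured with;
# sql01\shpt is the named instance used as the example in the text.
print(reg_add_command("oldsql", r"sql01\shpt"))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere and lets you review exactly what would be written on each server.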

SharePoint 2010 includes a special feature for environments that use SQL Server mirroring. (Later in this article, I cover different ways to replicate your databases in more detail and explain exactly what mirroring is.) For each database in SharePoint, you can specify a mirror server. Figure 2 shows the settings for a content database; the other databases have a similar interface. The Failover Server setting is where you specify the mirrored instance. If the main SQL Server system goes down and a mirror instance is defined, SharePoint automatically switches over to it. After you fix your primary SQL Server instance, you can switch back and reconfigure your SQL Server mirror. This lets you keep SharePoint online during trouble, as well as prevents your users from rushing your office with pitchforks and torches.

Figure 2: Content database settings

Facility failures. SharePoint can be very resilient if individual pieces fail. But what if the entire facility fails? Mother Nature is an equal opportunity destroyer. Regardless of where your servers are located, there’s a natural disaster or two waiting to take them out. Now that we know how to recover individual SharePoint pieces, we can learn how to prevent them from all failing at once.

We do have one more disaster recovery trick up our sleeves: SharePoint databases are portable between farms. I touched on this a bit in Part 1, but let’s take a minute to let it soak in. What this means is that if I have a SharePoint farm in Iowa and a SharePoint farm in Ohio, I can back up my content databases in Iowa and restore them in Ohio. Fortunately, there are no federal laws forbidding me from taking my databases across state lines. From a disaster recovery standpoint, this is priceless. Thus, if a twister rips through Iowa and my data center is taken out by a flying cow, I can quickly get my content back online by attaching backups of my databases to my farm in Ohio. I discussed this capability in Part 1, including how it can be used to recover content in another farm. However, that’s only half the battle.

With SharePoint 2010, we have an increasingly important amount of data in our service application databases. The definitions of our Business Connectivity Services connections are stored in the Business Connectivity Services database. Our term sets are stored in the Managed Metadata database. Our tags and notes are stored in the User Profile Service’s Social database. Restoring the content is good, but restoring everything is even better. Not only can we attach content databases to our recovery farm, we can also leverage some of our service application databases. That process isn’t as smooth as attaching content databases, but it’s not too bumpy. Essentially, you need to create a new service application and point it at the restored databases from your failed farm. When SharePoint creates the new service application, it will look to see if a database with the name you specified exists. If it does, SharePoint will use the existing database instead of creating a new one. This process works with the following service applications:

Managed Metadata
Business Connectivity Services
User Profile Service—Only supported for Social and Profile databases; create a new Sync database when attaching to the existing Social and Profile databases
Secure Store—You must enter the key from your old farm before the new Secure Store service application can mount your database

Some service applications (e.g., Excel Services) don’t have databases, so we don’t have to worry about them at all; other service applications (e.g., Search) have databases that don’t support being attached to a different farm.

To use a recovery farm, you must ensure that the farm is at the same build level or later than the farm the databases came from. If you’re not sure which build number matches which patch, you can consult my blog page for a list of SharePoint 2010 build numbers.
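The build-level rule is easy to check mechanically. The following sketch compares two dotted build strings field by field; the function names are made up for this example, and the two build numbers shown are used only as sample values (14.0.4763.1000 for the original release and 14.0.6029.1000 for Service Pack 1).

```python
from typing import Tuple


def parse_build(build: str) -> Tuple[int, ...]:
    """Turn '14.0.6029.1000' into (14, 0, 6029, 1000) for numeric comparison."""
    return tuple(int(part) for part in build.split("."))


def can_attach(source_build: str, recovery_build: str) -> bool:
    """The recovery farm must be at the same build as the source farm or later."""
    return parse_build(recovery_build) >= parse_build(source_build)


print(can_attach("14.0.4763.1000", "14.0.6029.1000"))  # True
print(can_attach("14.0.6029.1000", "14.0.4763.1000"))  # False
```

Comparing the parsed integers rather than the raw strings matters, because a plain string comparison would rank a build ending in 999 above one ending in 1000.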

When disaster strikes, you can take the backup copies of your service application databases and recover them to your recovery farm’s SQL Server instance. Then, you can create the service applications that correspond to your recovered databases. If the recovery farm already has a particular service application, you can either delete and recreate it or create a second instance. You also need to attach your content databases. For more information about renaming and moving databases, see the Microsoft TechNet articles “Rename or move service application databases (SharePoint Server 2010)” and “Plan for availability (SharePoint Server 2010).”

Planning Ahead

Although I discussed the necessity of getting copies of your databases to the recovery farm, I didn’t explain exactly how to do so. Any method that replicates SQL Server databases will work; the method you use depends on a lot of factors. The most important factor is probably cost. High-availability options such as mirroring require additional SQL Server licenses and time to configure. However, you get a lower RTO and RPO for that extra cost and effort, as well as better business continuity. Database mirroring is SQL Server functionality that mirrors changes made to a database on one SQL Server system to a corresponding database on another SQL Server system. SQL Server performs this function by copying the transactions from your primary instance and applying them to your mirrored instance. This process keeps your instances in step and reduces the chance that you’ll lose data if your primary SQL Server system crashes. The mirroring options you have vary by which version of SQL Server you’re running.

If your business doesn’t have the need or funds to implement mirroring, you can also use transaction log shipping to another SQL Server instance. Then, in the case of a disaster, you can restore your database and transaction log backups to your recovery SQL Server instance. After your SharePoint installation is back online, you can use a SQL Server alias to point your SharePoint farm at the new SQL Server instance. If you don’t have transaction log backups to restore, your recovery instance could just be populated with your last database backups. Of course, this all depends on your defined RPO. The point is that however you decide to back up your databases, SharePoint can work with your method.
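The order of operations when restoring from a full backup plus transaction log backups can be sketched as follows. The helper generates T-SQL in the usual sequence: restore the full backup without recovery, apply each log backup in order, then bring the database online. The file paths and database name are placeholders, and the statements are only generated here, not executed.

```python
from typing import List


def restore_chain(database: str, full_backup: str, log_backups: List[str]) -> List[str]:
    """Build RESTORE statements for a full backup followed by log backups."""
    statements = [
        f"RESTORE DATABASE [{database}] FROM DISK = N'{full_backup}' "
        "WITH NORECOVERY, REPLACE;"
    ]
    # Log backups must be applied in the order they were taken.
    for log in log_backups:
        statements.append(
            f"RESTORE LOG [{database}] FROM DISK = N'{log}' WITH NORECOVERY;"
        )
    # The final step brings the database online.
    statements.append(f"RESTORE DATABASE [{database}] WITH RECOVERY;")
    return statements


# Placeholder paths: use the files your own backup jobs produce.
for statement in restore_chain(
    "WSS_Content",
    r"E:\Restore\WSS_Content.bak",
    [r"E:\Restore\WSS_Content_0400.trn", r"E:\Restore\WSS_Content_0800.trn"],
):
    print(statement)
```

Passing an empty list of log backups yields just the full restore followed by recovery, which corresponds to populating the recovery instance from your last database backups.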

SharePoint is a complicated beast, and when things go wrong, they can go spectacularly wrong. The good news is that even if SharePoint does turn on you and try to destroy all your data, getting it back might be easier than you imagine. If you’re proactive and have good backups of your databases, as well as the discipline to test them, you’ll probably be able to recover from anything SharePoint can throw at you.
