SharePoint’s business-value proposition creates support pain for IT. Much of this pain is felt in backup and recovery, which must occur on three levels: item, site, and farm.
I'd like to offer a holistic view of SharePoint backup and restore and focus on creating and managing a sustainable, comprehensive SharePoint backup and restore solution. To create a plan that supports all three levels above, you must
- understand stakeholder requirements
- define service level agreements (SLAs)
- plan for a complete set of backup and restore components
- consider the technical architecture
- evaluate backup and restore toolsets
- create policy and process documentation
- provide operations and awareness training
- develop a test plan
- complete a proof of concept or pilot
- sign off with farm and application owners
- create a backup schedule
- develop a governance plan
- consider the backup and restore processes
To understand the requirements and expectations of a SharePoint backup and recovery plan, you must reach out to stakeholders, including people who
- use SharePoint daily, as a tool for collaboration
- run applications (or components) on top of SharePoint
- sustain SharePoint and the related infrastructure
Two crucial goals are at play: to gather requirements from the various stakeholders and to educate stakeholders and thereby proactively manage expectations. You do this by interviewing each stakeholder. To begin, ask business staff
- Is the data to be backed up directly linked to revenue generation?
- What is the cost per hour if the data is unavailable?
- If the data is lost, what is the cost to recreate it?
- If the data is lost, will the brand be affected?
- Is the data directly classed as corporate records?
- Who uses the data and how many rely on it?
- When do users access the data?
For IT staff, begin with these questions:
- Are any outsourcing contracts associated with backup and restore, or with the related infrastructure?
- Which backup and restore tools are in place? Do they support SharePoint?
- Which backup and restore infrastructure is in place?
- Which skills that relate to backup and restore are in place? What about skills that relate to SharePoint or Microsoft SQL Server?
- Are there constraints within the IT environment (e.g., network bandwidths, storage, tape libraries)?
- What are the existing backup rotation schedules and windows?
- Where are the SharePoint farms? What is their configuration? How much data is involved?
After you complete these interviews, you can document service level objectives and distribute them for review. You’ll use them next to define SLAs.
Service Level Agreements
Defining SLAs requires a mix of technical skill, financial skill, and political savvy. The technical aspects of most SLAs are well defined and provided by the various backup and restore toolset vendors. They have experienced staff and an abundance of documentation that can provide comparisons, value statements, and technical data.
The true challenge is creating a solution that addresses business expectations, financing (i.e., what is being requested versus what you can afford), and environmental realities (e.g., infrastructure readiness, SharePoint customizations).
In your SLAs, you state the facts regarding the backup and restore service:
- what will and what won’t be backed up, and why (think recovery time objective—RTO, and recovery point objective—RPO)
- when data will be backed up, as well as any performance, change control, or administration implications
- data restore performance and administration implications
- backup speed performance related to capacity plans
- IT, site administrator, and end-user responsibilities
- process for provisioning backup and restore
- process for recovering data
SLAs must be publicized and reviewed on a regular basis to manage expectations. Also, when you provision new farms, get business and IT stakeholders to physically sign off on their understanding of SLAs that apply to those farms.
Backup and Restore Components
A successful SharePoint backup and restore solution also includes cost, people, process, and policy to make sure that it meets expectations and is sustainable. These topics usually present the most complex or unforeseen challenges.
Backing up and restoring a SharePoint farm is a complex task. For example, to restore a farm from scratch you must rebuild the server or servers, load Windows Server, load SQL Server, then load SharePoint.
Then you need to apply service packs, cumulative updates, and customizations, and think of all the reboots involved during the build. (See the sidebar “Slipstreaming and WSPs” for a suggestion for customized farms.)
Backup architecture generally consists of the SharePoint farm (and backup agents installed on the web front ends—WFEs), a staging farm (usually a single server), storage (a location for disk backups), and tape backup systems.
From a storage perspective, I suggest that you plan the space you require based on the total size of your farm databases, then add a safety margin. Also consider the impact of a staging server (usually a single server with disk space to restore the databases) in your data center.
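As a rough illustration of this sizing rule, here is a minimal Python sketch. The database sizes and the 25 percent margin are made-up values for the example, not recommendations, and the function name is hypothetical.

```python
def required_backup_storage_gb(database_sizes_gb, safety_margin=0.25):
    """Estimate backup disk space: total database size plus a safety margin.

    safety_margin is a fraction (0.25 means 25 percent); the default is an
    assumption for illustration only.
    """
    total_gb = sum(database_sizes_gb)
    return total_gb * (1 + safety_margin)


# Hypothetical farm database sizes, in GB.
farm_databases_gb = [120, 80, 200]

estimate = required_backup_storage_gb(farm_databases_gb)
print(f"Plan for roughly {estimate:.0f} GB of backup disk")
```

A staging server that restores databases to disk needs a similar estimate of its own, so it is worth running the calculation once per location.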
Though not specific to backup and restore architecture, your farms’ information architecture (i.e., how you provision and organize sites, site collections, and applications) is key to helping you meet SLAs, by isolating high-value data and configuring backup and restore jobs accordingly.
If high- and low-value data are combined, then meeting SLAs will be difficult because of growing backup windows and associated recovery times. Keep in mind that SharePoint-specific backup toolsets don’t have the throughput of a SQL Server backup toolset.
Another aspect of information management is archiving. Some data loses its value over time; refer to the Storage Networking Industry Association (SNIA) Data Policy model for details. Consider archival solutions that migrate such data to a low-cost repository. (Compliance-related data must be migrated to the corporate records-management system.)
Many organizations experience a 40- to 50-percent growth in data each year. Disk costs are a small component; when you factor in performance degradation, staffing, backup software, and data center costs (air, power, space), the cost of having low-value data in SharePoint and SQL Server adds up.
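To see how quickly that growth compounds, consider the following short Python sketch. The 400 GB starting size and the three-year horizon are assumptions chosen for the example; only the 40 and 50 percent growth rates come from the text above.

```python
def projected_size_gb(current_gb, annual_growth, years):
    """Project data size with compound annual growth.

    annual_growth is a fraction, e.g. 0.4 for 40 percent per year.
    """
    return current_gb * (1 + annual_growth) ** years


current_gb = 400  # hypothetical farm size today
for rate in (0.4, 0.5):
    size = projected_size_gb(current_gb, rate, years=3)
    print(f"{rate:.0%} growth: about {size:.0f} GB after 3 years")
```

Even at the lower rate, the data to be backed up more than doubles within three years, which is why backup windows that fit today may not fit later.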
Also consider the capacity and utilization of the storage that you use, and plan your performance needs (i.e., I/O operations per second, or IOPS) with an experienced SAN administrator who knows the environment well. From an operational perspective, you want the most speed possible to keep your window small and contained.
You also want to isolate operational-related traffic so that you don’t experience network-congestion problems.
Make sure to keep detailed and up-to-date documentation for your environment. Tools that help with this process are available, such as Microsoft System Center Configuration Manager (SCCM) and the free CodePlex SharePoint Documentation Generator (SPDocGen).
Backup and Restore Toolsets
Several tools are available for SharePoint backup and restore. (See Table 3 below for an overview of tool differences.)
These differences affect how you recover, the depth of recovery, and the data center footprint required. (See the sidebar “Change Control” for more information.) Microsoft also offers a comparison of its built-in tools and System Center Data Protection Manager (DPM).
Some tools require a staging farm to recover data, so you must plan for impacts to people, process, policy, and tools. Also note that backup and recovery speed appears to degrade as the level of granularity increases. For example, list-item-level backup speed has been reported at 20 GB/hr, whereas SQL Server backups run much faster.
If you have a large farm with multiple content databases, you can see that granular backups could exceed your backup window. You should also evaluate other products as part of your diligence exercise. For example, Metalogix has a SharePoint tool that lets you use a simple Windows Explorer interface to browse content databases and retrieve content.
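The arithmetic behind the backup-window concern can be sketched in a few lines of Python. The 20 GB/hr figure is the reported item-level speed mentioned above; the 400 GB farm size, the 8-hour window, and the 200 GB/hr database-level throughput are assumptions for illustration, so measure your own environment before relying on any of them.

```python
def backup_hours(data_gb, throughput_gb_per_hour):
    """Return the estimated duration of a backup job, in hours."""
    return data_gb / throughput_gb_per_hour


def fits_window(data_gb, throughput_gb_per_hour, window_hours):
    """Return True if the estimated job duration fits the backup window."""
    return backup_hours(data_gb, throughput_gb_per_hour) <= window_hours


farm_gb = 400      # hypothetical total content size
window_hours = 8   # hypothetical nightly backup window

# Item-level backup at the reported 20 GB/hr.
print(backup_hours(farm_gb, 20), fits_window(farm_gb, 20, window_hours))

# Database-level backup at an assumed 200 GB/hr.
print(backup_hours(farm_gb, 200), fits_window(farm_gb, 200, window_hours))
```

With these numbers, the granular job would need 20 hours and miss the window, while the database-level job would finish in 2 hours.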
Policy and Process Documentation
Your solution will require policy and procedural documents that operators, site administrators, and users can follow. You’ll need these documents (accompanied by training):
- How-to manual—explains how to back up and rebuild farms, and recover individual components.
- Help desk call-handling manual—explains how to handle backup and recovery requests, questions to ask, request tracking, follow-up procedures, tools to use.
- Communications plan—includes policy and instructions regarding communications with the involved parties.
- Contact list—includes media, farm owners, support, Help desk, and others (e.g., data-center personnel).
Operations and Awareness Training
After your solution's in place, people must be trained (in administration and operation) and stakeholders educated about the solution (particularly the SLAs). You also must create general awareness of the solution. I recommend the following:
- Training for operators—How to back up and restore SharePoint, how to manage related admin tasks.
- Awareness training—Staff such as stakeholders and architects need architectural information.
Follow-up sessions can reinforce key points and drive awareness. You might also create a site with information about the backup solution, such as design documents, provisioning forms, backup schedules, performance data (i.e., the speed of backup and restore), and key contacts.
Testing should include two components: initial testing of the solution in a proof of concept or pilot environment, and ongoing testing (i.e., fire drills), which should occur one or two times per year.
To test properly (and confidently), you need a documented plan that includes test scripts and the format for documenting test results. Generally, the test plan includes a list of tests, expected outcomes, and actual outcomes. Use the test plan during the proof of concept or pilot, when running end-to-end tests, and when getting stakeholders to sign off physically.
A good test plan displays thoroughness and helps build credibility with stakeholders. It should include the scenarios in Table 4 below.
When developing test cases, include any details that you want tested and confirm that the results are noted so that you can manage stakeholders’ expectations.
For example, the test cases for Web Parts and for data should include verification of metadata (column) recovery, content types, version history, and workflows, since these are important configuration changes and their absence can affect users.
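One way to keep the list of tests, expected outcomes, and actual outcomes together is a small record structure. The following Python sketch is only an illustration; the class name, fields, and sample test cases are hypothetical and are not taken from Table 4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestoreTestCase:
    """One row of a test plan: what was tested, what was expected, what happened."""
    name: str
    expected: str
    actual: Optional[str] = None  # None until the test has been run

    @property
    def status(self):
        if self.actual is None:
            return "not run"
        return "pass" if self.actual == self.expected else "fail"


def summarize(cases):
    """Count test cases by status for the test plan report."""
    counts = {"pass": 0, "fail": 0, "not run": 0}
    for case in cases:
        counts[case.status] += 1
    return counts


# Hypothetical test cases for illustration.
plan = [
    RestoreTestCase("Farm recovery", "All web applications respond",
                    "All web applications respond"),
    RestoreTestCase("Document library restore", "Version history intact",
                    "Version history missing"),
    RestoreTestCase("Workflow restore", "Workflows attached and runnable"),
]
print(summarize(plan))
```

Recording the actual outcome as free text, rather than a simple pass or fail flag, keeps the detail that stakeholders need when they review a failed test.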
Proof of Concept and Pilot
Whether you use a proof of concept, a pilot, or both, the outcome is generally the same: You prove that the solution works in your environment.
Your proof or pilot must run in your data centers, against test representations of your production systems and datasets. (For pilots, you might want to back up actual production systems.) This might seem costly, but it provides a quality check that helps ensure that your solution works without surprises.
Your proof or pilot must include
- a charter that defines the scope of the project (e.g., technology tests, process development, performance tests)
- a staffing plan that specifies operational staff, farm or application owners, and vendor technical staff
- a test plan that specifies what is being tested (e.g., farm recovery, servers, data—see Table 4)
- a physical environment plan that specifies the technology that the solution requires
The proof or pilot must also document these outcomes:
- the step-by-step backup and recovery process
- any prerequisites
- backup and restore performance
- any data loss
- a test plan report for each test
- a plan for deploying the solution into production
- a completed impact and risk assessment
When you're ready to go live with the production version of your solution, it’s good practice to have a process for onboarding each farm and application. This involves quality checks to verify that backups complete without errors, restores complete without errors, and backup and recovery times and restore points meet SLAs.
Upon recovery of each farm or application, the owner reviews the farm, based on the test plan, completing a series of tests to verify that data was restored correctly. Your tests should also check the logs and verify that the expected quantity of sites and data volumes was restored.
The more quality checks you have, the better. Each owner signs off by using a paper or electronic form.
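The check that recovery times and restore points meet SLAs can be expressed as a simple comparison against the RTO and RPO. The sketch below is a minimal Python illustration; the function name, the timestamps, and the 4-hour RTO and 24-hour RPO are assumptions for the example.

```python
from datetime import datetime, timedelta


def meets_sla(restore_duration, last_backup_time, failure_time, rto, rpo):
    """Compare one recovery test against its SLA targets.

    rto: maximum acceptable time to restore service.
    rpo: maximum acceptable gap between the last good backup and the failure.
    """
    data_loss_window = failure_time - last_backup_time
    return {
        "rto_met": restore_duration <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical onboarding test for one farm.
result = meets_sla(
    restore_duration=timedelta(hours=3),
    last_backup_time=datetime(2024, 1, 10, 2, 0),
    failure_time=datetime(2024, 1, 10, 14, 0),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=24),
)
print(result)
```

A result like this can be attached to the sign-off form so that the owner sees the measured values next to the agreed targets.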
When planning your backup schedule (see Table 5 below for an example), make sure that you can recover successfully and that the servers aren't saturated as a result of running multiple jobs.
Consider the following:
- Should you run full backups monthly or weekly? Depending on your SLAs, weekly is probably best.
- When should you run incremental backups? Daily is the norm.
- What is the duration of your backup jobs? You must plan backup windows to avoid overlap with other jobs (e.g., virus scans), which could degrade performance or even cause outages.
The best approach is to list all jobs that will run, document their duration and the load they place on servers, and map out a visual schedule. With this, you can monitor the jobs for successful completion, increases in duration, and exceptions.
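Once the jobs and their durations are listed, overlaps can be detected mechanically. The following Python sketch assumes that all jobs run on the same server and that start times are expressed in hours from a common reference point (for example, midnight); the job names and times are hypothetical.

```python
from itertools import combinations


def overlapping_jobs(jobs):
    """Return pairs of job names whose time ranges overlap.

    jobs: list of (name, start_hour, duration_hours) tuples, with start_hour
    measured from a shared reference point.
    """
    overlaps = []
    for (name_a, start_a, dur_a), (name_b, start_b, dur_b) in combinations(jobs, 2):
        # Two ranges overlap when each one starts before the other ends.
        if start_a < start_b + dur_b and start_b < start_a + dur_a:
            overlaps.append((name_a, name_b))
    return overlaps


# Hypothetical nightly schedule.
schedule = [
    ("incremental backup", 1.0, 2.0),  # 01:00 to 03:00
    ("virus scan", 2.5, 1.0),          # 02:30 to 03:30
    ("search crawl", 4.0, 1.0),        # 04:00 to 05:00
]
print(overlapping_jobs(schedule))
```

Rerunning the check with measured durations, rather than planned ones, shows when a job that has grown over time starts to collide with its neighbors.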
Backup and recovery need tools, process, policy, and staffing to function properly. Non-IT staff tend to oversimplify technical aspects, while IT staff tend to complicate them.
Governance creates a forum, letting the organization work through requirements and issues toward consensus.
A governance plan should designate an executive decision maker; stakeholders from business and IT groups; tools for tracking issues, discussion topics, and decisions; a decision framework; and a communications plan.
Backup and Restore Processes
After preparation come processes. For backup, consider everything that you need to restore SharePoint. For recovery, consider what you need to recover SharePoint and the data it contains.
Are you responsible for Windows Server recovery or is another party? Often SharePoint backup and recovery toolsets require servers to be loaded with Windows Server and joined to the domain. If you rely on another party, work with them to obtain specifics regarding SLAs and other details.
Since the actual step-by-step backup and restore processes depend on the toolset used, Table 6 below shows just general steps for recovery.
To safeguard against loss from a catastrophic event, keep duplicate copies of backups in a separate location from the servers. Also, set a retrieval process in place, communicate it through training, and test it.
As a best practice, keep three copies of the backup media, and keep at least one copy off site in a controlled environment.
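A rule like this is easy to verify from an inventory of backup copies. The Python sketch below is a minimal illustration; the inventory format, location names, and function name are assumptions for the example.

```python
def copy_policy_issues(copies, min_copies=3, min_offsite=1):
    """Return a list of policy violations for one backup set.

    copies: list of dicts with a "location" string and an "offsite" boolean.
    """
    issues = []
    if len(copies) < min_copies:
        issues.append(f"only {len(copies)} of {min_copies} required copies")
    offsite = sum(1 for copy in copies if copy["offsite"])
    if offsite < min_offsite:
        issues.append(f"only {offsite} of {min_offsite} required off-site copies")
    return issues


# Hypothetical inventory for one weekly full backup.
weekly_full = [
    {"location": "data center disk", "offsite": False},
    {"location": "data center tape", "offsite": False},
]
print(copy_policy_issues(weekly_full))
```

In this example the backup set fails on both counts: it has two copies instead of three, and none of them is off site.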
Keys to Success
The key is to match business needs and expectations with your budget. In addition, review the solution SLAs with key stakeholders on a regular basis, since needs are always in flux.
Slipstreaming and WSPs
If you have customized your farm, you might want to slipstream your SharePoint installation. The blog post "Slipstreaming SP2 into SharePoint Server 2007" describes the process for SharePoint 2007. Also, if you have customizations, make sure that they are packaged in SharePoint Solution Packages (WSPs). Hopefully, your developers created WSPs to automate installation, but if not, the blog post "Creating a SharePoint Solution Package (.wsp) in 5 steps" covers the process.
Change Control
Depending on the version of SharePoint and the backup and restore product you choose, some products (e.g., HP Data Protector, Microsoft DPM for 2007) require a staging farm to retrieve content. In this case, your staging farm must be in sync with your production farm, from a feature perspective. If you omit this key set of activities, recovery will fail because recovered sites will expect the presence of features (and any other custom code) that won’t be there.
In this case, your change control process must be augmented so that any changes that are made to your production farm are mirrored to the staging farm. Products such as AvePoint's DocAve don’t require a staging farm; rather, such products restore in place thanks to proprietary technology (at the sacrifice of speed).
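As part of that change control process, you can compare the features deployed to each farm and flag anything that is missing on staging. The Python sketch below assumes you have already exported the list of deployed features or solutions from each farm by whatever means your toolset provides; the feature names are hypothetical.

```python
def missing_on_staging(production_features, staging_features):
    """Return features present in production but absent from staging."""
    return sorted(set(production_features) - set(staging_features))


# Hypothetical feature inventories exported from each farm.
production = ["CustomBranding", "ExpenseWorkflow", "PublishingSite"]
staging = ["PublishingSite", "CustomBranding"]

print(missing_on_staging(production, staging))
```

An empty result before each recovery test is a quick sign that the staging farm is ready to receive restored sites.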