SharePoint server architecture in both Microsoft Office SharePoint Server (MOSS) and Windows SharePoint Services (WSS) lets you create a robust, fault-tolerant, and highly available SharePoint farm designed to survive the loss of any one component. But it’s not readily obvious how to do this out of the box, and some of the available guidance doesn’t cover all availability concepts.
To further complicate things, there’s a great deal of confusion about the difference between disaster recovery and high availability. High availability generally refers to the concept of keeping an application or service running and available for use in the event of a failure of part of the infrastructure, while disaster recovery refers to a process of recovering an environment that has already failed.
Because this article focuses specifically on high availability, let’s look at SharePoint high-availability concepts first, then at some prescriptive guidance for making components in a SharePoint farm fully redundant and highly available.
Understanding SharePoint Server Role Availability
The base architectural component in a SharePoint environment is the SharePoint farm, composed of multiple servers that work together to store content and display it for end users. Each server in the farm can hold one or more server roles that determine what job the server plays in the farm topology.
For example, the web role utilizes Internet Information Services (IIS) to display content for users, while the index role is responsible for indexing content so that it can be made available for search. To gain a full understanding of SharePoint high availability, let’s examine each role and how it works.
Database Role Availability
The database server role, which uses Microsoft SQL Server 2008 or 2005 to house crucial SharePoint databases, can be made highly available by traditional Microsoft Cluster Service (MSCS) failover clustering. If a cluster node were to fail, the second node in the cluster would take over the database role seamlessly.
Clustering is a complex topic, but to simplify, all nodes in a particular cluster have direct access to a shared storage location (such as a SAN disk volume) where the databases are stored and can constantly communicate with each other to take over in the event of an outage. SQL Server 2008 running on Windows Server 2008 is highly recommended as it has the most functional, easy-to-configure clustering options.
A strong SQL Server recommendation for a SharePoint environment is to have SharePoint servers connect to a DNS CNAME record, a SQL Server client alias, or a combination of the two, rather than to the actual name of the SQL Server server or the cluster. This gives you the flexibility to move SharePoint databases to another SQL Server instance in the event of an outage or for general housekeeping.
By using an alias name to connect to (e.g., spsql.companyabc.com), admins can save themselves the headache of having to go through Microsoft’s documented procedure for moving to a new SQL Server instance, which involves a command-line operation (stsadm -o renameserver) and a full reindex.
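As a rough illustration of the alias approach, the following Python sketch builds the text of a .reg file that defines a SQL Server client alias on a SharePoint server. It assumes that aliases are stored under the ConnectTo registry key written by the SQL Server Client Network Utility (cliconfg.exe) and that the DBMSSOCN prefix selects the TCP/IP library. The target name sqlnode1.companyabc.com and port 1433 are hypothetical values chosen for the example. Treat it as a sketch to verify against your own servers, not as a documented procedure.

```python
# Sketch only: builds .reg text for a SQL Server client alias.
# Assumption: aliases live under the ConnectTo key written by cliconfg.exe,
# and "DBMSSOCN" selects the TCP/IP network library.
CONNECT_TO_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\Client\ConnectTo"
)


def build_alias_reg(alias: str, target_server: str, port: int = 1433) -> str:
    """Return .reg file text that points `alias` at `target_server`."""
    if not alias or not target_server:
        raise ValueError("alias and target_server are required")
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    value = f"DBMSSOCN,{target_server},{port}"
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{CONNECT_TO_KEY}]\n"
        f'"{alias}"="{value}"\n'
    )


if __name__ == "__main__":
    # The alias name comes from the article; the node name is hypothetical.
    print(build_alias_reg("spsql.companyabc.com", "sqlnode1.companyabc.com"))
```

Repointing the farm at a different SQL Server instance then amounts to regenerating the file with a new target and importing it on each SharePoint server, while the name SharePoint connects to stays the same.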
Web Role Availability
To achieve high availability of the SharePoint web role, load-balance the traffic sent to multiple web role servers by using a hardware load balancer or Windows Network Load Balancing (NLB). Load-balanced web role servers share virtual IP addresses (VIPs) so that, in the event of a failure, the traffic sent to the VIP is sent to an available host.
A few caveats exist with NLB for use with SharePoint, however. First and foremost, be sure to enable site affinity, also known as “stickiness,” which forces users to use a single server for their session, unless that server is down. This reduces issues caused when a client’s session is sent from one server to the next.
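The effect of affinity can be pictured with a small Python sketch. This is not how NLB or any hardware load balancer is implemented; it is a simplified model, with hypothetical server names and client address, showing that a given client keeps landing on the same web server until that server is marked down.

```python
import hashlib
from typing import Dict, Optional


def pick_server(client_ip: str, servers: Dict[str, bool]) -> Optional[str]:
    """Map a client to one healthy server, keeping the mapping stable.

    `servers` maps a server name to True (healthy) or False (down).
    Returns None when no healthy server is left.
    """
    names = sorted(servers)
    if not names:
        return None
    # Hash the client address so the same client always starts at the same host.
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    start = int.from_bytes(digest[:4], "big") % len(names)
    # Walk forward from the preferred host until a healthy one is found.
    for offset in range(len(names)):
        candidate = names[(start + offset) % len(names)]
        if servers[candidate]:
            return candidate
    return None


if __name__ == "__main__":
    farm = {"web1": True, "web2": True}  # hypothetical web role servers
    print(pick_server("10.0.0.25", farm))
```

Because the choice depends only on the client address and the list of servers, repeated requests from one client stay on one host, and they move only when that host fails.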
If using software NLB, be aware of two caveats associated with the type of NLB configured. With multicast NLB, routers must be specially configured or the packets will be dropped. Unicast NLB doesn’t require this special configuration but does require a dedicated NIC for the intra-array traffic. The servers communicate heartbeat information to each other across the dedicated NIC, which can reside on the same network as the standard NIC.
Query Role Availability
The query role provides search results that are pulled from the full-text index used by SharePoint Enterprise Search. Multiple query role servers can be utilized in a farm, and referrals to them for searches are made directly from the web role servers.
What this means is that query role servers don’t need a technology such as NLB to be made redundant; instead, simply having more than one query role server allows for search functionality to be made highly available. One caveat associated with the query role is that it can’t be made highly available if it resides on the same SharePoint server as the index role component.
In other words, if you place the two roles on the same server, then SharePoint will no longer propagate a copy of the index to any other location, even if you try to make another system a query server. The only way to effectively make Search highly available is to deploy a dedicated index server, then add the query role to at least two other servers so that the index will be propagated and will remain available in the event of an outage.
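The rule described above can be written down as a quick topology check. The Python sketch below is only a planning aid under the assumptions stated in this section: it models a farm as a hypothetical mapping of server names to role names and does not query SharePoint in any way.

```python
from typing import Dict, Set


def search_is_highly_available(farm: Dict[str, Set[str]]) -> bool:
    """Apply the article's rule to a farm described as server -> set of roles.

    Search counts as highly available only when the index role sits on a
    dedicated server (no query role beside it) and at least two other
    servers hold the query role.
    """
    index_servers = [name for name, roles in farm.items() if "index" in roles]
    # Only query servers that do not also host the index role receive a
    # propagated copy of the index.
    query_servers = [
        name
        for name, roles in farm.items()
        if "query" in roles and "index" not in roles
    ]
    index_is_dedicated = all("query" not in farm[name] for name in index_servers)
    return bool(index_servers) and index_is_dedicated and len(query_servers) >= 2


if __name__ == "__main__":
    # Hypothetical server names for illustration.
    farm = {
        "web1": {"web", "query"},
        "web2": {"web", "query"},
        "index1": {"index"},
    }
    print(search_is_highly_available(farm))
```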
Index Role Availability
The index role is the only SharePoint role that can’t be made highly available, but since the loss of index functionality isn’t immediately noticeable, this might not be an issue. If the index server is down, Search will still work as long as there are available query servers in the farm.
The only noticeable effect would be that new items added to SharePoint or other content sources wouldn’t show up in search results until the index server was rebuilt or recovered and indexing continued.
SharePoint Central Admin Role Availability
One commonly overlooked role from an availability perspective is the SharePoint Central Admin role, which can be easily made highly available but often is not. Central Admin, which is used to administer SharePoint, is simply a SharePoint web application that’s connected to a dedicated site collection in a dedicated SharePoint content database.
You can make it highly available in the same way that you would make any other web application redundant in a SharePoint environment. Unfortunately, Microsoft doesn’t make this obvious, but the high-level steps involved in making the tool redundant include the following:
1. Turn on the SharePoint Central Admin role for a second server in the farm, typically a second load-balanced web role server.
2. Change the registry setting on SharePoint servers that defines which address to use for Central Admin: in this example, a load-balanced Fully Qualified Domain Name (FQDN) of http://spca.companyabc.com:8888. This will also change the default address that the local SharePoint server uses when clicking on the link to start Central Admin.
The registry setting for this example is as follows: HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\WSS\CentralAdministrationURL (REG_SZ) = http://spca.companyabc.com:8888
3. Change your default Alternate Access Mapping (AAM) for the SharePoint Central Admin web application to http://spca.companyabc.com:8888.
4. Add a DNS “A” record that points spca.companyabc.com to a load-balanced IP that corresponds to both SharePoint servers (either hardware- or software-based NLB will work).
Note that in addition to load-balancing Central Admin, you can also enable SSL encryption and Kerberos authentication, and assign a standard port (443) for the HTTPS traffic. Microsoft not only supports these configuration changes but also recommends them for security and availability.
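Steps 2 through 4 all depend on the same load-balanced URL, so it can help to derive the three settings from one value and avoid typing the address three times. The Python sketch below does only that. The registry key and value name are the ones given in this article, the dictionary field names are made up for the example, and nothing in it touches a live server.

```python
from urllib.parse import urlsplit

# Registry location quoted in the article for MOSS 2007 / WSS 3.0.
CA_REG_KEY = r"HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\WSS"
CA_REG_VALUE = "CentralAdministrationURL"


def central_admin_settings(url: str) -> dict:
    """Derive the registry value, default AAM URL, and DNS host from one URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("expected a full http:// or https:// URL")
    default_port = 443 if parts.scheme == "https" else 80
    return {
        "registry_key": CA_REG_KEY,
        "registry_value_name": CA_REG_VALUE,
        "registry_value_data": url,             # step 2
        "default_aam_url": url,                 # step 3
        "dns_a_record_host": parts.hostname,    # step 4
        "port": parts.port or default_port,
    }


if __name__ == "__main__":
    for name, value in central_admin_settings(
        "http://spca.companyabc.com:8888"
    ).items():
        print(f"{name}: {value}")
```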
SQL Server Database Mirroring High-Availability Options
In addition to traditional clustering, the database role can also take advantage of SQL Server database mirroring and log shipping to make mirrored copies of SharePoint databases on another SQL Server instance.
While often used to provide for disaster recovery of SharePoint content, one form of database mirroring, known as synchronous mirroring, can be used for high availability of the databases in a SharePoint farm. In this scenario, SharePoint databases are synchronously mirrored from a principal SQL Server server to a mirror server, while a third server, the witness server, stands by, waiting to fail over the databases to the mirror server in the event of an outage.
Database mirroring is supported in SQL Server 2005 SP1 and greater, including SQL Server 2008. High-protection database mirroring is available with both the Standard and Enterprise editions of SQL Server, whereas the high-performance option is only available with the Enterprise edition. SQL Server database mirroring can be set up in three ways depending on specific needs, available bandwidth between servers, and the SQL Server version used:
High protection—With high protection, all SharePoint databases can be synchronously mirrored to a second SQL Server instance and made available in the event of an outage of the principal server. Failover isn’t automatic with this model, so it’s not a true high-availability solution.
High availability—The only database-mirroring option that provides high availability for SharePoint, this option performs synchronous mirroring and also allows for automatic failover of the databases to the mirror server with the addition of a witness server. This option provides high availability of SharePoint content when used in conjunction with a SQL Server alias configured on the SharePoint servers and is available with SQL Server 2005 Standard and Enterprise editions.
High performance—The high-performance option is available only with SQL Server Enterprise Edition and uses asynchronous mirroring, which doesn’t wait for the data to be written into the mirrored server before it’s committed. While this can result in data loss, it’s the only scenario that’s feasible if the mirrored SQL Server instance is located across a WAN link with high latency or low bandwidth.
The only databases that asynchronous mirroring supports are the SharePoint content databases, which limits this option to a disaster-recovery–only solution. Failover using the high-availability option is handled by the witness server, which automatically senses the failure of the principal server and enables the mirrored versions of the databases. Since SharePoint is not mirroring-aware, the witness server must subsequently act to modify the SQL Server client alias on the SharePoint servers to point them to the new SQL Server location.
The high-availability option can be used for local failover scenarios, where both the principal and mirror servers are in the same datacenter, or it can be used in remote failover datacenter scenarios, such as the one illustrated in Figure 1, but only if there is very low latency (less than 1 millisecond) and very high bandwidth (1Gbps or greater) between the sites. These scenarios are discussed in more detail in the Microsoft whitepaper "Using database mirroring."
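The choice among the three mirroring options can be summarized as a small decision function. The Python sketch below encodes only the criteria mentioned in this section (link latency and bandwidth, the presence of a witness, and the SQL Server edition). It is a simplification for planning discussions, it applies the remote-datacenter thresholds to every synchronous case, and the function and parameter names are hypothetical.

```python
def choose_mirroring_mode(
    latency_ms: float,
    bandwidth_gbps: float,
    witness_available: bool,
    enterprise_edition: bool,
) -> str:
    """Pick a mirroring option using the criteria described in the article."""
    # Synchronous mirroring is only considered on a fast, low-latency link.
    link_supports_sync = latency_ms < 1.0 and bandwidth_gbps >= 1.0
    if link_supports_sync and witness_available:
        return "high availability"   # synchronous, automatic failover
    if link_supports_sync:
        return "high protection"     # synchronous, manual failover
    if enterprise_edition:
        return "high performance"    # asynchronous, disaster recovery only
    return "none"                    # no mirroring option fits these inputs


if __name__ == "__main__":
    print(
        choose_mirroring_mode(
            latency_ms=0.4,
            bandwidth_gbps=1.0,
            witness_available=True,
            enterprise_edition=False,
        )
    )
```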
Highly Available Farm Architecture
The smallest SharePoint farm that’s fully highly available (i.e., the loss of any one server doesn’t noticeably affect clients) is a five-server farm composed of the following server roles:
• Server 1—Web/Query/Inbound Email/Central Admin #1
• Server 2—Web/Query/Inbound Email/Central Admin #2
• Server 3—Index
• Server 4—SQL Server Database Cluster Node #1
• Server 5—SQL Server Database Cluster Node #2
Because they’re load-balanced, the web/query servers continue to operate for web requests, inbound email to document libraries, and search queries. The SQL Server environment clustering handles failover of the database role. The index role, as mentioned earlier, can’t be made highly available, but since a failure isn’t visible to the end user, it doesn’t need to be.
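One way to sanity-check a proposed topology against this definition is to remove each server in turn and confirm that every user-facing role is still present. The Python sketch below does this for the five-server layout listed above. The server and role names are hypothetical labels, and the index role is deliberately left out of the required set, following the reasoning in this article.

```python
from typing import Dict, Set

# Roles whose loss users would notice; the index role is excluded on purpose.
REQUIRED_ROLES = {"web", "query", "inbound_email", "central_admin", "database"}

FIVE_SERVER_FARM: Dict[str, Set[str]] = {
    "server1": {"web", "query", "inbound_email", "central_admin"},
    "server2": {"web", "query", "inbound_email", "central_admin"},
    "server3": {"index"},
    "server4": {"database"},  # SQL Server cluster node 1
    "server5": {"database"},  # SQL Server cluster node 2
}


def roles_present(farm: Dict[str, Set[str]]) -> Set[str]:
    """Return the union of all roles hosted anywhere in the farm."""
    present: Set[str] = set()
    for roles in farm.values():
        present |= roles
    return present


def survives_any_single_failure(farm: Dict[str, Set[str]]) -> bool:
    """True if every required role remains after the loss of any one server."""
    if not REQUIRED_ROLES <= roles_present(farm):
        return False
    for failed in farm:
        remaining = {name: roles for name, roles in farm.items() if name != failed}
        if not REQUIRED_ROLES <= roles_present(remaining):
            return False
    return True


if __name__ == "__main__":
    print(survives_any_single_failure(FIVE_SERVER_FARM))
```

Dropping either web/query server or either database node from the model and rerunning the check shows why five servers is the minimum for this layout.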
Server Virtualization Options
Server virtualization technologies can help organizations that can’t deploy five physical servers or want to take advantage of virtualization improvements and cost savings. Microsoft fully supports MOSS running on server virtualization software that has been validated as part of the Server Virtualization Validation Program (SVVP), outlined in detail at the Microsoft support site.
This includes virtual solutions such as Windows Server 2008 Hyper-V, VMware ESX, Citrix XenServer, and many others. That said, certain SharePoint roles such as the database role aren’t the best candidates for virtualization, though with proper attention to disk infrastructure and CPU allocation, all components can be virtualized.
Virtualization provides flexibility in a SharePoint environment, allowing full high availability to be built for organizations that normally wouldn’t be able to afford it. Figure 2 illustrates an environment with two virtual hosts.
This environment lets an organization make web/query servers highly available and take advantage of the high-availability mirroring option to provide full failover between virtual hosts. This architecture has the added advantage of letting an organization deploy multiple SharePoint farms, including farms for testing and development.
Virtualization software such as VMware VMotion, Citrix XenMotion, or the soon-to-be-released Windows Server 2008 Hyper-V Live Migration lets you add an additional high-availability layer to a SharePoint environment. These products work in similar ways, automatically moving a virtual guest from a failed virtual host to another host, providing for high availability of the server session itself.
Many organizations are adding this additional layer to SharePoint high-availability solutions. For more information on virtualizing a SharePoint environment, see Microsoft’s white paper, "Virtualization of Microsoft SharePoint Products and Technologies" [downloadable PDF] and the Windows IT Pro article “Coordinate a Virtualized Environment for SharePoint.”
Third-Party Replication High-Availability Options
Some organizations have enhanced their SharePoint high-availability options by deploying third-party replication solutions that replicate SharePoint documents, lists, and libraries to multiple locations, as Figure 3 shows. By replicating content to these locations and utilizing global load balancers such as Citrix NetScalers, Cisco Content Switches, F5, and others, requests to a single SharePoint FQDN can be directed to a local copy of the content.
When changes are made to the content, the third-party software replicates them to all other farms. If a single farm fails, requests can be automatically referred to another farm within the organization, allowing for instant failover across sites. Third-party vendors that provide replication software include AvePoint, CASAHL, echoTechnology, Infonic, Syntergy, and others.
Making SharePoint Bulletproof
It’s not immediately obvious how to make SharePoint architecture highly available, but armed with the proper knowledge of SharePoint role availability and the best practices outlined in this article, SharePoint admins can design a bulletproof SharePoint environment without breaking the bank. Out-of-the-box features such as NLB, clustering, and high-availability mirroring can be combined with other high-availability solutions such as virtualization or third-party replication to meet the Service Level Agreements of any organization.