It seems like every time Microsoft releases a new product, we all have to learn a long list of new acronyms. This is certainly true with Exchange Server 2007, which brings a slew of new features, some with jawbreaking names such as WebReady Document Viewing that cry out to have acronyms created. One new Exchange acronym that's drawing a lot of interest is CCR, the cluster continuous replication technology that lets clustered mailbox servers operate with no shared storage subsystem. It turns out that CCR is based on another, lesser-known technology: majority node set, or MNS, clustering.
If you've worked with clusters before, you know how important the integrity of the quorum resource is. The quorum is essentially a configuration database for the cluster. Each node in the cluster needs the ability to take ownership of the quorum and thus control what configuration data is changed and when. In a conventional Microsoft Cluster Server (MSCS) cluster, there's one quorum, owned by one node at a time. If the quorum resource is lost, or if a node can't access it, a variety of bad things can happen, including split-brain syndrome, where each of the remaining cluster nodes thinks it is the quorum owner.
MNS works differently from standard clustering because a copy of the quorum database is kept on each individual node. Changes to the quorum are only considered to be permanent if the change can be verified as committed to a majority of the MNS nodes. For example, in a four-node cluster, a change to the quorum will only be accepted if three of the nodes verify that the change was made to their local quorums.
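The majority rule described above comes down to simple arithmetic. The following is a minimal Python sketch of that arithmetic only; the function names `votes_needed` and `change_is_committed` are made up for illustration and are not part of any Microsoft API or of the cluster service itself.

```python
def votes_needed(total_nodes: int) -> int:
    """Return the smallest number of nodes that forms a strict majority."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_nodes // 2 + 1


def change_is_committed(acknowledged: int, total_nodes: int) -> bool:
    """Return True if enough nodes confirmed the change to their local quorum copy."""
    return acknowledged >= votes_needed(total_nodes)


if __name__ == "__main__":
    # Show how many confirmations a quorum change needs at each cluster size.
    for n in (2, 3, 4, 5):
        print(f"{n} nodes -> {votes_needed(n)} votes needed")
```

Under this rule a four-node cluster needs three confirmations, matching the example above. A two-node cluster needs both nodes, so losing either one loses the majority.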
CCR is based on MNS, but its implementation is a bit different from standard MNS clusters. MNS requires more than two nodes in the cluster, so you can implement CCR in two ways: using a three-node MNS cluster, or using two nodes and a third, uninvolved machine that acts as an auxiliary quorum member. This machine is said to hold the file share witness (FSW) role (another acronym to learn!). The FSW is a new feature introduced as a hotfix after Windows Server 2003 SP1; it essentially lets the quorum resource be copied to a computer that isn't part of a cluster, such as a Hub Transport server in the same site as a CCR cluster. The cluster nodes can update, and read from, the FSW and use it as a third "vote" for getting and setting properties of the cluster.
What does this mean for CCR implementation? MNS is a commonly used technology for larger Exchange Server clusters, but CCR is limited to one active and one passive node. You can certainly build an MNS cluster using one active and two passive nodes, but there's not really much point in doing so for Exchange CCR (although Microsoft supports doing so). The third node will essentially always remain passive, wasting a perfectly good piece of hardware.
Instead, plan on using an FSW on another machine. The FSW role is low-impact and can be configured on any computer, not just another Exchange server. If you're planning a geographically distributed CCR cluster, Microsoft recommends that you put the FSW in the same physical site as the node that will ordinarily be active. By doing so, you prevent a network failure between the two sites from shutting down the cluster. However, this configuration means that a failure of the primary site will require manual failover, because the passive node won't be able to contact the FSW to reconstitute the cluster. There's a solution, which involves using a third site to host the FSW. I'll talk more about that next week.
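The site-placement trade-off can be illustrated by counting which of the three votes each node can still reach. This is a simplified Python model, not how the cluster service actually works: the voter names, site names, and helper functions are all hypothetical, and `wan_up` treats every inter-site link as a single switch.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical layout: each voter mapped to the site that hosts it.
# Here the FSW sits in the primary site alongside the active node.
VOTERS: Dict[str, str] = {
    "active_node": "primary",
    "passive_node": "secondary",
    "fsw": "primary",
}


def has_majority(reachable: Set[str], voters: Dict[str, str] = VOTERS) -> bool:
    """Return True if the reachable voters form a strict majority of all voters."""
    return len(reachable & set(voters)) > len(voters) / 2


def view_from(
    site: str,
    failed_sites: FrozenSet[str] = frozenset(),
    wan_up: bool = True,
    voters: Dict[str, str] = VOTERS,
) -> Set[str]:
    """Return the voters that a node in `site` can still reach, itself included."""
    if site in failed_sites:
        return set()
    return {
        name
        for name, home in voters.items()
        if home not in failed_sites and (home == site or wan_up)
    }


if __name__ == "__main__":
    # Inter-site network failure: the primary site keeps two of three votes.
    print("WAN down, primary:", has_majority(view_from("primary", wan_up=False)))
    print("WAN down, secondary:", has_majority(view_from("secondary", wan_up=False)))

    # Primary site failure: the passive node holds only its own vote.
    lost = frozenset({"primary"})
    print("Primary lost, secondary:", has_majority(view_from("secondary", lost)))

    # Assumed variant: the FSW hosted in a third site instead.
    third = dict(VOTERS, fsw="third")
    reachable = view_from("secondary", lost, voters=third)
    print("Primary lost, FSW in third site:", has_majority(reachable, third))
```

In this model, a link failure leaves the primary site with two of three votes, while a primary site failure leaves the passive node with only one. Moving the FSW to a third site gives the passive node a second vote in that case, assuming its link to the third site is still up.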