The path to native high availability for Exchange

I received quite a few notes after recent posts covering how Exchange’s storage demands have evolved over the last decade and what this means for third-party vendors who sell high-end storage. Some pointed out that the storage vendors won’t mind too much that some of their market has disappeared because Exchange now favors JBOD. After all, there are many other ways to use a SAN to good effect. SAP, for instance, just loves SAN storage.

Among the emails was a note from Greg Thiel, the éminence grise of Exchange High Availability. As such, Greg is pretty focused on all aspects of availability. He pointed out that one of the major influences on the effort to wean Exchange off SANs was the realization (in the summer of 2004) that the best model to head towards was one where databases had multiple copies, each with their own storage. Greg’s view is that once this realization happened, the only route forward led to DAS.

There’s a lot of truth here. Create databases as self-contained islands of high availability, each with its own storage, and servers become interchangeable processing units that can be brought into service as needed. In this respect, high availability means having more than two copies of a database available, because a reasonable chance exists that more than one copy will become inaccessible in a single incident.

Greg pointed out that another important step was the decision to build high availability into the product rather than depend on storage or other hardware. In his words, this created a “more natural Exchange experience”. Another way of looking at this is that building high availability into Exchange meant that the Exchange administrator could manage all aspects of the solution without having to become a clustering or storage expert. One outcome of this approach is that more people started to incorporate high availability into their messaging systems than ever before.
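To give a flavor of that native experience, here is a minimal sketch of how an administrator might build out this kind of protection entirely from the Exchange Management Shell, without touching the clustering or storage layers directly. The DAG, server, witness, and database names (DAG1, EX01–EX03, FS01, DB01) are hypothetical placeholders, not a recommended design.

# Create the DAG object (the witness server and directory are placeholders)
New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer FS01 -WitnessDirectory C:\DAG1

# Add Mailbox servers as DAG members; Exchange creates and manages the
# underlying Windows Failover Cluster on the administrator's behalf
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX01
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX02
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX03

# Add copies of a database (whose active copy lives on EX01) so that
# three copies exist in total
Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer EX02 -ActivationPreference 2
Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer EX03 -ActivationPreference 3

Everything here is ordinary Exchange cmdlets; the failover clustering and replication plumbing stays out of sight.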

To put these points in context, think of the situation with Exchange 2003. It was a world of storage groups and the STM (streaming) file. Storage groups were intended to make management easier, but they really didn’t. Wolfpack (I loved the name if not the software) clusters worked, and many TechEd sessions were given about the 7-node monster operated by Microsoft IT. But even monster clusters couldn’t support as many mailboxes as you’d imagine because of the I/O overhead. Exchange 2003 sure loved SAN storage and SAN storage loved Exchange 2003.

Exchange 2007 introduced the LCR/CCR/SCR mechanisms to start the ball rolling with database copies. Only one passive copy could be maintained and failover was problematic at times, but this was a huge step towards the Database Availability Group (DAG) in Exchange 2010.

DAGs were initially viewed as a form of black magic. Early sessions by folks like Tim McMichael at conferences such as IT/DEV Connections caused many eyes to water due to the concentration required to understand the interaction between Windows Failover Clustering and Exchange, and some of the holes into which people might fall.

But DAGs have blossomed and grown in capability as time allowed for more engineering and as knowledge of DAG operations in the field accumulated. Advances like single page patching, the IP-less DAG, and improved recoverability of lagged copies have all helped. And of course, other changes in the product, such as namespace simplification, have helped as well.
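As one illustration of those later refinements, both the IP-less DAG (available from Exchange 2013 SP1 onward) and lagged copies are driven through the same shell. A minimal sketch, again with hypothetical names (DAG2, FS01, DB01, EX04):

# A DAG without a cluster administrative access point, so no DAG IP address
# or cluster name object to look after
New-DatabaseAvailabilityGroup -Name DAG2 -WitnessServer FS01 -DatabaseAvailabilityGroupIpAddresses ([System.Net.IPAddress]::None)

# A lagged copy that replays logs seven days behind the active copy,
# useful as a point-in-time recovery option
Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer EX04 -ReplayLagTime 7.00:00:00 -ActivationPreference 4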

We’re at the point where I can’t imagine a deployment going forward without using DAGs. If your environment is too small to run a DAG, then perhaps Office 365 and Exchange Online is a better option. Thousands of DAGs run inside Exchange Online, all based on low-cost JBOD and all done without any special intelligence in the storage layer. It’s all native Exchange.

To come back to my original point, DAGs and multiple database copies would not be as cost-effective and powerful as they are today without the radical reduction in IOPS that Microsoft achieved over four software engineering cycles. It’s hard to imagine being able to justify multiple database copies on SAN storage. It’s also true that you’d need multiple SANs to be able to achieve true high availability. CIOs would love that.

The story of how Exchange weaned itself from SANs is multi-faceted. I hold to my view that the crusade to reduce IOPS represents a critical contribution to Microsoft’s ability to deliver the Exchange that is used on-premises and in Exchange Online today, but that realization in the summer of 2004 that databases should have multiple, transferable copies based on independent storage was critical too. Put together, low I/O demand, cheap storage, and integrated, easily managed high availability create a pretty potent package. Just the kind of characteristics you’d want to see in a system that needed to scale up to deal with the cloud.

Many stories have a strong main plot and some sub-plots. The journey taken to transform Exchange from on-premises software into cloud-capable software has many complex and interconnected turns. IOPS is one, high availability is another, but there are more, such as the adoption of PowerShell. All are important advances that might prosper individually but become so much more powerful when combined. Debating which was the most important technical advance in this story is terrific fun.

Follow Tony @12Knocksinna
