Best practice evolves as knowledge transfers from Office 365 to on-premises Exchange

If you attended the recent Microsoft Exchange Conference (MEC) in Austin and focused on the high availability sessions, you might have come away with a strong impression that Microsoft has changed its view on two aspects of what has been well-accepted best practice for Exchange deployment until now. Of course, best practice is never static: it flexes and evolves in light of experience and changes in software and hardware, so it’s hardly surprising that the first Exchange conference in eighteen months might challenge some established ideas.

The first change is in the number of Network Interface Cards (NICs) recommended for servers that are members of Database Availability Groups (DAGs). When Exchange 2010 was under development, Microsoft’s original stance was that a server could only be a member of a DAG if it had more than one NIC. Customer pushback resulted in a decision to allow Exchange 2010 servers to be in a DAG even if they had only one NIC, a welcome step because it made it so much easier to build test servers. Even so, most system designers continued to specify dual NICs on the basis that this provided servers with redundancy against network failure (but only for the MAPI or client-facing network). Multiple NICs also allowed administrators to isolate the replication traffic (log shipping and database seeding) from client traffic, something that was useful at a time when a single NIC might be overwhelmed by replication traffic and so impact responsiveness to clients.

It’s certainly true that Exchange will use multiple NICs if they are present. On dual-NIC servers one NIC will be used for the client network (the one registered in DNS and available to clients) while the other (an internal network) will be used for log shipping. Dual NICs allow for a certain level of extra redundancy but only if they are connected to separate physical networks. If not, then a network failure will affect both NICs and redundancy goes out the proverbial window.

And here’s another truth that influences the debate: the software has gotten a lot smarter about database failovers, with changes made in Exchange 2013 to enable databases to come online faster plus advances in features such as Safety Net to guard against losing messages in transit during outages. A database failover is no longer a cause of great concern because we have become accustomed to these events, use better hardware, and know how to manage database transitions and load across DAGs. It’s also true that wider use of 10 Gigabit NICs makes it less of a requirement to have multiple NICs to handle network loads. Overall, it doesn’t really matter as much as it used to if a NIC has a glitch that causes Exchange to fail over some databases. The same solution (database failover) is used for any other failure that might affect a DAG member.

Simplicity offers many benefits to IT operations. Keeping servers as simple as possible makes them easier to deploy and manage. This is a principle used in server design in high-end deployments such as Office 365, and my bet is that Microsoft’s changing opinion about NICs comes from experience gained within the service, where the need to manage multiple networks on the 100,000 Exchange servers that deliver Exchange Online would increase the complexity of system design, monitoring, and analysis.

In an on-premises context you have to make your own mind up whether additional NICs deliver any advantage. It could be the case that you have a well-founded reason for equipping servers with extra NICs, such as using a dedicated NIC to ensure that backups can be taken in a certain period without affecting other traffic. Ask yourself if the extra hardware will actually make a server more resilient and if so, under what exact circumstances? Will the extra NIC (or two) enable higher uptime? How many of the failures leading to Exchange outages in the last year were caused by network components and would you have had more if servers had only one NIC? It’s an interesting exercise to justify the extra cost of the additional NICs.
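One rough way to answer the question about network-related outages is to tally last year's incidents by root cause. The sketch below assumes you keep some kind of incident log; the `incidents` records, the cause labels, and the `share_by_cause` helper are all hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

# Hypothetical incident log: (date, root-cause category).
# Replace these made-up records with data from your own outage reports.
incidents = [
    ("2014-01-12", "storage"),
    ("2014-02-03", "network"),
    ("2014-02-27", "software"),
    ("2014-03-15", "storage"),
    ("2014-04-01", "network"),
]


def share_by_cause(records):
    """Return {cause: fraction of all incidents} for the given records."""
    counts = Counter(cause for _, cause in records)
    total = sum(counts.values())
    if total == 0:
        return {}  # no incidents recorded, nothing to apportion
    return {cause: n / total for cause, n in counts.items()}


shares = share_by_cause(incidents)
network_share = shares.get("network", 0.0)
print(f"Network-related outages: {network_share:.0%}")
```

If the network share is small, and a second NIC would not have prevented those particular incidents anyway, the case for the extra hardware gets harder to make.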

The second change is an increased focus on lagged database copies. I was no fan of lagged copies in Exchange 2010 and said so many times, possibly becoming rather like a scratched record on the topic. But it was early days and Exchange 2013 is a different beast, and so is the amount of experience that we collectively possess about Exchange outages and fixes.

A lagged database copy is kept at a certain time distance from the active database and normal copies, with the intention that it can be brought online if something radical affects the active database and the other copies, such as massive data corruption caused by hardware. Given the state of hardware and monitoring today, these events are relatively rare, but they do happen, and if you don’t have a lagged database copy, you have to restore from backup.
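The "time distance" idea can be put as a simple check: a lagged copy only helps if the problem is noticed before the lag window expires. The sketch below is a simplified model, not anything Exchange itself runs; the function name and the seven-day lag are assumptions chosen for illustration, and it ignores the time needed to react and stop the lagged copy from replaying further logs.

```python
from datetime import datetime, timedelta


def lagged_copy_is_clean(corrupted_at, detected_at, replay_lag):
    """Return True if the lagged copy has not yet replayed the bad logs.

    At detection time the lagged copy has replayed transaction logs only up
    to (detected_at - replay_lag), so it is still clean when the corruption
    happened after that point.
    """
    return detected_at - corrupted_at < replay_lag


# Assumed lag for the example; pick whatever suits your recovery objectives.
lag = timedelta(days=7)
corrupted = datetime(2014, 4, 1, 9, 0)

# Noticed two days later: the lagged copy is still behind the corruption.
print(lagged_copy_is_clean(corrupted, datetime(2014, 4, 3, 9, 0), lag))

# Noticed ten days later: the bad logs have already been replayed.
print(lagged_copy_is_clean(corrupted, datetime(2014, 4, 11, 9, 0), lag))
```

The point of the model is that the lag you configure has to be longer than the time your monitoring realistically takes to spot a problem.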

Microsoft doesn’t use backups in Office 365 as it would be impossible to back up so many servers. Instead, they depend on lagged database copies to protect against corruption. And as in many cases where the needs of the service have driven improvement in Exchange, lagged database copies are now easier to deal with. As such, they’ve become a viable alternative to traditional backups. That is, if your audit and legal departments are happy for you to run Exchange without backups.

Don’t rush to embrace lagged database copies without engaging your brain. Operational processes and procedures have to be changed to accommodate the use of lagged copies and detailed (and well-tested) steps have to be provided to describe how to restore from a lagged copy. Take the TechNet instructions and tweak them for your environment and all should be well.

People sometimes get upset about the way that Office 365 is obviously driving innovation in the Exchange space at present. But I see more and more ideas, technology, and knowledge being transferred from the service to on-premises customers. Not everything flowing from the service can be easily applied (or is even valid) for on-premises deployments, but it does force you to think.

Follow Tony @12Knocksinna
