Exchange Server 2013 Load Balancing and Health Checks

In Exchange Server 2007, Microsoft introduced load balancing to ensure the availability of Client Access servers. In subsequent Exchange versions, the principle hasn't really changed, but Microsoft has made a lot of changes under the hood that have greatly improved the end user's experience in case of a failure. After covering the benefits of using a load balancer compared to other solutions, I'll discuss the different load-balancing scenarios in Exchange Server 2013 as well as its under-the-hood improvements.

The Allure of Load Balancers

One of the arguments for choosing a physical or virtual load balancer over other solutions such as round-robin DNS or Windows Network Load Balancing (NLB) is that load balancers offer more flexibility with regard to health checking. Round-robin DNS doesn't offer any form of health checking. Therefore, users might be redirected to a faulty server, as DNS is unaware when a server is unavailable. NLB does a better job, but it's limited to executing a PING command for each server in the array. As such, NLB's knowledge about the server is limited. If a server responds to the health check, it's considered running and therefore available for accepting requests, even though one or more services (not related to the health check) might be down.

Like NLB, load balancers execute periodic probes to verify the availability of the underlying Exchange servers. Their probes, however, are usually more advanced and are therefore a better way to determine a server's health. If a server is found to be unavailable, the server is taken out of the array and new client requests won't be forwarded to that server until it's repaired.
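To make the mechanism concrete, here's a minimal Python sketch of a server array that only hands out servers that passed the most recent probe. The class name, the method names, and the probe callable are my own illustrative assumptions; real load balancers implement this in their own way.

```python
from typing import Callable, Dict, List


class ServerPool:
    """Tracks which servers in an array passed the most recent health probe."""

    def __init__(self, servers: List[str], probe: Callable[[str], bool]):
        self.servers = list(servers)
        self.probe = probe  # hypothetical probe: returns True when the server is healthy
        self.healthy: Dict[str, bool] = {s: True for s in self.servers}
        self._next = 0

    def run_probes(self) -> None:
        """Probe every server; a failing server stays out until a later probe succeeds."""
        for server in self.servers:
            self.healthy[server] = self.probe(server)

    def pick(self) -> str:
        """Round-robin over the servers that are currently marked healthy."""
        candidates = [s for s in self.servers if self.healthy[s]]
        if not candidates:
            raise RuntimeError("no healthy servers available")
        server = candidates[self._next % len(candidates)]
        self._next += 1
        return server
```

In this sketch, run_probes would be called on a timer. New requests skip a failed server until a later probe succeeds again.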

Layer 4 or Layer 7?

The OSI model conceptually describes how network communication systems should interact by breaking the communication flow into seven distinct layers. Each layer represents a specific function. Layer 4 corresponds to the transport layer, which describes the ability to reliably transfer packets between nodes on the network. A good example of a Layer 4 protocol is TCP. Layer 7 is the application layer. It interacts with the software that communicates across the network. HTTP is a good example of a Layer 7 protocol.

In the load-balancing world, when a device operates at Layer 4, it will only be able to perform simple actions such as receiving and forwarding traffic. At Layer 7, however, a device can interact with higher-level protocols such as HTTP or SMTP, opening up a range of possibilities including the ability to read and modify URLs on the fly.

When Exchange 2013 was released, Microsoft stated that it would work with both Layer 4 and Layer 7 load balancing. This news traveled fast but also created a lot of questions. While it's true that Exchange 2013 (unlike Exchange Server 2010) works happily on Layer 4, it might not be the best solution. Let me explain why.

Load balancing on Layer 4 is, in essence, a very simple way of handling things. The load balancer makes its forwarding decisions based only on the IP address and port on which it received the client's connection request. The load balancer isn't concerned about the type of traffic passing through. For example, it doesn't care whether someone is connecting with Outlook Web App (OWA) or Outlook Anywhere.

Making things simpler is usually a good thing, as reducing complexity typically means smoother operations. However, in this particular case, load balancing on Layer 7 proves to be more robust, at least from a service availability point of view.

When operating at Layer 7, a load balancer is concerned about the type of traffic passing through. It inspects the contents of the conversations between the clients and the Exchange server and uses this information to make its forwarding decisions. As such, it can route traffic based on the virtual directory to which a client is trying to connect.

Figure 1 illustrates this scenario. When traffic hits the load balancer, it uses a different routing logic, depending on the URL the client request is sent to.

Every load balancer vendor has its own name for this feature. For example, KEMP Technologies calls it Sub-Virtual Services. The main virtual service is defined by the combination of the IP address and TCP port to which the client connects and the Sub-Virtual Services are defined by the virtual directory. Typically, a regex rule is used to detect the virtual directory to which a client connects.
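As an illustration of that kind of rule, the following Python sketch matches the request path against one regular expression per virtual directory and returns the name of the pool that should handle it. The pool names and the rule table are hypothetical and don't reflect any vendor's actual configuration syntax.

```python
import re

# Hypothetical rule table: one regex per virtual directory, mapped to a pool name.
SUB_VIRTUAL_SERVICES = [
    (re.compile(r"^/owa(/|$)", re.IGNORECASE), "owa-pool"),
    (re.compile(r"^/ecp(/|$)", re.IGNORECASE), "ecp-pool"),
    (re.compile(r"^/ews(/|$)", re.IGNORECASE), "ews-pool"),
    (re.compile(r"^/autodiscover(/|$)", re.IGNORECASE), "autodiscover-pool"),
]


def route_layer7(path: str, default: str = "default-pool") -> str:
    """Return the pool for the first rule whose regex matches the request path."""
    for pattern, pool in SUB_VIRTUAL_SERVICES:
        if pattern.match(path):
            return pool
    return default
```

The trailing (/|$) in each pattern keeps a path such as /owafoo from being mistaken for the /owa virtual directory.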

By distinguishing between the types of workloads, you can define a different health check for each type of workload, although how you do this will depend on the load balancer you're using. Having a health check for each type of workload ensures more granular control over what servers are used for what type of workload.
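The effect of per-workload health checks can be sketched in a few lines of Python. The health table below is invented sample data, and the function name is an assumption made for illustration only.

```python
from typing import Dict, List

# Invented sample data: server -> workload -> result of that workload's health check.
HEALTH: Dict[str, Dict[str, bool]] = {
    "ex1": {"owa": True, "ews": True},
    "ex2": {"owa": False, "ews": True},
}


def eligible_servers(health: Dict[str, Dict[str, bool]], workload: str) -> List[str]:
    """Return the servers whose health check for this specific workload passed."""
    return sorted(
        server for server, checks in health.items() if checks.get(workload, False)
    )
```

With this sample data, ex2 drops out of the array for OWA requests but keeps serving EWS requests.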

In contrast, in traditional Layer 4 scenarios, a single health check decides whether a server is healthy and can accept traffic. The problem with this approach is that there isn't a standard set of criteria that summarizes the health of an Exchange server. A lot of deployments check the availability of the /owa virtual directory. However, just because OWA passes the health check doesn't mean the other services on the same server are healthy. Similarly, you can ping a server to determine whether it's running, but that doesn't ensure Exchange is working. These are all valid options, but none of them is ideal. Instead of maximizing a server's capacity by basing routing decisions on the availability of Exchange, you base them on the availability of the server in general. Even worse, you're making decisions based on one piece of Exchange rather than on Exchange as a whole. As my colleague Paul Robichaux puts it, your doctor doesn't give you a single numeric score indicating whether you're healthy or not. He or she checks your circulatory, respiratory, musculoskeletal, and other systems. Any or all of those systems can be healthy or unhealthy, sometimes independent of the others.

Because Microsoft is running Exchange at an insanely large scale in Office 365, I have no doubt that it encountered this problem and therefore sought a solution. Now, before taking a look at the improvements leading to this solution, you should know there's a workaround that lets you load balance on Layer 4 and still use multiple health checks.

As mentioned previously, the limitation of load balancing on Layer 4 is that the load balancer can only make decisions based on the IP address and port on which the client's connection request was received. If you take that into account, you can create a separate namespace for each Exchange workload (e.g., owa.domain.com, ecp.domain.com, autodiscover.domain.com, ews.domain.com) and create a separate virtual service for each workload, as shown in Figure 2.

Figure 2: Creating a Separate Virtual Service for Each Workload

The downside to this approach is that you need to have a separate public IP address for each Exchange workload, assuming you want the services to be externally accessible.
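The namespace-per-workload workaround boils down to a lookup keyed on the destination IP address and port. Here's a rough Python sketch. The IP addresses come from the 192.0.2.0/24 documentation range, and both they and the health check targets are placeholders I've assumed for illustration.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class VirtualService(NamedTuple):
    namespace: str          # DNS name that resolves to this virtual service
    health_check_path: str  # placeholder for what the load balancer probes


# Placeholder addresses: one IP address per workload, all on TCP port 443.
VIRTUAL_SERVICES: Dict[Tuple[str, int], VirtualService] = {
    ("192.0.2.10", 443): VirtualService("owa.domain.com", "/owa"),
    ("192.0.2.11", 443): VirtualService("ecp.domain.com", "/ecp"),
    ("192.0.2.12", 443): VirtualService("autodiscover.domain.com", "/autodiscover"),
    ("192.0.2.13", 443): VirtualService("ews.domain.com", "/ews"),
}


def select_virtual_service(dest_ip: str, dest_port: int) -> Optional[VirtualService]:
    """Layer 4 decision: only the destination IP address and port are consulted."""
    return VIRTUAL_SERVICES.get((dest_ip, dest_port))
```

Because the lookup never inspects the request itself, each workload needs its own IP address to get its own health check, which is exactly where the cost of this workaround comes from.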

The Improvements

As mentioned previously, a load balancer can execute multiple health checks when operating on Layer 7, but that doesn't necessarily mean that each health check is a reliable way of determining whether a workload is functioning correctly. Working with load balancer vendors, Microsoft addressed this problem by introducing two new features in Exchange 2013: Managed Availability and a health check web page.

Managed Availability is Exchange's built-in monitoring platform. It consists of three key components:

Probes
Monitors
Responders

These components work closely together to test, detect, and mitigate possible problems, effectively creating a sort of self-healing system within Exchange. First, Managed Availability runs a probe. Depending on which probe is run (there are literally hundreds of different probes), it will gather information about, or execute a series of tests for, a specific component.

A monitor is then used to evaluate the results of that probe and use the gathered information to make the decision whether the component is healthy or unhealthy. If a component is deemed unhealthy, a responder will take appropriate countermeasures to bring that failed component back to a healthy state. There are many different responders that, depending on the type of failure, can take different actions, ranging from a simple service restart to a database failover or even a full server reboot. If you'd like some more background information about Managed Availability, I suggest you start by reading Tony Redmond's Exchange Unwashed Blog entry "Server heal thyself - Managed Availability and Exchange 2013."
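The interplay between the three components can be pictured with a small conceptual Python sketch. This is not how Exchange implements Managed Availability. The names, the failure threshold, and the sampling approach are all assumptions I've made purely to illustrate the probe, monitor, and responder flow.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    component: str
    succeeded: bool


def monitor(results: List[ProbeResult], failure_threshold: int) -> bool:
    """Evaluate probe results; healthy means fewer failures than the threshold."""
    failures = sum(1 for result in results if not result.succeeded)
    return failures < failure_threshold


def run_cycle(
    probe: Callable[[], ProbeResult],
    samples: int,
    failure_threshold: int,
    responder: Callable[[str], None],
) -> bool:
    """Run the probe several times, let the monitor decide, and respond if unhealthy."""
    results = [probe() for _ in range(samples)]
    healthy = monitor(results, failure_threshold)
    if not healthy:
        # The responder is a stand-in for a recovery action such as a service restart.
        responder(results[0].component)
    return healthy
```

In this sketch, the responder is only called when the monitor judges the component unhealthy, which mirrors the order in which the three components act.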

After the Managed Availability feature determines that a server is healthy, Exchange 2013 dynamically generates a web page named healthcheck.htm, which Figure 3 shows. This health check web page can be found under each virtual directory (e.g., /owa/healthcheck.htm, /ecp/healthcheck.htm).

Figure 3: Generating a Web Page Indicating the Server Is Healthy

When a server is unhealthy, the web page isn't generated and a 403 error is returned. As a result, if you point your load balancer's health check at this web page, the load balancer will consider a server unavailable only when the page isn't served, which happens only when Exchange determines a workload is unfit for duty.
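A health check against this page could look roughly like the following Python sketch. It assumes that a successfully served page comes back with HTTP status 200 and treats every other outcome (an error status, a timeout, a refused connection) as unavailable. The function names and the timeout value are my own choices, not anything Exchange or a load balancer vendor prescribes.

```python
import urllib.error
import urllib.request


def health_check_url(host: str, virtual_directory: str) -> str:
    """Build the healthcheck.htm URL for one virtual directory, e.g. /owa."""
    return f"https://{host}/{virtual_directory.strip('/')}/healthcheck.htm"


def is_healthy_status(status_code: int) -> bool:
    """Assumption: only HTTP 200 means the page was served, so the workload is healthy."""
    return status_code == 200


def probe_workload(host: str, virtual_directory: str, timeout: float = 5.0) -> bool:
    """Request the health check page; any error or non-200 status counts as unavailable."""
    url = health_check_url(host, virtual_directory)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy_status(response.status)
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses such as 403, timeouts, and connection failures.
        return False
```

For example, probe_workload("mail.domain.com", "owa") would request https://mail.domain.com/owa/healthcheck.htm, where mail.domain.com is a placeholder host name.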

The end result is that the health checking goes much further than it used to. The new health check web page truly exposes a workload's health by taking into account multiple internal health check probes performed by Managed Availability. The merit here is that Managed Availability does a far better job than anything else at determining the server's health. What's more, because health is exposed per workload, a failing workload won't necessarily take offline the workloads that are still functioning correctly.

Improvements Worth Noting

Although not the biggest changes in Exchange 2013, Managed Availability and the health check web page greatly improve how a server's health is exposed to a load balancer (or any other application for that matter). As a result, you won't necessarily lose a server as soon as a single workload is down or, even worse, have traffic routed to a server on which a given workload (e.g., OWA) is unavailable because the load balancer didn't notice the problem.
