Windows NT Load Balancing Service (WLBS) handles load balancing and provides redundancy in the event of server failure in a Web server farm. To provide this functionality, WLBS doesn't monitor specific ports or services. Rather, it uses periodic signals (i.e., heartbeat signals) to monitor cluster members. If a Web server fails to send a heartbeat signal, the other cluster members take over its Web service. In other words, the rollover of the Web service occurs only if a Web server shuts down. If a server's Web service stops or hangs, WLBS doesn't recognize the problem and doesn't transfer that server's Web service to the remaining cluster members. Any users connected to that server will then receive error messages.
The Microsoft article "WLBS Does Not Detect Program or Service Problems" (http://support.microsoft.com/support/kb/articles/q234/1/51.asp) recommends that you use a third-party SNMP utility to monitor the servers' Web services and use WLBS-specific commands to remove or add servers to the cluster based on the utility's results. I wrote a script, WLBS.pl, that you can use to accomplish the same monitoring task. WLBS.pl is much easier to configure and use than an SNMP utility. In addition, the script is free and has low resource overhead.
How WLBS.pl Works
A Web service can be running but unresponsive because of deleted page content, incorrect file permissions, or a hung server. WLBS.pl monitors Web service responsiveness by checking for the successful retrieval of one or more specific Web pages that the local machine serves up. As a result, you must run WLBS.pl locally on each cluster member, which means that you're running the script on the server the script is monitoring. Typically, you monitor servers from an independent server because if the server you're monitoring goes offline, the independent server can still notify you about the problem. However, in this case, if the Web server goes offline, you'll know because WLBS will roll over that server's Web services.
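To make the idea of "successful retrieval" concrete, here's a minimal sketch of a local page check. It isn't taken from WLBS.pl (which is written in Perl); it's a Python illustration, and the function names, the HTTP 200 success criterion, and the example URLs are assumptions made for this sketch.

```python
import http.client
import urllib.error
import urllib.request


def page_is_healthy(url, timeout_seconds=1.0):
    """Return True only if the URL answers with HTTP 200 within the timeout.

    Assumption: "successful retrieval" means a 200 response. Timeouts,
    refused connections, malformed URLs, and HTTP errors all count as failures.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


def all_pages_healthy(urls, timeout_seconds=1.0, check=page_is_healthy):
    """Return True only if every test page on the local server loads."""
    return all(check(url, timeout_seconds) for url in urls)


if __name__ == "__main__":
    # Hypothetical local test pages; use the server's own address,
    # not the cluster alias.
    local_urls = ["http://localhost/default.htm", "http://localhost/status.htm"]
    print(all_pages_healthy(local_urls))
```

The `check` parameter exists only so that you can exercise the logic without a live Web server.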
If a Web page fails to load, the script issues the command wlbs disable 80 from the local machine. This command stops the local server from handling cluster traffic on port 80, which effectively removes the server from the Web cluster. The script then sends a notification message. The script can send an email message, a pager message through an email-enabled paging service vendor, or an NT messenger-service message through the Net Send command. The server remains disabled until you correct the problem.
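The failure path described above can be sketched as follows. This is a Python illustration of the sequence (test, disable, notify), not the code in WLBS.pl. The only detail taken from the article is the command wlbs disable 80; the function name, the pluggable `check`, `run_command`, and `notify` callables, and the message text are assumptions.

```python
import subprocess

# The command quoted in the article; everything else here is illustrative.
DISABLE_COMMAND = ["wlbs", "disable", "80"]


def handle_cycle(urls, check, run_command=subprocess.call, notify=print):
    """Run one test cycle; disable the server and notify on any failure.

    check(url) returns True when the page loads. run_command and notify are
    injectable so that the flow can be tried without WLBS installed.
    Returns True if the server was disabled during this cycle.
    """
    failed = [url for url in urls if not check(url)]
    if not failed:
        return False
    exit_code = run_command(DISABLE_COMMAND)
    notify(
        "Web check failed for %s; ran '%s' (exit code %s)"
        % (", ".join(failed), " ".join(DISABLE_COMMAND), exit_code)
    )
    return True


if __name__ == "__main__":
    # Dry run with stand-ins: one page "fails", and the command is only printed.
    handle_cycle(
        ["http://localhost/a.htm", "http://localhost/b.htm"],
        check=lambda url: url.endswith("a.htm"),
        run_command=lambda command: print("would run:", command) or 0,
    )
```

In a real deployment, `notify` would wrap whichever channel you use (email, pager gateway, or Net Send).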
You set the test cycle frequency (i.e., the frequency with which the script tests for Web service responsiveness) by making two adjustments. First, you use the Task Scheduler to schedule how often you want the script to run. Then, in the script, you set the number of test cycles per minute. You need to determine the test cycle frequency based on the acceptable delay for failover in your environment. A good starting point is setting the Task Scheduler to run at 1-minute intervals and setting the script to run 10 test cycles per minute so that the script runs the test loop every 6 seconds. This test cycle frequency provides quick failover. Because the script is in Perl, the CPU overhead is quite low. However, frequent scripted page requests can inflate your Web hit metrics.
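The arithmetic behind the 6-second figure is simply 60 seconds divided by the number of test cycles per minute. The following Python sketch shows that calculation and one way a scheduled run could space its cycles. It's an illustration under simplifying assumptions: the function names are hypothetical, and the sketch ignores the time each test itself takes.

```python
import time


def seconds_between_cycles(cycles_per_minute):
    """Return the pause between test cycles, e.g. 10 per minute -> 6.0 seconds."""
    return 60.0 / cycles_per_minute


def run_for_one_minute(cycles_per_minute, run_cycle, sleep=time.sleep):
    """Call run_cycle() the requested number of times, pausing in between.

    The scheduler is assumed to start a fresh run every minute, so there is
    no pause after the last cycle.
    """
    interval = seconds_between_cycles(cycles_per_minute)
    for cycle in range(cycles_per_minute):
        run_cycle()
        if cycle < cycles_per_minute - 1:
            sleep(interval)


if __name__ == "__main__":
    for rate in (1, 2, 3, 4, 5, 6, 10):
        print(rate, "cycles/minute ->", seconds_between_cycles(rate), "seconds apart")
```

With these assumptions, a failure goes unnoticed for at most roughly one interval plus the time the script waits for the page to respond.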
How to Use WLBS.pl
Listing 1, page 6, contains an excerpt from WLBS.pl. You can find the entire script in the Code Library on the Win32 Scripting Journal Web site (http://www.win32scripting.com/). The script includes comments to help you understand the code. I tested this script on servers running Windows 2000 and on servers running NT 4.0 with Service Pack 5 (SP5) and SP6. Here are the steps to get the script working on your machines:
- Install ActivePerl, build 522 (available at http://www.activestate.com/), and the Mail::Sendmail module (available at http://www.cpan.org/) on all WLBS cluster members.
- Copy WLBS.pl onto each server in the cluster.
- Use the Task Scheduler or the NT shell's At command to schedule how often you want the script to run.
- Configure WLBS.pl on each server in the cluster. You need to configure:
  - The local URLs you want to test on that particular server. Don't use the WLBS cluster alias as a test target. The URLs must reside on that local server because you want to determine failures only on that machine. Use a comma to separate multiple URLs.
  - The number of test cycles per minute. Valid entries are 1, 2, 3, 4, 5, 6, and 10 times per minute. The higher the number of test cycles, the quicker the failover (but the higher the CPU overhead). The lower the number of test cycles, the lower the CPU overhead (but the longer the delay before failover).
  - The number of seconds the script waits for the Web server to respond. Valid entries are 1, 2, 3, 4, and 5 seconds. If the Web server has a slow response time, you might want to increase the time slightly from the default of 1 second. Leave this setting at the default if you're unsure about your server's response time. Increasing the wait time to more than 5 seconds can cause the Sleep command in the script to fail.
  - The SMTP server entries if you want to send email or pager messages. Enter the server address and the information for the From and To fields. Use a comma to separate multiple recipients.
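Before you deploy, it can help to sanity-check the values you plan to enter. The following Python sketch applies the rules listed above (comma-separated URLs and recipients, the valid test-cycle counts, and the valid wait times). It doesn't reflect how WLBS.pl stores or names its settings; the function and parameter names are hypothetical.

```python
# Valid values as listed in the configuration steps above.
VALID_CYCLES_PER_MINUTE = {1, 2, 3, 4, 5, 6, 10}
VALID_WAIT_SECONDS = {1, 2, 3, 4, 5}


def split_list(text):
    """Split a comma-separated entry, dropping blanks and stray spaces."""
    return [item.strip() for item in text.split(",") if item.strip()]


def validate_settings(urls_text, cycles_per_minute, wait_seconds=1, recipients_text=""):
    """Return the parsed lists plus a list of human-readable problems."""
    errors = []
    urls = split_list(urls_text)
    if not urls:
        errors.append("enter at least one local URL to test")
    if cycles_per_minute not in VALID_CYCLES_PER_MINUTE:
        errors.append("test cycles per minute must be 1, 2, 3, 4, 5, 6, or 10")
    if wait_seconds not in VALID_WAIT_SECONDS:
        errors.append("wait time must be a whole number from 1 to 5 seconds")
    return {"urls": urls, "recipients": split_list(recipients_text), "errors": errors}


if __name__ == "__main__":
    # Hypothetical values for one cluster member.
    result = validate_settings(
        "http://localhost/default.htm, http://localhost/status.htm",
        cycles_per_minute=10,
        wait_seconds=1,
        recipients_text="admin@example.com, pager@example.com",
    )
    print(result["errors"] or "settings look valid")
```

Each valid test-cycle count divides 60 evenly, so the pause between cycles always comes out to a whole number of seconds.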
Cover All Your Bases
If you use WLBS.pl with WLBS, you'll have high redundancy in your system. If a Web server fails, WLBS will transfer that server's Web service to the other cluster members. If a Web server is online but has a Web service problem, WLBS.pl will detect the problem and transfer that Web server's load to the other cluster members. So, barring the simultaneous failure of all the cluster members, you'll have all your bases covered.