Several years ago, I was an administrator of a large, 24 x 7 IBM shop. I was the guy everybody called in the middle of the night when the system went down. I lost many nights of sleep while trying to diagnose a problem for a wide-awake systems operator. I needed a fail-safe solution that would let me sleep through the night and fix the problem when I got to work the next morning.
Sleeping better is a major reason many of the administrators I interviewed for this article love NT clusters: These administrators can rest peacefully knowing clusters are protecting their systems from failure. The administrators I interviewed said their solution worked as advertised: They can recover their systems with minimal intervention, and most end users are unaware of any failure.
The good news is that NT clustering solutions are available today. I'm not talking about theory; I have case studies to share. The bad news is that some companies implement clustering because NT alone isn't stable enough for their environments. In some cases, clustering is like blue block--maximum blue screen of death protection. Smear on some General Protection Fault (GPF) 30 to protect yourself from those nasty blue rays.
This article looks at how several organizations use NT-based clustering to satisfy availability and scalability needs. The administrators I interviewed for this article used different criteria to evaluate clustering solutions for their unique situation. Based on these interviews, I've concluded that no single solution solves every availability and scalability problem--the market demands a variety of solutions. I hope I've included enough real-life scenarios so you can see an NT clustering solution that might meet your organization's needs.
|Cluster use: Various Windows NT applications for remote users|
|Solutions: Two Cubix RemoteServ/IS dual Pentium Pro server cabinets, Citrix WinFrame|
BlueCross/BlueShield of Oregon
As part of a large, nationwide insurance provider, BlueCross/BlueShield of Oregon combines Citrix WinFrame software with Cubix RemoteServ/IS hardware to create a clustering solution that supports its remote users. BlueCross/BlueShield supports its healthcare facilities, employees, and partners through this connection so users can access centralized billing and patient information. In this configuration, WinFrame provides remote access and Cubix adds availability and load balancing.
Cubix provides clustering within one cabinet, which reduces the need for computer room floor space. BlueCross/BlueShield configured each of two Cubix cabinets with two dual-processor Pentium Pro systems and one single-processor Pentium Pro system. The Cubix hardware is currently configured to let as many as 15 users dial in simultaneously. However, BlueCross/BlueShield can expand the Cubix system well beyond this configuration. BlueCross/BlueShield plans to replicate this solution as the need arises. The Cubix hardware keeps the cluster load-balanced, and in the event of a failure, the system redirects a user to an available node.
"The management software is really slick. You can instantly see errors reported to the administrator's desktop," said systems administrator David Blackledge. "The Cubix hardware is really easy to maintain, and administrators can support the system from their desk."
Although Blackledge recommends this solution to anyone looking for solid remote access, he would like to see a more flexible licensing model. In a WinFrame environment, the licenses are tied to the processors. If one CPU dies, your licenses might not transfer to the surviving node. Certain licenses can float between processors; however, you must have a minimum of five licenses per motherboard.
|Industry: Online bookstore|
|Cluster use: Electronic commerce|
|Solutions: Valence Research Convoy Cluster Software, Qualix Group Octopus DataStar, Microsoft Internet Information Server 3.0, HP NetServers, Server Technology Sentry Remote Power Manager (http://www.servertech.com), Ipswitch WhatsUp (http://www.ipswitch.com)|
Books.com claims to be the first Web-based bookstore to offer online purchasing of books, videos, and music. The company went live in 1992, and Books.com now serves more than 60,000 user sessions per day from its clustered Web site.
To update information for its online store, Books.com developers make changes to an NT file server at one location. The company uses a T1 connection and Octopus DataStar software to replicate changes to three separate nodes in another location. To cluster and load-balance the three nodes, Books.com uses Convoy Cluster Software on HP NetServers (for information about Convoy Cluster Software, see Jonathan L. Cragle, "Load Balancing Web Servers," page 68).
Figure 1 shows the Books.com cluster network model. This configuration lets each node handle one-third of the user load. When a user visits the Web site, the system combines files from the NT file server with data in Sybase and Oracle databases to dynamically generate the information the user's Web browser displays.
"The most common problem is blue screens on NT Server," said administrator Dennis Anderson. When a Convoy node fails, the other nodes pick up the load, and end users are unaware of any disruption in service. Fortunately, Anderson can Telnet into the Sentry Ambassador remote power-up box and restart the system if a node fails during the night. After the system reboots, the recovered node can rejoin the cluster. Octopus DataStar then uses its journal of changes to synchronize the node. Most of the objects the system replicates to the nodes are small HTML files, so the recovered node usually resynchronizes within 30 seconds after rebooting. The Convoy node then rejoins the cluster about 10 seconds later.
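Journal-based resynchronization of the kind described above can be sketched in a few lines of Python. This is not how Octopus DataStar works internally; the entry layout (sequence number, file path, new content) and the `resync` function are hypothetical names chosen for illustration. The sketch assumes the source server numbers every change and the recovering node remembers the last change it applied.

```python
from typing import Dict, List, Tuple

# Hypothetical journal entry: (sequence number, file path, new file content).
JournalEntry = Tuple[int, str, str]


def resync(node_files: Dict[str, str],
           last_applied: int,
           journal: List[JournalEntry]) -> int:
    """Replay journal entries the node missed while it was down.

    Entries at or below `last_applied` are skipped, so only the changes
    made during the outage are copied. Returns the new last-applied
    sequence number.
    """
    for seq, path, content in sorted(journal):
        if seq <= last_applied:
            continue  # the node already has this change
        node_files[path] = content
        last_applied = seq
    return last_applied


# A node that went down after change 1 catches up on changes 2 and 3.
journal = [
    (1, "index.html", "v1"),
    (2, "books.html", "v1"),
    (3, "index.html", "v2"),
]
node = {"index.html": "v1"}
last = resync(node, last_applied=1, journal=journal)
```

Because only the missed entries are replayed, the catch-up time depends on how much changed during the outage, not on the total size of the site. That is consistent with the short resynchronization times Books.com reports for small HTML files.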
Books.com required load balancing and failover in its clustering solution, so it had to eliminate all but a few solutions from consideration. The company downloaded a demonstration of Convoy Cluster Software from Valence Research's Web site. The demonstration helped seal Books.com's decision. "Convoy Cluster Software performs really well," said Anderson. "You don't really notice it, but it works."
Despite this solution's success, the company has discovered one annoying problem: Convoy can't detect when Internet Information Server (IIS) fails. As a result, when IIS fails, the entire cluster fails. Anderson uses Ipswitch's WhatsUp to work around this problem. Now if IIS fails, WhatsUp stops that node, and Convoy removes it from the cluster and alerts Anderson via pager. Anderson hopes Convoy will detect this type of problem in future versions.
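The workaround above amounts to an application-level health check sitting beside the cluster software. The Python sketch below illustrates the idea only; it is not WhatsUp's or Convoy's interface. The `Watchdog` class, the `http_probe` helper, and the `max_failures` threshold are all hypothetical, and the sketch assumes a node counts as failed after several consecutive missed HTTP probes.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, List


def http_probe(host: str, timeout: float = 2.0) -> bool:
    """Return True if the Web server on `host` answers an HTTP request."""
    try:
        with urllib.request.urlopen(f"http://{host}/", timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        return False


class Watchdog:
    """Track consecutive probe failures and drop nodes that stop answering."""

    def __init__(self, nodes: Iterable[str],
                 probe: Callable[[str], bool] = http_probe,
                 max_failures: int = 3) -> None:
        self.probe = probe
        self.max_failures = max_failures
        self.failures = {node: 0 for node in nodes}
        self.in_cluster = set(self.failures)

    def check_once(self) -> List[str]:
        """Probe every node once; return the nodes removed on this pass."""
        removed = []
        for node in sorted(self.failures):
            if self.probe(node):
                self.failures[node] = 0
                continue
            self.failures[node] += 1
            if (node in self.in_cluster
                    and self.failures[node] >= self.max_failures):
                self.in_cluster.discard(node)
                removed.append(node)  # a real setup would also page the admin
        return removed


# Simulated run: web2's Web service is down, the other nodes answer.
down = {"web2"}
dog = Watchdog(["web1", "web2", "web3"],
               probe=lambda node: node not in down,
               max_failures=2)
first_pass = dog.check_once()
second_pass = dog.check_once()
```

Requiring more than one missed probe before removing a node is a design choice that avoids ejecting a server over a single slow response. The sketch doesn't model the recovered node rejoining the cluster.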
"NT is not a very robust Web serving platform," said Anderson. "NT has a lot of maturing to do." Specifically, Anderson would like Microsoft to focus on reliability.
|Cluster use: Manufacturing control expert system|
|Solutions: Marathon Technologies Endurance 4000 fault-tolerant cluster, Gensym G2, Microsoft SQL Server, Oracle, SAP|
Celanese runs a continuous flow (24 x 7) process to manufacture 1000-pound to 1200-pound (4' x 4' x 4') bales of cellulose acetate tow for producing cigarette filters and suit liners. If the process stops for 1 minute, the bales harden and require a massive cleanup and restart process that can take days.
In the past, Celanese employees had to continually measure the manufacturing equipment (e.g., programmable logic controllers, scales, presses, extrusion devices, sensors, dryers) to determine whether individual bales met the company's strict quality standards. Now, the company has automated the process using Gensym's G2, an NT-based software solution. G2 continually receives measurements from the manufacturing equipment and uses its built-in expert system software to determine the quality of the bales. G2 records quality measurements into its SQL Server database and adjusts equipment as necessary. At specified intervals, the SAP production planning system queries the database for acceptable bales and records them into an Oracle NT database.
So why did Celanese decide to use an NT cluster solution? "We felt like that's where everything was headed. Doing the same thing with UNIX would cost $500,000," said administrator Jim Fraser. "The advantages of having a common platform for our business and manufacturing users are too numerous for the accountants to ignore. I'm not afraid to use NT. I supported HP-UX for 7 years, and NT is just as stable as HP."
When Celanese automated its manufacturing process, the company had only one requirement: absolutely no downtime. This requirement let Celanese narrow its search for a clustering solution to one vendor--Marathon Technologies. Marathon's Endurance 4000 software and hardware solution is truly fault tolerant. Both the data and compute nodes are completely redundant. As a result, client machines don't need to restart following a system failure, and the G2 software can't fail. Future versions of the product will support symmetric multiprocessing (SMP) nodes. Endurance 4000 ties four systems (Celanese uses four 200MHz servers) together to create one cluster. Figure 2, page 127, shows the Celanese cluster network model.
Celanese selected Marathon's Endurance 4000 because it's the only solution available with sub-second failover time. In fact, it doesn't really fail over; it just disconnects the redundant node.
Celanese has experienced two hardware failures, and the Marathon cluster worked both times without a hitch. In less than 5 milliseconds, the surviving node took over the load. "Marathon Endurance 4000 is a wonderful solution," said administrator Jim Fraser. "Marathon works hard for its customers."
First Union Capital Markets Group
Corporate email is the mission-critical application of the 90s. Take your mail server offline for a few minutes, and watch your Help desk light up like a Christmas tree. First Union Capital Markets Group in North Carolina uses Microsoft Cluster Server (MSCS) to keep its Exchange and file and print servers running 24 x 7. Previously, the company had to use twice as many clusters to do the same amount of work they do today. "In the old days, I had Compaq standby clusters. Now I use active/active clusters, and both nodes are working," said Sid Vyas, First Union CIO. "I'm saving a huge amount of money on the hardware."
Vyas recommends a single-vendor clustering solution. During the testing phase, First Union unsuccessfully tried to mix and match hardware. Vyas also recommends a fibre channel connection over a SCSI-switching solution for increased disk throughput, a 50 percent faster failover time, and a longer maximum cable length between nodes (500 meters vs. 25 feet).
Vyas chose Compaq ProLiant servers to run MSCS because Compaq was the only company to certify a fibre channel connection. This configuration lets First Union place its servers and storage in separate buildings and keep nodes in separate data centers on different floors of the building. Distributing the computing resources increases the fault tolerance in case of a disaster.
Vyas admits MSCS has a problem with duplicate share names. Two shares can't have the same name after failover. If you have the same share names on each node, the failing node's share will disappear. In addition, print queues must have unique names on each node, even though they might point to the same printer. First Union has notified Microsoft of this shortcoming, but was still waiting for a solution at press time.
Vyas said that future plans include clustering SQL Server. First Union's database of choice is Sybase on UNIX; however, the company is developing many new applications on SQL Server.
|IBM World Registry Division, Washington, D.C.|
|Industry: Top secret software development|
|Cluster use: High availability software development environment|
|Solutions: Computer Associates ARCserve Replication for Windows NT|
IBM World Registry Division
Imagine developing applications that are so top secret that you can't back them up on tape. This scenario became reality for Mark Shoger of Keane Federal Systems. IBM World Registry Division (WRD) hired Keane to help with the company's development efforts. Keane said WRD needed a realtime backup system to handle open files and systems policies. WRD couldn't have any removable media, because it would void the company's top-secret classification requirements. Finally, WRD needed 99.99 percent availability and no data loss. To meet these requirements, the company turned to Computer Associates' ARCserve Replication realtime backup and recovery system.
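To put the 99.99 percent availability requirement in perspective, it helps to convert it into permitted downtime. The short Python calculation below is a worked example only; the function name `allowed_downtime_minutes` is made up for illustration, and it assumes a 365-day year with availability measured around the clock.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_hours: float = 365 * 24) -> float:
    """Minutes of downtime permitted in a period at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return period_hours * 60 * unavailable_fraction


# Downtime budget for one year at 99.99 percent availability.
yearly_budget = allowed_downtime_minutes(99.99)
```

At 99.99 percent, the budget works out to roughly 52.6 minutes of downtime per year, which leaves little room for a manual restore.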
WRD has 500 users attached to four large NT servers that handle development, and the company runs Lotus Notes Domino for group communications. Each of the four primary servers connects to a backup server that mirrors the data on all four primary servers. Figure 3, page 128, shows the WRD cluster network model.
ARCserve Replication runs on each server and monitors threshold levels such as hard disk space and network performance. The primary servers are dual Pentium II systems, each with 512MB of RAM and 4GB of storage (16GB total). The backup server has 20GB of storage.
The payoff for using this NT-based solution is simple. If a problem occurs, such as a hard disk crash, ARCserve Replication detects the problem and switches users to the backup server within a few seconds. The users are unaware that any change has taken place. Shoger can replace the failed hard disk at his leisure. He then initiates the failback procedure, which synchronizes the new disk and reroutes the users to the primary server. "I've been doing network administration for a long time and this failure and recovery process impresses me," said Shoger. "One time, a NIC failed and the system ran the whole weekend on the backup server before I noticed it." He also points out that the ARCserve Replication software was easy to install and maintain.
Shoger recommends this system for large networks. It requires an extra system for backup and recovery, which may be prohibitive for small networks. If you need this kind of protection, Shoger recommends using a backup system that has 25 percent more power than any of the systems it's protecting.
Looking ahead, WRD might implement an additional backup server at a remote site for disaster recovery. Such a configuration would help keep the company up and running, even if the primary data center blew up.
|John C. Lincoln Hospital, Arizona|
|Cluster use: Increase availability of PeopleSoft (Oracle on NT), Office 97, and medical application (Cerner) for a level 1 trauma center|
|Solutions: Nine two-node clusters, Vinca StandbyServer for Windows NT, Compaq ProLiant servers|
John C. Lincoln Hospital
Downtime is not an option for the John C. Lincoln Hospital level 1 trauma center in Arizona. To maintain its level 1 status, the hospital must be able to respond to a life-threatening emergency at all times. Vinca StandbyServer for NT software keeps the trauma center's 7000-user NT environment continuously running.
So how does this solution work? Imagine that the primary server fails and displays a blue screen. Within 30 seconds, the first set of users are working on the standby server. However, because not all applications fail over gracefully to the standby server, some users experience a GPF and have to reboot. After they reboot they automatically connect to the standby server and are up and running. To restore the primary server, you simply break the mirror, reboot the primary server, re-establish the mirror, and reboot the primary server again.
The hospital chose Vinca because of its low overhead on the primary server. "Vinca runs clean and light, and you hardly know it's there," said Mark Jablonski, former network administrator for the hospital. "Overall, Vinca is a sleep saver. If users are working at night, you can keep sleeping," said Jablonski. "Anything that keeps my beeper from going off is a friend of mine."
Jablonski recommends researching the resource overhead before you buy. "If the Primary Domain Controller goes over 50 percent CPU utilization, it's hard to log on to the PDC," he said. He recommends checking the cluster solution to ensure the CPU utilization doesn't go through the roof. You will also want to check your clustering solution against your night load when you run backups, virus scanners, and other administrative applications. "It's at night when you get beeped," said Jablonski.
Besides researching the resource overhead, you need to check the reliability of the clustering solution: Does it fail over five out of five times? "Vinca is successful 90 percent of the time. Sometimes you must restart services manually on failover," said Jablonski. Also, look for quality support. Vinca's support is helpful and knowledgeable. For round-the-clock coverage, you can purchase Vinca's 24 x 7 premium support.
The hospital plans to implement an active/active cluster. With the current active/standby configuration, nine of the nodes aren't active. Jablonski said the hospital is investigating active/active solutions.
|Surplus Direct, Oregon|
|Industry: Online auction and discount computer store|
|Cluster use: Electronic commerce|
|Solutions: Resonate Central Dispatch, which includes Dispatch Manager; Microsoft Visual SourceSafe; Microsoft Internet Information Server 3.0; Microsoft Cluster Server; Microsoft SQL Server 6.5, Enterprise Edition; Tandem CS150; LANWARE NTManage; Westwind Technology Webconnect|
Do you want a great deal on hardware and software? That's the promise of Surplus Direct, which acts as a clearinghouse for publishers, distributors, and retailers of overstocked, factory refurbished, or distressed inventories. Surplus Direct sells or auctions these items over the Web. To provide its customers with the best service, the company needed a solution that could run 24 x 7 and scale easily. In addition, the company dynamically generates about 90 percent to 95 percent of its Web pages from a SQL Server database. These requirements led Surplus Direct to use a combination of NT products for its clustering solution. Figure 4 shows the Surplus Direct cluster network model.
Surplus Direct uses Resonate's Dispatch Manager software, which is part of Resonate's Central Dispatch product, to monitor incoming Web traffic by open connections, CPU load, and network latency. The software balances the incoming Web traffic with its clustered schedulers. The schedulers assign the workload to one of six front-end IIS-based Web servers. These schedulers get a workout right before auction closing at 11 a.m. each day when Web traffic spikes considerably. The Web servers request data from SQL Server systems running on Tandem CS150 clustered hardware. Surplus Direct uses MSCS to provide the clustering software.
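Balancing on open connections, CPU load, and network latency can be pictured as scoring each Web server and sending the next request to the server with the lowest score. The Python sketch below illustrates that general idea under stated assumptions; it is not Resonate's algorithm. The `ServerStats` fields, the `pick_server` function, and the weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ServerStats:
    name: str
    open_connections: int
    cpu_load: float      # fraction of CPU in use, 0.0 to 1.0
    latency_ms: float    # measured network latency to the server


def load_score(s: ServerStats,
               w_conn: float = 1.0,
               w_cpu: float = 100.0,
               w_latency: float = 0.5) -> float:
    """Combine the three metrics into one number; lower means less loaded.

    The weights are illustrative guesses, not values from any product.
    """
    return (w_conn * s.open_connections
            + w_cpu * s.cpu_load
            + w_latency * s.latency_ms)


def pick_server(servers: Sequence[ServerStats]) -> ServerStats:
    """Return the server with the lowest combined load score."""
    if not servers:
        raise ValueError("no servers available")
    return min(servers, key=load_score)


pool = [
    ServerStats("web1", open_connections=40, cpu_load=0.90, latency_ms=10),
    ServerStats("web2", open_connections=60, cpu_load=0.20, latency_ms=8),
    ServerStats("web3", open_connections=30, cpu_load=0.95, latency_ms=30),
]
chosen = pick_server(pool)
```

In this example, web2 wins even though it holds the most open connections, because its CPU is nearly idle. Weighing several metrics together matters during a traffic spike such as the daily auction close, when connection counts alone can mislead.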
Surplus Direct likes using Dispatch Manager to produce bar charts and tables to graphically monitor the pool of schedulers and Web servers. In addition, the company uses LANWARE's NTManage to graphically monitor the network traffic running through its routers, hubs, and switches.
Surplus Direct also takes advantage of the system's scalability. The company can easily add front-end servers, which increase the throughput of the Web sites. Surplus Direct uses Visual SourceSafe to handle source-code version control and replicate pages and changes among all Web server nodes. The company uses Westwind Technology's Webconnect to access SQL Server from the Web servers. "I've been able to sleep better at night because there's no single point of failure," said administrator Mark Daley.
Surplus Direct looked for hardware solutions during evaluation, but couldn't find anything that supported virtual IP addresses. The company needed each user session to stay in the assigned pool. Surplus Direct also needed a software solution that let it use the fastest machines available and add as many Web servers as necessary. "Be patient in finding the right solution--stay very objective," advised Daley.
|Tulip Computers, Netherlands|
|Industry: Computer manufacturing|
|Cluster use: SAP materials management, production planning, sales, and distribution|
|Solutions: NCR's LifeKeeper two-node cluster, SAP, Oracle 7 on Windows NT|
Is NT ready for the enterprise? Tulip Computers thinks so. The company produces and develops PCs for the European and Asian business-to-business market. In addition, Tulip recently purchased Commodore, which makes computers for the European consumer market. Tulip currently has 700 employees and revenue of $300 million.
Tulip uses a two-node LifeKeeper active/standby cluster to run an Oracle 7 on NT database on NCR 4300 4 x 200 hardware with 1.5GB of RAM and 80GB of hard disk space. Five different application servers running SAP's materials management, production planning, sales and distribution, warehouse, and financial controlling modules access the Oracle on NT database. Each application server runs on an NCR 4-way SMP server. Although Tulip hasn't clustered these applications, the company plans to put the SAP application servers into LifeKeeper clusters to maximize availability. Figure 5 shows Tulip's cluster network model.
John Hoogendoorn, Tulip's IS manager, recommends running a database application on an active/standby cluster configuration instead of an active/active configuration. He believes you can more easily recover and manage this environment. Hoogendoorn also recommends finding a cluster-aware backup solution. Tulip's current backup solution isn't cluster aware, so Hoogendoorn bought a separate backup system for the standby node of the cluster.
"The NCR hardware has worked so well that we haven't seen it fail over in production," said Hoogendoorn. "But we've tried to manually fail over and that worked." Hoogendoorn said the system completely failed over in 5 minutes. LifeKeeper has developed various recovery kits (or scripts) to handle specific recovery needs for applications such as SAP and Oracle. For a complete list of recovery kits, visit LifeKeeper's Web site.
[Editor's note: This article assumes familiarity with clustering. For an overview of clustering for Windows NT and a list of clustering-related terms and technologies, see Joel Sloss, "Clustering Solutions for Windows NT," June 1997.]
ARCserve Replication for Windows NT
Computer Associates * 800-243-9462
Web: http://www.cheyenne.com/storage
Citrix Systems * 954-267-3000
Convoy Cluster Software
Valence Research * 503-531-8718
Digital Clusters for Windows NT
Digital Equipment * 800-344-4825
Microsoft Cluster Server
Microsoft * 425-882-8080
Web: http://www.microsoft.com/ntserverenterprise
Marathon Technologies * 978-266-9999 or 800-884-6425
Web: http://www.marathontechnologies.com
NCR * 937-445-5000
Web: http://www.ncr.com/product/nt
Qualix Group * 650-572-0200 or 800-245-8649
Cubix * 702-888-1000 or 800-829-0550
StandbyServer for Windows NT
Vinca * 801-223-3100 or 888-808-4266
Corrections to this Article:
- "NT Clustering Solutions Are Here" incorrectly identified Sid Vyas as the CIO of First Union Capital Markets Group. His correct title is vice-president (non-UNIX servers). Wayne Ginion heads Capital Markets' technology division.