Clusters for Everyone

The other day, a hardware failure brought down our Exchange server. This failure created a panic in our user community because we consider email availability as important as a dial tone. Had we been using a Windows NT cluster, we users would never have noticed the problem. By providing continuous availability through replication, an NT cluster could have saved us a lot of frustration and prevented the loss in productivity.

Today's NT clustering solutions solve one business computing problem: availability. By replicating data, applications, and even entire systems, clustering lets two or more systems watch each other's back and take over the workload (user connections, applications, and services) in case one system fails. This article will review the types of clustering solutions currently available, categorize clustering solutions, and illustrate what types of business computing problems clustering can help solve now.

So What's a Cluster Anyway?
A cluster is a group of whole, standard computers that work together as a unified computing resource and that can create the illusion of being one machine, a single system image. (With NT clusters, the term whole computer, which is synonymous with node, means a system that can run on its own, apart from the cluster. If you're not familiar with clustering terms, you can refer to "Clustering Terms and Technologies.") This unified computing resource ensures availability because any node can take on the workload of any other node that happens to fail.

Clusters come in three configuration types: active/active, active/standby, and fault tolerant. Let's examine each type:

Active/active: All nodes in the cluster perform meaningful work. If any node fails, the remaining node (or nodes) continues handling its workload and takes on the workload from the failed node. Failover time is between 15 seconds and 90 seconds.
Active/standby: One node (the primary node) performs work, and the other (the standby, or secondary node) stands by waiting for a failure in the primary node. If the primary node fails, the clustering solution transfers the primary node's workload to the standby node and terminates any users or workload on the standby node. Failover time is between 15 seconds and 90 seconds.
Fault tolerant: A fault-tolerant cluster is a completely redundant system (disk and CPU) whose goal is to be available 99.999 percent of the time. That goal translates to fewer than 6 minutes of downtime per year. Both nodes of a fault-tolerant cluster simultaneously perform identical tasks; the nodes' workloads are redundant. Failover time is less than 1 second.
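To see where the "fewer than 6 minutes" figure comes from, and why a fault-tolerant cluster needs subsecond failover, you can work through the arithmetic. The following Python sketch is only an illustration; the function names are hypothetical, and it assumes a 365-day year and that every second of failover counts as downtime.

```python
# Convert an availability target into the downtime it permits per year.
# Assumptions (not from the article): a 365-day year, and every second
# spent in failover counts as downtime.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent):
    """Return the minutes of downtime per year that a target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def failovers_within_budget(availability_percent, failover_seconds):
    """Return how many failovers of a given length fit in the yearly budget."""
    budget_seconds = allowed_downtime_minutes(availability_percent) * 60
    return int(budget_seconds // failover_seconds)


five_nines = allowed_downtime_minutes(99.999)
print(f"99.999 percent allows about {five_nines:.2f} minutes per year")

# Worst-case and best-case failover times quoted for active/active
# and active/standby clusters, and a 1-second failover for comparison.
slow = failovers_within_budget(99.999, 90)
fast = failovers_within_budget(99.999, 15)
subsecond = failovers_within_budget(99.999, 1)
print(f"90-second failovers per year: {slow}")
print(f"15-second failovers per year: {fast}")
print(f"1-second failovers per year: {subsecond}")
```

Under these assumptions, a cluster that takes 90 seconds to fail over uses up the whole yearly budget in a handful of failures, which is why the 99.999 percent goal goes hand in hand with subsecond failover.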

To illustrate the definition of a cluster, let's say you have users doing file and print on Server A and another group of users accessing an Oracle database on Server B. Servers A and B are nodes in an active/active cluster. If Server A fails, Server B continues handling its workload and picks up Server A's workload. The users accessing the Oracle database do not notice any change in their service; the users doing file and print at most experience a short delay.
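The Server A and Server B example can be sketched as a small Python model. This is a toy illustration, not how any clustering product works internally: the Node class and the fail_over function are hypothetical names, and choosing the least-loaded survivor as the takeover target is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A whole computer in the cluster and the workloads it is running."""
    name: str
    workloads: List[str] = field(default_factory=list)
    alive: bool = True


def fail_over(failed, cluster):
    """Mark a node as failed and hand its workloads to a surviving node.

    Assumption for this sketch: the survivor with the fewest workloads
    takes over everything the failed node was running.
    """
    failed.alive = False
    survivors = [n for n in cluster if n.alive]
    if not survivors:
        raise RuntimeError("no surviving node can take over the workload")
    target = min(survivors, key=lambda n: len(n.workloads))
    target.workloads.extend(failed.workloads)
    failed.workloads = []
    return target


server_a = Node("Server A", ["file and print"])
server_b = Node("Server B", ["Oracle database"])
cluster = [server_a, server_b]

# Server A fails; Server B keeps its own work and picks up Server A's.
takeover = fail_over(server_a, cluster)
print(takeover.name, "now runs", takeover.workloads)
```

In this model, Server B's own workload stays at the front of its list untouched, which mirrors the point that the Oracle users see no change while the file and print users wait for the takeover.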

NT Clustering Solutions
As the need for availability becomes ever more crucial in the NT environment, many third-party vendors and Microsoft have introduced or are about to introduce clustering solutions for NT. To help you evaluate these clustering solutions, let me briefly explain Microsoft's clustering initiative, Wolfpack, and categorize its capabilities in comparison with those of some prominent third-party clustering solutions. (For reviews of several individual clustering products, including Wolfpack, see Lab Reports.)

Wolfpack
Wolfpack is Microsoft's two-node, active/active clustering solution and set of APIs for NT. Wolfpack's purpose is to provide high availability to your NT Server environment.

Wolfpack will have an effect in several significant areas. First, you can expect all server manufacturers who want to reach NT customers to offer Wolfpack-based clustering support this year. Even a year before its release, Wolfpack had the backing of Digital Equipment, Compaq Computer, Tandem, Intel, Hewlett-Packard, NCR, and IBM.

Theoretically, Wolfpack will work on any two Intel-based or any two Alpha-based servers, but you can't mix Intel and Alpha. However, in practical terms, the number of supported systems will be very restricted because to get on the Wolfpack Hardware Compatibility List (WHCL), each manufacturer must test complete configurations (system, disk subsystem, and SCSI adapter) for compatibility. This approach stands in contrast to NT's existing Hardware Compatibility List (HCL), which lets manufacturers list individual system components. For the WHCL's first release, Microsoft will let each manufacturer list only two configurations. Microsoft will support Wolfpack only for systems on the WHCL, so don't try to build your own Wolfpack clustering solution. Although these requirements will initially limit the selection of Wolfpack-compliant configurations, the WHCL will grow over time.

The second area that Wolfpack will affect is storage. In a Wolfpack-based solution, you need only enough storage in your servers to run NT Server and Wolfpack. A disk subsystem that both servers share will provide the bulk of your storage. As a result of this approach, server manufacturers will want to differentiate themselves by improving their storage performance. Those manufacturers that don't have their own subsystems will have to obtain them from storage providers such as CMD Technology, Data General, and BoxHill Systems. Some manufacturers, such as Compaq, will use clusters as a way to promote fibre-channel-based storage solutions because fibre-channel storage has significant advantages over SCSI in both throughput and cable length.

Third, Wolfpack will affect server applications. Wolfpack is not only a clustering solution, but a set of APIs. These APIs let developers make their server application "cluster aware." Such awareness could mean easier installation in a clustering environment, better failover capabilities, and the ability to scale an application beyond one node. For example, Microsoft plans to use the Wolfpack APIs with its Transaction Server to let two nodes work on the same SQL Server database query. This technology combination is fundamental to Microsoft's plans to provide enterprise scalability.

The Wolfpack APIs have been available to developers for only a short time, so only a few applications will initially be available. However, as the adoption of clusters becomes more commonplace, the demand for cluster-aware applications will increase as well. Expect Microsoft's BackOffice applications to become cluster aware during 1997 and 1998.

Fourth, Wolfpack will have an impact on other NT clustering solutions. Many competing NT clustering solutions have already declared support for the Wolfpack APIs. This API support will let Microsoft's competitors support Wolfpack cluster-aware applications and still provide enhanced functionality over the Wolfpack solution.

Finally, the price and availability of Wolfpack-based solutions will drive NT cluster solutions into the mid-to-low end of the server market. The price of Wolfpack-based solutions is about 20 percent of the price of solutions available for UNIX. This pricing alone will make companies that have never considered clustering take a look at it. In addition, the availability of Wolfpack-based solutions from many vendors will create competition, improve awareness in the market, and help stimulate demand in the mid-to-low markets that they serve.

Third-Party NT Clustering Solutions
Wolfpack isn't the only game in town. In fact, several solutions are more mature than Wolfpack, offer additional functionality, and solve different problems. Table 1 lists some prominent solutions (including Wolfpack) and categorizes the type of clustering solution they offer, their data-handling strategy, their hardware interconnect, and their flexibility in hardware choices. (For a summary of information about the clustering solutions reviewed in this issue, see "Clustering Solutions Feature Summary," and for information about other clustering solutions, see "Buyer's Guide to Clustering Solutions.") Let's look at some of the categories in Table 1, and then we can apply our knowledge of clustering solutions to some real-life scenarios to determine what solution is best for a given situation.

Data handling. NT clusters use one of three data-handling methods: mirroring, switching, or redundancy. In mirroring, one node replicates another node's data. Octopus, NSI, and Vinca rely on this technique. With switching, each node has its own disk source, which may be RAID or just a bunch of disks (JBOD). Both nodes share a SCSI bus, which lets the surviving node take over the failed node's disk. Finally, with redundancy, the clustering solution writes data to both nodes simultaneously.
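A toy Python model can make the difference among the three data-handling methods concrete. Everything here is hypothetical and simplified: disks are dictionaries, and the Node, mirror, redundant_write, and SharedBus names are inventions for the sketch rather than any vendor's interface.

```python
class Node:
    """A cluster node whose local disk is modeled as a dictionary."""

    def __init__(self, name):
        self.name = name
        self.disk = {}


def mirror(source, target):
    """Mirroring: the target node replicates the source node's data."""
    target.disk.update(source.disk)


def redundant_write(nodes, key, value):
    """Redundancy: a single write lands on every node together."""
    for node in nodes:
        node.disk[key] = value


class SharedBus:
    """Switching: the data stays on its disk; only the disk's owner changes."""

    def __init__(self):
        self.owner = {}  # disk name -> Node that currently owns it

    def attach(self, disk_name, node):
        self.owner[disk_name] = node

    def switch(self, failed, survivor):
        """Hand every disk the failed node owned to the survivor."""
        switched = [d for d, n in self.owner.items() if n is failed]
        for disk_name in switched:
            self.owner[disk_name] = survivor
        return switched


a, b = Node("A"), Node("B")

# Mirroring: A holds the data first, and B ends up with a second copy.
a.disk["orders"] = 1
mirror(a, b)

# Redundancy: both nodes receive the write in one step.
redundant_write([a, b], "payments", 2)

# Switching: one copy of the data; ownership moves when A fails.
bus = SharedBus()
bus.attach("disk0", a)
bus.attach("disk1", b)
moved = bus.switch(failed=a, survivor=b)
print("Disks switched to B:", moved)
```

The sketch shows the trade-off in miniature: mirroring and redundancy keep two copies of the data, whereas switching keeps one copy and relies on the shared bus to change which node can reach it.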

Hardware interconnect. The hardware interconnect is the required physical link between the nodes in the cluster. Several solutions require proprietary connection devices. Other solutions use any type of TCP/IP-supported connection, such as Ethernet.

Hardware flexibility. The hardware flexibility column in Table 1 rates available choices for nodes. For example, Stratus' solution works on only Stratus hardware and is therefore rated poor in the flexibility column. Wolfpack requires manufacturers to list complete configurations--not components--on the WHCL, and therefore, receives a rating of fair. Octopus will work with any NT-based servers (Intel, Alpha, MIPS, PowerPC), and therefore, is rated excellent. Vinca will work with any two NT-based servers (Intel only) and therefore, is rated good.

Scenarios
A variety of clustering solutions can solve availability problems in an NT environment. The purpose of the following scenarios is to show how you can apply clustering solutions to solve specific problems.

SITUATION 1
Expanding Your File and Print Server
Problem: Your company has a single-processor Pentium-based NT Server that you use for file and print, and it is running out of steam. Your applications include a heavily used multi-user Access97 database and Office97. You have to reduce downtime, especially with the Access97 database, which has become critical.

Solution: If you buy an additional server, you can use a mirror-based solution such as Octopus to connect the two servers into a cluster. Now you can ease your capacity crunch by putting your Office97 files on one server and Access97 on the other server. At the same time, you can replicate critical data between the servers and create a fault-resilient environment.

Could you use Wolfpack in this situation? You could, but only if your new configuration is on the WHCL, which is highly unlikely right now. Also, Wolfpack requires a SCSI-based disk subsystem, which is an extra purchase.

SITUATION 2
Setting Up a Web-based Storefront Using Merchant Server
Problem: Your company has decided to take orders and payments over the Internet. For optimum performance, you decide to run Merchant Server and Internet Information Server (IIS) on one server and SQL Server on another. Because both servers will have active users, you need an active/active clustering solution. A 30-second delay is acceptable during failover. You have 30 days to deliver.

Solution: Wolfpack isn't shipping yet, so you can go with either LifeKeeper or FirstWatch. Because you have no existing equipment, you can buy a SCSI-based solution (two servers and one disk subsystem) from a single vendor. One possible solution is Data General's NT Cluster-in-a-Box, which comes to you with everything preconfigured from the manufacturer. (For a review of this solution, see "NT Cluster-in-a-Box.") If you can wait until Wolfpack ships, it will also solve your problem.

SITUATION 3
Credit Card Verification Service
Problem: You've decided to cash in on the electronic commerce craze and provide real-time verification for credit card transactions on the Internet. Even a few seconds of failure could result in the loss of millions of dollars of transactions.

Solution: If you're brave enough to try this service on NT, your only solution today is from Marathon Technologies because it's the only solution that offers subsecond failover times and eliminates the need to restart user transactions. Its configuration duplicates both memory (redundant compute nodes) and disk (redundant data nodes).

Marathon Technologies' solution takes four off-the-shelf computers working together to create a cluster. (For details about this solution, see the sidebar, "Marathon Technologies' Endurance 4000.") You do not need to make any software changes.

SITUATION 4
Hot-Site Backup
Problem: As part of your disaster recovery plan, you want to maintain a hot site in case your primary site is destroyed. This plan requires the ability to mirror a server to a location 20 miles from the primary site.

Solution: Most clustering solutions today assume that the cluster nodes are within two miles of each other. Therefore, you need a solution that can provide mirroring across a WAN. Currently, only Octopus, NSI, and Vinca can provide this functionality. (For reviews of these solutions, see "Octopus SASO 2.0," "Double-Take 1.3 Beta," and "Vinca StandbyServer for NT.")

SITUATION 5
Remote Application Access
Problem: You need to provide fault-tolerant remote access to your 500-member sales force. They need 24x7 remote access to your company's applications.

Solution: A Citrix server will solve the remote application access problem. Cubix offers a fault-tolerant solution for Citrix servers by providing load balancing and failover for multiple Citrix servers in a manageable communications cluster. (For a review of the Cubix solution, see "RemoteServ/IS.")

SITUATION 6
OS/2 Users Need Access to Lotus Notes 4.0
Problem: Your OS/2 client users need immediate access to Lotus Notes 4.0 for NT. Lotus Notes is a critical application, so if users lose access for longer than 90 seconds, you're fired.

Solution: Vinca's StandbyServer for NT is one of the few solutions that support OS/2 clients. IBM is one of Vinca's key distributors and provides OS/2 support. Purchase a new server to run Lotus Notes for NT, and use the old server as a standby server.

SITUATION 7
Schedule Upgrades to Your System
Problem: You would rather not spend all your nights and weekends upgrading your systems.

Solution: By putting your servers into a cluster group, you can manually fail over a node during working hours. Remember, the users are still working on the remaining node. Now you can apply a service pack, test it, and pray.

Once you are satisfied that the service pack changes are working, you can manually fail back the node and the workload. Any NT clustering solution currently available will work in this scenario.
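The failover, upgrade, and failback sequence can be written down as a short Python sketch. It is only an outline of the procedure under simple assumptions: the cluster is a dictionary of node names and workloads, the rolling_upgrade name is hypothetical, and the upgrade callable stands in for applying and testing a service pack by hand.

```python
def rolling_upgrade(cluster, upgrade):
    """Upgrade each node in turn while a peer carries its workload.

    cluster: dict mapping node name -> list of workloads (two or more nodes)
    upgrade: callable taking a node name; returns True if the upgrade tests OK
    Returns a log of the steps taken.
    """
    log = []
    for node in list(cluster):
        peer = next(name for name in cluster if name != node)
        moved = cluster[node]

        # Manual failover: the peer takes the workload; users keep working.
        cluster[peer] = cluster[peer] + moved
        cluster[node] = []
        log.append(f"failover {node} -> {peer}")

        ok = upgrade(node)
        log.append(f"upgrade {node}: {'ok' if ok else 'failed'}")
        if not ok:
            break  # leave the workload on the peer until the node is fixed

        # Manual failback: return the workload to the upgraded node.
        cluster[peer] = [w for w in cluster[peer] if w not in moved]
        cluster[node] = moved
        log.append(f"failback {peer} -> {node}")
    return log


cluster = {"Node 1": ["file and print"], "Node 2": ["SQL Server"]}
steps = rolling_upgrade(cluster, upgrade=lambda node: True)
for step in steps:
    print(step)
```

Stopping at the first failed upgrade is a design choice in this sketch: the workload stays on the peer, so users are unaffected while you sort out the node that did not come back cleanly.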

SITUATION 8
Manually Load Balancing Your System
Problem: You have too many applications running on one server while another server is barely used.

Solution: Ordinarily, you have to take down both servers, change their configuration, and restart. If the servers are part of an active/active cluster group, you can manually fail over a single application without taking down an entire node. This approach effectively moves the application from one server to another.

You must make sure the solution supports application failover (as opposed to system failover). Application failover lets you move a single application to another node, whereas system failover moves a node's entire workload and requires taking down the node. For example, even though Octopus is active/active, it supports only system failover today. However, soon after you read this article, Octopus SASO 3.0 will be shipping, and it supports application-level failover.
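As a rough illustration of manual load balancing with application failover, the following Python sketch moves one application at a time from the busier node to the idler one. The function names are hypothetical, and counting applications is a stand-in for real load; an actual decision would look at CPU, memory, and I/O.

```python
def move_application(cluster, app, source, target):
    """Application failover: move one application; both nodes stay up."""
    if app not in cluster[source]:
        raise ValueError(f"{app} is not running on {source}")
    cluster[source].remove(app)
    cluster[target].append(app)


def rebalance_once(cluster):
    """Move one application from the busiest node to the least busy node.

    Assumption for this sketch: load is simply the number of applications.
    Returns the application moved, or None if the nodes are already
    within one application of each other.
    """
    busiest = max(cluster, key=lambda name: len(cluster[name]))
    idlest = min(cluster, key=lambda name: len(cluster[name]))
    if len(cluster[busiest]) - len(cluster[idlest]) < 2:
        return None
    app = cluster[busiest][-1]
    move_application(cluster, app, busiest, idlest)
    return app


cluster = {
    "Server A": ["SQL Server", "IIS", "file and print"],
    "Server B": [],
}
first_move = rebalance_once(cluster)
second_move = rebalance_once(cluster)
print("Moved:", first_move)
print(cluster)
```

With system failover only, the same rebalancing would mean taking Server A down and moving all three applications at once, which defeats the purpose.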

SITUATION 9
Two SQL Servers
Problem: You need high availability for users accessing two independent SQL Server databases, each running on a separate server.

Solution: You need an active/active application clustering solution so that both nodes can be running SQL Server simultaneously. This requirement eliminates Wolfpack from your list of choices, because it can run only one instance of SQL Server per cluster. However, Digital Equipment's Wolfpack clustering add-on pack and NCR's LifeKeeper let you run two copies of SQL Server in the same cluster, allowing each server to be the fallback for the other and thus increasing availability.

SITUATION 10
Scaling Exchange
Problem: You want to scale Exchange to run faster and have high availability. You have a dual Pentium Pro server.

Solution: Adding two CPUs to your server configuration would be nice, but unfortunately, Exchange scales effectively to only two CPUs (for more information about Exchange's ability to scale, see Joel Sloss, "Optimizing Exchange to Scale on NT," November 1996). In fact, the next release of Exchange (version 6.0) has been dubbed the "performance release" and will address this scalability problem. Wolfpack won't address scalability until phase 2, which isn't due until 1998. So are we stuck?

Valence Research's Convoy Cluster claims to add availability and scalability for TCP/IP applications and to provide load balancing among nodes in a cluster. This product is primarily aimed at intranet applications. Convoy Cluster was not available when we tested solutions for this issue. If this solution can scale, it will leapfrog Wolfpack by a year.

Future Trends
As these scenarios demonstrate, Wolfpack is not the appropriate solution in every case. Even so, Wolfpack is having a huge effect on hardware and software vendors.

When Wolfpack phase 2 starts shipping in 1998, developers can use the Wolfpack APIs to create applications that will let cluster nodes work in parallel. The issue of scalability will start a heated debate among system vendors: Is a cluster of 4-way SMP systems better than 8-, 12-, and 16-way SMP systems? If the answer is yes, NT will never have to scale beyond four CPUs in a single system. As long as you can cluster 4-way systems and scale performance, NT will have a price and performance unrivaled in the marketplace.

In the early adoption phase, companies will want to buy complete cluster-in-a-box configurations, hoping to eliminate as many problems as possible. However, as clustering moves mainstream, users will demand the ability to mix and match components. Keeping up with NT's HCL is hard enough, and keeping up with the WHCL will be even harder. Octopus has been on the leading edge for more than two years by letting users mix and match components easily. Other vendors will need to do the same.

As more system vendors support Wolfpack, additional features will provide a competitive advantage. For example, Digital supports Wolfpack, but also offers a cluster add-on package that lets both nodes of a cluster run SQL Server and gives existing users of Digital NT Cluster a migration wizard. Compaq, Tandem, and Dell will enhance their Wolfpack offerings by supporting ServerNet, a high-speed interconnect. NCR supports Wolfpack, but also supports LifeKeeper, which allows three-node clusters, compared with Wolfpack's two-node limitation.

Finally, look for other vendors to solve the scalability problem before Wolfpack. For example, Oracle Parallel Server lets two or more Oracle database server nodes work on the same database, running queries in parallel on multiple nodes. Oracle will try to one-up Microsoft by shipping this level of scalability on NT before Microsoft can release the parallel version of SQL Server (version 8.0).

Corrections to this Article:

In Mark Smith's article, "Clusters for Everyone," we incorrectly reported that Stratus uses a proprietary interconnect. In fact, Stratus uses standard, redundant 100Base-T connections. Though Isis Availability Manager (cluster software) runs on only Stratus, Stratus hardware can run multiple clustering software solutions, including Microsoft's Wolfpack clustering software. Finally, Stratus uses mirroring technology rather than SCSI-switching as was originally reported. For more information, visit the Stratus Web site at http://www.stratus.com.
