SQL Server Clustering

Installation, integration, and 13 clustering gotchas

Microsoft Cluster Server (MSCS) is an important technology that brings a new level of high availability to the Windows NT 4.0 platform. One well-suited candidate for this clustering technology is Microsoft SQL Server. Microsoft has done a good job of integrating SQL Server into the MSCS environment after the fact.

However, Microsoft hasn't done a good job of describing how to install and manage SQL Server in a clustered environment. We learned that SQL Server's setup instructions lacked important information when we first began installing SQL Server-based clusters. Fortunately, we worked through our initial problems and successfully implemented a SQL Server-based MSCS environment. Now we want to help you do the same.

Before we discuss how to plan for and install SQL Server-based clusters, we first introduce you to clustering basics and describe how they apply to SQL Server. We also share a few of the gotchas we experienced during SQL Server cluster implementation.

Clustering Basics

You can think of a cluster as a group of independent computers working together. In the case of MSCS, you join two NT servers, or nodes, via a shared disk subsystem so that the nodes share data to provide high resource availability. The combination of these two servers is a cluster.

Many users think they're connecting to a different server when they access a cluster's resources. A cluster appears as a separate, or virtual, server because the cluster's NetBIOS name and IP address differ from those of either node. But, in reality, both nodes still exist as independent entities.

MSCS supports two-node failover, which means you can have up to two nodes in your cluster, each of which can act as a backup to the other. You can set up your virtual server so that resources, such as a print spooler, file share, or database service, fail over from one node to the other, providing high system availability. MSCS will eventually support scalable clusters, which means that you can incrementally add processing power to the cluster by adding a new node. However, scalable clusters won't be available from Microsoft until well after Windows 2000 (Win2K) ships.

How SQL Server and MSCS Integrate

SQL Server Enterprise Editions 6.5 and 7.0 leverage MSCS's 2-node failover to offer two modes: active-passive and active-active. These modes are modes of the SQL Server application, not the cluster. For the active-passive mode, you create one instance of SQL Server that can run on either node. The other node serves as a hot backup. For active-active mode, two virtual instances of SQL Server are running on the cluster. Although the instances typically run on separate nodes, one node can run both instances. Even if both instances are running on the same physical server, most users think they're accessing separate servers.

In active-passive configurations, the server has one master database (master.dat) and one set of user databases. In active-active configurations, each virtual server has a separate master database and a separate set of user databases because virtual servers can't share data.

MSCS requires each SQL Server instance to have a private set of disk resources. Although nodes can share disk resources, only one node can actively control a disk at a time. Compaq's Distributed Lock Manager (DLM) and Cluster File System (CFS) technologies let both nodes simultaneously access the same disk. However, although Microsoft has a license for the DLM and CFS technologies, SQL Server can't use them at this time.

A common question is whether one mode is better than the other. To answer this question, you need to look at their differences in performance and cost. Suppose you have a SQL Server-dedicated cluster whose nodes have comparable power. During failover, the performance of an active-active configuration degrades because one server must manage both virtual SQL Server instances. In other words, if each node handles 100 users in a normal operation, one node must handle 200 users during a failure. The active-passive configuration provides better performance than its counterpart during failover because the cluster is running only one SQL Server instance.

However, during normal operations, the active-active mode uses server resources more efficiently than the active-passive mode because you're using both nodes. Clusters are expensive to build if you use the second node only for failovers.

Planning Your Cluster

The first step in planning your cluster is to make sure you have the correct software and licenses. To install SQL Server on MSCS, you need:

Two copies of SQL Server 6.5 or 7.0, Enterprise Edition
Two copies of NT Server 4.0, Enterprise Edition, including Service Pack 4 (SP4)
Licenses for the SQL Server and NT servers and clients

For more information about licensing, see the sidebar "The Licenses You Need."

If you're using SP4, the Microsoft article "BUG: Automatic SQL Cluster Failover Doesn't Work w/WinNT 4.0 SP4" at http://support.microsoft.com/support/kb/articles/q216/6/74.asp is essential reading.

After you've installed the necessary software and obtained the licenses, you need to decide which SQL Server cluster mode to use. You can save a few headaches later on by making this decision up front. Because each virtual server needs a separate set of disk resources on the shared disk subsystem, you can run into trouble if you configure the I/O subsystem as one large array in the active-passive mode and then later decide you need to support the active-active mode.

After you've selected the mode, you can plan your cluster's configuration. Suppose you want to set up your cluster in the active-active mode and your cluster's shared disk subsystem is an external RAID cabinet with 10 drives. Figure 1 shows a basic setup. (If you're setting up a cluster that has many group and resource dependencies, you might want to check out the additional resources in the sidebar "For More Information About MSCS," page 25.) To achieve this setup, you use a RAID-1 mirror to configure the Q drive as two drives, and you configure the D and E drives as RAID-5 arrays with four disks. You install virtual SQLCluster1 on the D drive and SQLCluster2 on the E drive to achieve the active-active configuration.

Under normal operations with this setup, SQLCluster1 runs on NodeA, which owns DiskD, and SQLCluster2 runs on NodeB, which owns DiskE. If NodeB fails, SQLCluster2 provides failover to NodeA, which brings DiskE online.

The last step in planning your cluster is to make sure you have set aside enough IP addresses. In MSCS, IP addresses are like potato chips—you can't have just one. You need at least seven addresses to set up an active-active SQL Server cluster that follows Microsoft's default recommendations. You need four IP addresses for the NICs. (Each node in the cluster uses two NICs. One NIC handles intracluster communication in the private high-speed network. The other NIC handles communication with the outside network and serves as a backup for intracluster communication.) You need one IP address for the cluster because it constitutes a virtual server. Finally, you need an IP address for each virtual SQL Server machine. You need even more IP addresses for any file shares, printers, or other applications you intend to place on the cluster.

You must place all the IP addresses, with the exception of those for the private network, on the same subnet. The two IP addresses for the private network must be on the same subnet as each other but on a different subnet from the LAN. Administrators often use private addresses for the cluster's private network. (For more information about private addresses, see Request for Comments—RFC—1918 "Address Allocation for Private Internets," ftp://ftp.isi.edu/in-notes/rfc1918.txt.)

Installing Your Cluster

Installing SQL Server in a cluster differs from installing Exchange Server 5.5, Enterprise Edition in a cluster. Exchange's setup program automatically senses that you're installing Exchange in a cluster and makes the necessary adjustments. To configure SQL Server in a cluster, you first run the standard SQL Server setup utility to install SQL Server on the nodes and then run the SQL Cluster Setup wizard to enable SQL Server's clustering capabilities. SQL Server's Read Me file discusses this setup process in detail.

Don't Let These Gotchas Get You

In our various installation endeavors, we experienced a few gotchas. We hope you can learn from our experiences.

Dealing with open transactions.

Suppose JoeUser executes an update statement in the CustomerDB database running on the VSQL1 virtual SQL Server machine, which NodeA of DBCluster manages. Someone accidentally spills a can of soda down the back of NodeA, creating a big puff of smoke and the immediate failure of NodeA. DBCluster senses that NodeA has failed and brings the VSQL1 resource online on NodeB.

Just like in the startup of a standard SQL Server machine, automatic recovery of each user database is one of the first tasks that occurs in the startup of the virtual SQL Server machine. During automatic recovery, JoeUser's update transaction rolls back because NodeA failed during the update, leaving some rows unchanged.

No user can access CustomerDB for any reason until automatic recovery is complete. Typically, the automatic recovery process completes in just a few seconds. However, JoeUser was updating the balance due amount for 100,000 customers in one transaction, and NodeA failed while updating record number 99,999. Updating 100,000 rows in one transaction takes a while; rolling back all that work during automatic recovery takes even longer. In this case, the automatic recovery process takes 10 minutes. (In reality, automatic recovery might be quicker or slower, depending on several factors, such as how fast your I/O subsystem is.) The CustomerDB database is unavailable for 10 minutes—a far cry from the 35-second average for virtual SQL Server recovery that Microsoft touts. (Technically, the virtual server comes back online in about 35 seconds, but specific databases are unavailable until they've been fully recovered.)

Using TCP/IP sockets.

Stopping and starting SQL Server.

To avoid these unwanted restarts, you must stop and start SQL Server with the Cluster Administrator utility rather than the Control Panel or the standard Service Control utility. If you use administrator scripts that stop and start SQL Server from the command line, MSCS provides a command-line interface for managing cluster resources.

To avoid these unwanted restarts, you must also make sure you don't select Autostart at boot when installing SQL Server. The cluster needs full control over SQL Server service operations. Starting the normal services at boot time prevents the cluster from functioning properly. By default, MSCS brings the SQL Server services online when the cluster starts.

Setting up server account privileges.

Log on as a service

Imagine that during testing, AmyAdmin notices the VSQL1 SQLExecutive service runs smoothly on NodeA (the primary node) but doesn't start during failover to NodeB. Similarly, VSQL2's SQLExecutive service runs smoothly on NodeB but doesn't start during failover to NodeA.

After a little research, AmyAdmin realizes that the SQL Cluster Setup wizard grants privileges (including the Log on as a service privilege) at the machine level on a member server and not the domain level. AmyAdmin had initially configured VSQL1's SQLExecutive service to run under the SQLExec1domain account and VSQL2's SQLExecutive service to run under the SQLExec2 domain account. Therefore, the wizard granted SQLExec1 the Log on as a service privilege on NodeA during the initial SQL Server install but didn't grant this right to SQLExec2 on NodeA. The VSQL2's SQLExecutive service didn't start on NodeA because it didn't have the necessary privileges.

You can avoid this problem three ways. First, you can modify the user rights with the User Manager for Domains. Second, you can modify the user rights with the Control Panel's Services applet. For the VSQL2's SQLExecutive service on NodeA, you need to reset the password to its current value, which has the effect of granting the service account the necessary Log on as a service privilege on the local machine. In other words, if the current password is test, resetting the password to test grants the service account the proper Log on as a service permission. You also need to reset VSQL1's SQLExecutive service on NodeB. Third, you can install SQL Server on two domain controllers rather than on member servers.

Naming nodes.

Setting memory utilization.

In SQL Server 6.5, setting memory utilization in a cluster can be problematic. Suppose AmyAdmin runs VSQL1 and VSQL2 on SQLCluster. This cluster consists of NodeA and NodeB, each of which has 128MB of physical memory. He dedicates each node to running SQL Server.

Through careful analysis, AmyAdmin determines that optimal performance occurs when she allocates 96MB of memory to SQL Server on each node. What happens under the current memory utilization setup if NodeB fails, forcing VSQL1 and VSQL2 to run on NodeA? Together, the two instances of SQL Server require 192MB to perform optimally, but NodeA has only 128MB of physical memory. The 64MB discrepancy will cause excessive paging, drastically reducing SQL Server's performance.

To deal with this problem, Microsoft suggests setting the memory parameter of each virtual SQL Server machine so that paging isn't excessive when both virtual servers run on the same node. Following Microsoft's advice, AmyAdmin sets SQL Server's memory utilization to 48MB for each server. However, she soon discovers that, although the amount of paging is acceptable during failovers (i.e., VSQL1 and VSQL2 run on NodeA), SQL Server's performance suffers during normal operations (i.e., VSQL1 and VSQL2 run on different nodes).

A better alternative is to optimize for normal operations. For example, in AmyAdmin's case, you configure the memory for 96MB on each node. However, you need to make sure the paging file on each server is large enough to handle all the virtual memory that SQL Server might request.

Another possible solution is to create a series of scripts that run as automatic startup procedures under each virtual SQL Server node. The scripts can check how many virtual SQL Server machines are running on the nodes and reset the memory parameters for each SQL Server machine if one node is handling both SQL Server instances. You can use the xp_getnetname extended procedure to obtain the name of the underlying NT machine that the virtual SQL Server is running on.

In SQL Server 7.0, you don't have to worry about the memory setup. SQL Server 7.0 dynamically monitors and sets memory utilization. After a failover, both SQL Server instances simply reduce their respective working sets until paging stops.

Installing client utilities.

For example, let's say VSQL1 is running on NodeA of a cluster. On the E drive, you install SQL Server utilities (e.g., SQL Enterprise Manager) on a shared SCSI disk, DiskE, which only one node can access at a time. NodeB can't access the SQL Server utilities if NodeA has the E drive open.

Similarly, you might have problems accessing the SQL Server utilities on each node during failover if the nodes use different drive letters to refer to the array. The drive letters might not be available, especially in an active-active configuration. You can work around this problem by ensuring that the nodes use the same drive letters to refer to the array devices.

Setting the amount of RAM for TEMPDB.

Setting up disk resource dependencies.

cluster resource group

For example, suppose an active-active cluster's shared disk subsystem is an external RAID cabinet with four configured arrays. You assign these arrays to the D, E, F, and G drives. The D and E drives contain the database and log devices, respectively, for VSQL1; the F and G drives contain the database and log devices, respectively, for VSQL2. You install VSQL1's master device on the D drive and VSQL2's master device on the F drive. For the log devices on drives E and G to failover in conjunction with their affiliated database devices, you must manually add the E drive's physical disk resource to the VSQL1 cluster resource group and add the G drive's physical disk resource to the VSQL2 cluster resource group.

Configuring dependencies and devices.

In the example just given, you need to configure the dependencies correctly so that the cluster understands that the D and E drives must be online for VSQL1 to start. To configure the dependencies, you use the Cluster Administrator utility to reconfigure the MSSQLServer service resource and add the E drive as a dependency of the MSSQLServer service.

If you incorrectly configure devices, SQL Server can't operate. Consider the following two examples.

AmyAdmin creates a database device on the C drive, which isn't a shared drive. During a system failure, the virtual server attempts a failover to the other node, but the database device is unavailable and SQL Server can't start. What went wrong? AmyAdmin didn't store all the devices for the virtual server on a shared drive. If the virtual server can't manage the devices, it can't achieve failover.

AmyAdmin sets up an active-active configuration in which she affiliates the D drive with VSQL1 and the Q drive with VSQL2. AmyAdmin successfully creates a VSQL1 device, but when she tries to create another, she can't. What went wrong? When AmyAdmin created the first VSQL1 device, both virtual server instances were running on NodeA. Because NodeA owns both cluster resource groups, SQL Enterprise Manager could access the Q drive and Amy could create the device. When AmyAdmin tried to create another VSQL1 device, the two virtual server instances were running on separate nodes. As a result, SQL Enterprise Manager couldn't access the Q drive and AmyAdmin couldn't create the second device. You must store all devices for a virtual server on shared drives that are in the same cluster resource group as the affiliated SQL Server services.

Removing a virtual server.

Remove virtual server

Using super fast drives.

wait for delay "00:00:01"

Applying SQL Server service packs.

Send Us Your Tips

If you're getting ready to install your first SQL Server-based MSCS environment, we hope this material makes setting up your first SQL cluster a little easier. If you've already implemented SQL Server-based clusters and you have any tips for installing and managing SQL Server in a cluster, share them with us.