Attaining Availability—Avoiding Failure

If you don't consider disaster planning and availability part of your network management strategy, consider Stratus Computer's findings from a recent survey of Fortune 1000 companies. In 1992 (the last year such research data was available), computer downtime cost US businesses more than $3.8 billion in lost revenue and worker productivity. This downtime equals an average hourly revenue loss of $78,000 and approximately 38 million worker hours annually, or $444 million in wages.

A sudden loss of a mission-critical server can be financially disastrous. In most companies, just the downtime before recovery can be too costly. Still not convinced? According to "Down But Not Out" (HP Professional, September 1994), "The average company loses two to three percent of its gross sales within 10 days after losing its data processing, and critical business functions cannot continue for more than 4.8 days without a recovery plan in progress. Half of the companies that do not restore their data center to operation within 10 business days never fully recover. Ninety-three percent of the companies lacking a recovery plan are out of business within five years of a major disaster."

Despite these claims, few companies plan ways to prevent or mitigate losses. To protect the bottom line, companies need to evaluate potential losses and implement an appropriate availability scheme for their network.
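To evaluate potential losses, a rough calculation is often enough to start the conversation. The sketch below is only an illustration: the function name, the 4-hour outage, and the 50 idle workers are assumed figures, and the hourly wage is simply the survey's $444 million divided by its 38 million worker hours.

```python
# Hypothetical helper (not from the survey): estimate the cost of one outage
# from lost revenue plus the wages of workers who sit idle.
def downtime_cost(hours_down, hourly_revenue_loss, idle_workers, hourly_wage):
    """Return the estimated cost of an outage in dollars."""
    lost_revenue = hours_down * hourly_revenue_loss
    lost_wages = hours_down * idle_workers * hourly_wage
    return lost_revenue + lost_wages

# Average wage implied by the survey figures quoted above.
implied_wage = 444_000_000 / 38_000_000  # roughly $11.68 per hour

# Assumed scenario: a 4-hour outage, the survey's average hourly revenue
# loss, and 50 workers who cannot do their jobs while the server is down.
estimate = downtime_cost(4, 78_000, 50, implied_wage)

print(f"Implied hourly wage: ${implied_wage:.2f}")
print(f"Estimated outage cost: ${estimate:,.0f}")
```

Substitute your own revenue and head-count figures; the point is to put a number next to the price of each availability option.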

A good starting point is to review the availability mechanisms that Windows NT Server supports. These mechanisms include data backup, uninterruptible power supplies (UPSs), and redundant systems. With an understanding of the options, companies can make informed decisions about implementing the appropriate levels of protection for their LAN and WAN and be better prepared for the next level of availability--ensuring server availability with server redundancy.

LAN Availability
Downtime can result from disasters such as fires, floods, power failures, and--let's face it--users. Users frequently (yet accidentally) delete critical files or stumble onto control-key combinations that can restructure databases and wreak havoc throughout a company. So when planning a network, you need to consider availability, backup, and disaster recovery.

Most network administrators implement availability by mirroring the primary system. This redundant system eliminates single points of failure. Fortunately, NT Server comes with support for tape backup, UPSs, and redundant systems.

Critical Data and Programs
Data backup is at the forefront of availability. The backup process copies important information onto magnetic tape or disks. Without backups, vital data, complex application and network configurations, customized setups, and user passwords and IDs are difficult and expensive--perhaps even impossible--to re-create. Backing up information is also important because of its changing nature. Compaq reports that as much as 40% of its company data changes every month.

To restore a system after a disaster, you need to back up all data and programs and determine whether certain users or groups have special backup needs. For example, an accounting group may require data backups beyond the regularly scheduled full-system backups. For information on NT-native backup programs, see Bob Chronister, "System and Enterprise-wide Backup Software," Windows NT Magazine, April 1996.

UPSs
Most systems improve OS performance by writing changes to RAM before writing them to disk (write-back caching). When a power interruption turns off or resets a computer, you can lose cached information and potentially corrupt data. Because the server processes most data on the network, any power fluctuations can adversely affect data flow to and from client workstations.
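A toy model makes the risk concrete. The class below is a minimal sketch of write-back caching, not how NT's cache manager is implemented; the class and method names are invented for illustration. Writes land in RAM first, and only a flush moves them to disk, so anything still in RAM when the power fails is gone.

```python
class WriteBackCache:
    """Toy write-back cache: writes land in RAM and reach disk on flush."""

    def __init__(self):
        self.disk = {}   # contents that survive a power loss
        self.dirty = {}  # changes held only in RAM until flushed

    def write(self, block, data):
        # Fast path: record the change in RAM only.
        self.dirty[block] = data

    def flush(self):
        # Copy every pending change to disk, then clear the RAM copy.
        self.disk.update(self.dirty)
        self.dirty.clear()

    def power_loss(self):
        """Simulate an outage and return the blocks whose changes were lost."""
        lost = sorted(self.dirty)
        self.dirty.clear()
        return lost


cache = WriteBackCache()
cache.write(1, "payroll")
cache.flush()               # block 1 is now safely on disk
cache.write(2, "orders")    # block 2 is still only in RAM
lost = cache.power_loss()

print("Lost blocks:", lost)
print("On disk:", cache.disk)
```

A UPS buys the time needed to run the equivalent of flush() and shut down in an orderly way.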

Most system administrators equip critical servers with UPSs in case of a power failure. But don't overlook key network connection points such as main servers and LAN/WAN peripherals (routers, bridges, hubs, and concentrators). Site-to-site and wide-area networks are susceptible at these points, so use UPSs to maintain data flow and processing stability among servers.

What about client workstations? In a peer-to-peer network, any workstation can be the server to any other workstation on the network. Peer-to-peer activity greatly increases the data flow on the network to each workstation and leaves every workstation susceptible to brownouts and blackouts. So, you need UPSs at client workstations. This way, if you lose power, you have time to save active files and do an orderly shutdown. For more information, see Larry McClain, "Roundup of UPS Products for Windows NT," Windows NT Magazine, November 1995.

Redundant Systems
With availability solutions, two are always better than one. Redundancy lets a system gracefully handle a failure in any component for which a duplicate is available. Sophisticated systems use the duplicate component to balance the processing load until a failure occurs. Then, the remaining component picks up the full load with a decrease in performance but little or no interruption in service. HP's research on server failures shows server downtime most often occurs from system hangs when the server or network OS freezes or stops running, from power failures to the server, and from hard drive and memory failures.

Disk redundancy solutions are disk mirroring, disk duplexing, and disk arrays. With mirroring, two disks (or two partitions on different drives) on one controller are copies of one another. A system write operation writes the data to both disks, so they are always synchronized. If the primary disk fails, no data is lost because the secondary disk has an exact copy of the data on the primary disk.
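The mirroring behavior just described can be sketched in a few lines. This is a simplified model with invented names (MirrorSet, fail), using dictionaries in place of real disks; it shows only the two rules that matter: every write goes to both disks, and a read can be served by whichever disk is still working.

```python
class MirrorSet:
    """Toy mirror set: two 'disks' that always receive the same writes."""

    def __init__(self):
        self.disks = [{}, {}]          # primary and secondary
        self.failed = [False, False]

    def write(self, block, data):
        # Every write goes to each working disk, keeping them synchronized.
        for index, disk in enumerate(self.disks):
            if not self.failed[index]:
                disk[block] = data

    def read(self, block):
        # Serve the read from the first disk that has not failed.
        for index, disk in enumerate(self.disks):
            if not self.failed[index]:
                return disk[block]
        raise OSError("both disks in the mirror set have failed")

    def fail(self, index):
        # Simulate a drive failure.
        self.failed[index] = True


mirror = MirrorSet()
mirror.write(7, "customer records")
mirror.fail(0)                 # the primary disk dies
survivor_copy = mirror.read(7)
print(survivor_copy)
```

Note that write() issues two writes for every request, which is the performance cost of mirroring on one controller.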

A caveat to mirroring is that two disks don't improve performance. Usually, performance worsens because the disk controller has to write every operation twice.

To solve this problem you can use disk duplexing--disk mirroring with another adapter running the secondary drive. Duplexing provides protection for both disk and controller failure and improves disk I/O performance over mirroring. Duplexing doesn't adversely affect performance because both disk controllers perform write operations simultaneously. And using two controllers removes a potential single point of failure within a system.

The third form of disk redundancy is RAID. A disk array is a group of disk drives, and each drive stores information in parallel with the others. Redundancy relies on parity, a mathematical calculation that lets the disk array reconstruct any corrupt or missing information if one disk fails.
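The parity calculation is commonly a bytewise exclusive OR (XOR) across the data blocks; the article does not name the operation, so treat XOR here as an assumption about a typical array. The sketch below, with made-up block contents, computes parity for three data blocks and then rebuilds one block from the survivors.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per drive (contents are illustrative only).
data = [b"ACCT", b"PAYR", b"INVT"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data)

# Suppose the drive holding data[1] fails. XORing the surviving data
# blocks with the parity block yields the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])

print(rebuilt)
```

The same calculation works no matter which single block is missing, which is why an array with parity survives the loss of one drive but not two.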

The six levels of RAID are RAID 0 through RAID 5. Each level offers various mixes of performance, reliability, and cost.

RAID 0: Disk striping (a disk array that implements striping without any drive redundancy)

RAID 1: Disk mirroring or duplexing (two drives storing identical information, mirroring each other)

RAID 2: Redundancy through Hamming code (extra check disks that detect and correct single-bit errors and detect double-bit errors)

RAID 3: Striped array plus parity (one redundant check disk for each group of drives)

RAID 4: Independent striped array plus parity (a disk array architecture optimized for transaction processing applications)

RAID 5: Independent striped array with distributed parity (storing check data on the equivalent of one disk, but distributing that check data over a group of drives)

NT supports only RAID 0, 1, and 5. Although you can mix and match RAID 0, 1, and 5 across the disks in a system under NT Server, consider only RAID 1 or RAID 5 or a combination. RAID 0 doesn't provide data redundancy.
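The three levels trade capacity for protection differently. The function below is a rough sketch using the standard capacity arithmetic for each level and assuming equal-size drives; the function name and the 4GB drive size are illustrative, not figures from NT's documentation.

```python
def usable_capacity(level, disks, disk_size):
    """Return usable capacity for RAID 0, 1, or 5 built from equal-size disks."""
    if level == 0:
        if disks < 2:
            raise ValueError("striping needs at least two disks")
        return disks * disk_size          # no redundancy, no overhead
    if level == 1:
        if disks != 2:
            raise ValueError("a mirror set uses exactly two disks")
        return disk_size                  # half the raw space holds the copy
    if level == 5:
        if disks < 3:
            raise ValueError("striping with parity needs at least three disks")
        return (disks - 1) * disk_size    # one disk's worth holds parity
    raise ValueError("only RAID 0, 1, and 5 are modeled here")


# Illustrative comparison with 4GB drives.
print("RAID 0, 4 disks:", usable_capacity(0, 4, 4), "GB")
print("RAID 1, 2 disks:", usable_capacity(1, 2, 4), "GB")
print("RAID 5, 4 disks:", usable_capacity(5, 4, 4), "GB")
```

RAID 5 gives up less space than mirroring as the array grows, which is one reason to consider it for larger data volumes.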

With RAID 5, if one drive fails, the array continues to function. The system reconstructs the missing or damaged information with the parity information on the other disks in the array. In RAID 5, the controllers write data one segment at a time and interleave parity bits among the assigned disks. Table 1 on page 72 lists RAID 5 characteristics. For a glossary of RAID-related terms, see the sidebar, "RAID Tech Talk," above.
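To picture how parity is interleaved among the disks, the sketch below lays out a small array stripe by stripe. The rotation rule (parity moves one disk to the right on each stripe) is an assumption chosen for simplicity; real controllers and NT's own stripe sets with parity may rotate parity in a different order.

```python
def raid5_layout(disks, stripes):
    """Return one row per stripe, marking parity ('P') and data ('D0', 'D1', ...)."""
    table = []
    block = 0
    for stripe in range(stripes):
        parity_disk = stripe % disks      # assumed rotation scheme
        row = []
        for disk in range(disks):
            if disk == parity_disk:
                row.append("P")
            else:
                row.append(f"D{block}")
                block += 1
        table.append(row)
    return table


layout = raid5_layout(disks=3, stripes=3)
for row in layout:
    print(" ".join(f"{cell:>3}" for cell in row))
```

Because no single disk holds all the parity, parity updates are spread across the array instead of queuing up on one check disk as they do in RAID 3 and RAID 4.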

Availability with NT
In addition to data backup and UPS devices, RAID needs to play an important part in your availability scheme. With NT Server, you can implement RAID in either hardware or software. So your decision comes down to either buying the solution from a RAID hardware vendor or building it using NT's software features.

With RAID in hardware, the disk controller creates and regenerates the redundant information in one of two ways: the host bus-based system or the SCSI-SCSI system. In a host bus-based system, the disk controller contains a CPU and firmware for calculating parity and striping data. Most host bus solutions are on EISA and PCI bus systems. The faster the host bus, the faster the RAID subsystem.

The SCSI-SCSI RAID subsystem alternative consists of an external drive chassis and a device similar to a host-bus adapter. This external chassis connects to the host system via a standard SCSI cable and appears to the system as one or more SCSI devices. The SCSI-SCSI RAID subsystem doesn't require a device driver on the host, and you can use the subsystem on any system with a SCSI bus.

NT Server's software capabilities let you mirror the stripe on one controller to a second. Mirroring across controllers removes the controller as a single point of failure. Two disadvantages of a purely software-based RAID implementation are performance and reliability.

Performance: With RAID software, the system CPU performs extra work such as calculating parity in RAID 5. With hardware-based RAID, the controller calculates parity data and duplicates disk writes, freeing the system CPUs to handle the usual processing tasks.

Reliability: Protecting a drive that contains the OS from drive failure is difficult in software-based RAID because the OS must boot before the protection is available. In contrast, the hardware-based RAID subsystem protects data as the system boots. If a drive with the OS fails, the controller reconstructs the OS at boot time.

Combining software- and hardware-based RAID under NT Server provides the best of both worlds. If a hardware-based RAID controller fails, the system is down until you replace the controller. If you install two controllers, you can create a RAID 0 stripe on each controller.

RAID Solution with NT Server
To set up RAID, you need NT Disk Administrator, a graphical utility that manages disk resources, including drive partitioning, volume creation and deletion, and software RAID configuration. You can make the disk subsystem more redundant with multiple disk controllers. NT Setup lets you incorporate new disk controller drivers. You use this utility before you configure drives with Disk Administrator. With SCSI drives, NT can isolate and avoid bad disk sectors. In a redundant configuration, NT can recover the data that a bad sector held from the redundant copy and write the information to a good sector.

Increasing Server Availability
Disk mirroring, duplexing, and RAID protect your system from disk failure only. So what happens if a CPU dies? To increase availability, the next step is server redundancy at the OS level. To meet this need, Microsoft teamed with HP, Digital Equipment, Compaq, Tandem, NCR, and Intel to implement a technique called clustering.

Clustering refers to a set of loosely coupled, independent computer systems that work together and behave as one system. Clusters offer high availability through component redundancy, so when a component or server fails, the cluster continues to provide service. Digital Equipment pioneered clustering in the mid-1980s on VMS.

A cluster of NT servers provides common, highly available services to PC and workstation clients. You manage an NT cluster as one secure entity. You can easily add incremental processing, I/O, and storage capacity to the cluster domain. With file services, clients access remote directories in the cluster through the File Manager, the same way they access any directory in Windows 3.x or NT. The location of the directory server is transparent to the user.

A cluster can be a simple set of standard desktop personal computers connected with Ethernet, or a sophisticated hardware structure with high-performance symmetrical multiprocessing (SMP) systems interconnected with a high-performance I/O bus. You can add systems to the cluster as needed to process more requests or to handle more-complex client requests. And if one component in a cluster fails, the system can automatically distribute the workload of the failed component among the surviving components.
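The idea of spreading a failed component's workload among the survivors can be sketched as follows. This is not how Wolfpack or Digital's product decides where work goes; the function, the node names, and the "least-loaded node first" rule are all assumptions made for illustration.

```python
def redistribute(assignments, failed):
    """Return new node-to-services assignments after one node fails."""
    # Copy the surviving nodes so the original mapping is left untouched.
    survivors = {
        node: list(services)
        for node, services in assignments.items()
        if node != failed
    }
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the workload")

    # Hand each orphaned service to the currently least-loaded survivor
    # (ties broken by node name so the result is predictable).
    for service in assignments.get(failed, []):
        target = min(survivors, key=lambda node: (len(survivors[node]), node))
        survivors[target].append(service)
    return survivors


cluster = {
    "node-a": ["file", "print"],
    "node-b": ["sql"],
    "node-c": ["mail"],
}
after_failure = redistribute(cluster, "node-a")
print(after_failure)
```

Clients keep asking the cluster for the file and print services; only the node answering them changes.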

Microsoft plans to deliver clustering, under the code name Wolfpack, in two phases. For a comprehensive analysis of cluster technology and Wolfpack in particular, see Mark Smith's "Closing In on Clusters," page 51.

If you can't wait for Microsoft's clustering solution, check out Digital Equipment's Clusters for Windows NT Server (see Joel Sloss, "Digital Clusters for Windows NT," page 63). The product doesn't require hot and cold standby, proprietary hardware, interconnects, or special versions of NT, and Digital Equipment has optimized it for client/server computing.

Be Prepared
Count on this: At some point your network will go down. Whether your downtime results from the wrath of nature or user error, you will lose data, servers will crash, routers and bridges will fail, and communication lines will fall. Although you can't make an NT Server or network fail-proof, you can make it failure resistant. So plan for disaster, and defend against it. Your best defense is a solid offense--availability techniques. Keep in mind that availability is a means to an end and not an end in itself. You need to develop plans and procedures for recovering from failures before you have one.

Please see the sidebar "RAID Tech Talk".

Digital Clusters for Windows NT Server

Digital Equipment * 800-344-4825
Web: http://www.windowsnt.digital.com/clusters/default.htm
Price: $995 per server

RAID Vendors

Mylex * 800-776-9539
Web: http://www.mylex.com

Micropolis * 800-395-3748 or 818-709-3325, option 4, for fax-back information
Web: http://www.micropolis.com/How_To_Buy.html (to find your local sales office)

Seagate Technology * 408-438-6550
Web: http://www.seagate.com
