|Executive Summary: SteelEye's clustering and high availability solution for SQL Server is worth a look if you don't need highly granular data rewind capabilities.|
LifeKeeper Protection Suite for SQL Server
Flexibility, supported by an extensive feature set, is a hallmark of LifeKeeper Protection Suite for SQL Server from SteelEye Technology. The solution provides high availability and protects the data in your SQL Server-based applications by letting you create failover clusters comprising local and remote standby servers. I tested version 6.1.2, which combines SteelEye Data Replication (SDR) volume replication with the LifeKeeper for Windows high availability feature set, with added support for SQL Server. Installation of the two products is integrated with a single setup routine, although the two products are separate services and have separate documentation and management interfaces.
Key features include block-oriented synchronous or asynchronous volume replication, automatic and manual failover modes supporting shared and replicated storage on both physical and virtual servers, and a new continuous data protection (CDP) function within the recovery feature set. LPS-SQL supports many standard storage types, including iSCSI, Fibre Channel SAN, shared SCSI and Windows volumes with Volume Shadow Copy Service (VSS) snapshots. Windows fault tolerant disk sets are an exception, and unsupported. LPS-SQL supports x86 and x64 (but not IA64) versions of Windows Server 2003 and Windows 2000 Server and protects SQL Server 2005 and SQL Server 2000.
How LPS-SQL Works
LifeKeeper core components include a configuration database, a communications manager, an alarm interface used to trigger events, and a control interface to locate the correct scripts used for recovery actions. LPS-SQL stores information about the protected resources you define in a configuration database. The communications manager coordinates inter-server communications. Internally, LifeKeeper uses scripts to perform recovery actions, as managed by the control interface.
A LifeKeeper cluster consists of two or more interconnected servers. The same version of SQL Server must run on all servers within a cluster. Server hardware configurations need not be the same, as long as servers in the cluster have the capacity to handle the failover load. Although a cluster might include servers local to or remote from the primary application server, configuring failover to a remote SQL Server instance currently involves a minor workaround and the use of a protected DNS resource. In this scenario, users and applications access the SQL Server instance by using a DNS host name that’s added automatically. At failover, LPS-SQL communicates with an authoritative DNS server, altering the IP address associated with the host name to the IP address of the standby server that's taking over the SQL Server processing. Because users will continue to access the SQL Server server at its old IP address until the DNS host name’s TTL expires, it might take significantly longer for users to begin to access the new server. In my tests, it took about five minutes. SteelEye plans full support for this configuration in the next release of LPS-SQL.
LifeKeeper requires at least two communication paths between the servers—one or more for LifeKeeper heartbeat communications used to sense whether a server is working, and one or more for normal server communications. SteelEye recommends a private IP network connection for the primary heartbeat path, and also supports shared disk and RS-232 serial connections.
The LifeKeeper GUI runs as a Web client under one of several supported browsers with Java installed. LifeKeeper also installs an executable form of the GUI client, which Web Figure 1 shows, for local administration of a LifeKeeper cluster. Both clients look and work the same. LifeKeeper supports three access security levels for the GUI: Administrator, Operator, and Guest. In addition to configuring protection, the LifeKeeper GUI is also the interface you use to perform manual failover, manual fail-back after an automatic or manual failover, and data rewind. Using the LifeKeeper GUI, you create a hierarchy of protected resources that LifeKeeper will fail over to a standby server. Volumes, IP addresses, DNS host names, and SQL Server instances are examples of protected resources. When protecting multiple instances of SQL Server, you define a hierarchy for each instance of SQL Server and select the databases within that instance that you wish to protect. Each server within a cluster can simultaneously be the active server for one or more SQL server instances and a standby server for other instances.
Hands On with LPS-SQL
You begin by installing LPS-SQL on each SQL Server serverin your failover cluster. I installed LPS-SQL on two Windows Server 2003 systems hosting SQL Server 2005. LPS-SQL requires that SQL Server databases—including system databases—be located on a volume that it can fail over to a standby server. LPS-SQL supports databases on either a replicated volume or a volume shared via shared SCSI channel or SAN. For a shared volume, LifeKeeper insures that only the active server will write to it. For replicated volumes, LifeKeeper insures that only the replication engine has access to the replicated volume. I chose the replicated volume option and configured both servers with a volume dedicated to replicated data. Each system had two Ethernet cards, allowing redundant heartbeat paths. I connected one NIC to a network I used only for heartbeat traffic and the other to my public network.
Configuring LPS-SQL was the next step, a process described both in the Evaluation Guide, and in an Administrative Quickstart Guide that's part of the LPS-SQL online Help system. Working with the executable version of the GUI on the system I had designated my primary SQL server server,logged in with an administrative User ID (UID) and configured heartbeat communication paths between my primary and standby servers on both networks.
Next, I configured protected resources in support of SQL Server failover. SQL Server failover in LPS-SQL is designed around a switchable IP address--an IP address that LPS-SQL reassigns to the standby server when failover occurs and that users and applications use to access the protected instance of SQL Server. This requires that primary and standby servers be located on the same logical network segment. When LPS-SQL creates a protected resource--an IP address in this case--it first creates it on the server you're logged into using the administrative interface. Then LPS-SQL “extends” the protected resource to other standby servers, thereby creating the resource on the standby server in a state ready to accept failover. When I created the IP address resource, LPS-SQL added the IP address to the IP configuration of my primary server, and in the extend step prepared the way for the IP address to fail over. I also created a DNS host name for the IP address, which I used in my testing.
Protecting the SQL Server instance was next. LPS-SQL requires that SQL Server operate in Mixed mode (authentication), and it uses a SQL Server UID to authenticate. Both my primary and standby servers had default configurations of SQL Server, with system and application databases on the server system volume.
When creating the protected resource on the primary server, LPS-SQL saw that the databases were on the system volume. Since system volumes can’t be part of a failover hierarchy, the wizard relocated them to a volume I designated. This is a great new feature in LPS-SQL. Without LPS-SQL’s ability to automatically migrate SQL Server’s system databases to a volume eligible to be a protected resource, you’d have to do this manually, which isn't a trivial task.
When extending the protected SQL Server resource to the standby server, LPS-SQL asked whether the volume was shared or needed to be replicated. I specified replicated, and LPS-SQL configured SDR for the volume and altered the SQL Server configuration on the target standby server for the new system database locations on the replicated volume.
The default configuration for LPS-SQL protects only the master database on the primary server, and the Continuous Data Protection (CDP) feature is turned off. This means that out of the box, LPS-SQL is configured to protect only the system databases, so you need to tell LPS-SQL which other databases you want to protect. This is a much better default than a “protect all,” especially when large databases might be present in the instance. Using the Web and Java-based administrative interface from a Windows XP system, I enabled protection for several of the server’s databases.
I also configured an application running on another system to use the DNS host name I had defined for the switchable IP address assigned to the protected SQL server server, which provided a stream of data for testing. I performed a manual failover, then a manual failback. Next, I turned off the active server, forcing a failover. In each case, the failover completed quickly and in a minute or so my application was again accessing the database at its new location.
During my initial configuration of LPS-SQL, I also enabled Rewind—LPS-SQL’s CDP feature. Since LPS-SQL monitors protected data at the volume level, you configure Rewind at the volume level, by providing a location on a non-replicated volume for the Rewind log file, and by enabling Rewind on each active and standby server that contains a replicated instance of the volume resource on which you might want to perform a rewind. Note that because LPS-SQL monitors and logs I/O activity at the level of a volume rather than a database, LPS-SQL isn’t aware of SQL Server transactions and doesn’t make use of SQL Server transaction logs when rewinding data.
To return data to an earlier state, you perform the data rewind on a standby server. LPS-SQL pauses mirroring to the standby server, and offers the option to pause application activity on the active server. When you’ve rewound the data to the desired state, you can either manually restore the affected data to the active SQL Server server or have LPS-SQL switch over to the standby server that now holds the valid data, making it the active server.
I tested the data-rewind feature by generating application activity over about half an hour, then restored the volume to an earlier state using the Rewind wizard within the LPS-SQL console. LPS-SQL provided me with about two dozen time-stamped rewind points for the interval.
The Rewind wizard doesn’t display all available rewind points. Because LPS-SQL records every write in the rewind log, the number of potential rewind points can be quite large. Instead, LPS-SQL allows you to manually enter a specific date and time you wish to rewind to; and to create a bookmark within the rewind log for specific points in time that simplify selection of that rewind point. I think I would prefer an option to filter and search through all the timestamps in the rewind log.
I selected an early point and allowed the recovery to finish. To check the data, I needed to start the SQL Server service on the standby server. (As I stated earlier, LPS-SQL pauses replication and mirroring to the standby server where you are performing the rewind.) I then was able to examine the data using SQL Server Management Studio.
To test the iterative rewind option, I decided to move the recovery point forward in time—LPS-SQL lets you move forward and backward within recovery points until you are satisfied with the state of the data. When the second rewind completed, I declared the data good, and took the LPS-SQL option to switch application processing over to the standby server. The switchover was completed in about a minute.
How Granular Are Your Data-Rewind Needs?
LifeKeeper’s support for failover to remote systems and its ability to incorporate a variety of hardware configurations into the same cluster makes it a very flexible product to work with. Overall, the failover and rewind operations worked well, and I liked the ease of configuration and the ease with which the data-rewind feature worked.
The documentation is well organized, though at present it doesn’t integrate the data replication and high availability components as well as it might. It also doesn’t always provide the depth of explanation you might want.
If you're looking for a product that presents very granular data-rewind points, perhaps LPS-SQL might not appeal to you. However, if you're looking for an easy-to-implement, high availability product with a decent—not great—data-rewind feature, I suggest you bring LPS-SQL in for a test.