Massive Speed and High Availability

Question: William asks, “I have a SQL Server which needs both high availability (HA) and high-speed storage. Even SAN storage with SSD-based hard drives isn’t providing the performance levels that we are looking for, while still getting the failover cluster HA solution that we need for our SQL Server 2008 R2 database. Any suggestions?”

Answer: One of the things which people are always trying to figure out is how to get high-speed disks and HA all in a single package. Normally these are two mutually exclusive things. If you want massive speed then you need to look into Fusion ioDrive so that you can get the lowest amount of latency between the storage and the CPU as possible (nothing is really going to be faster than pushing data from the CPU down to the PCI slot and having the storage be sitting right there). The high availability story for SQL using Fusion-io was basically non-existent due to Microsoft’s requirements for shared storage (yes, I’m ignoring AlwaysOn Availability Groups on purpose).

Recently, I had the pleasure of putting a solution through its paces. That solution is called DataKeeper Cluster Edition from SIOS Technology. DataKeeper Cluster Edition is a block level replication software that is designed to replicate data changes between servers, turning local disk into cluster-able storage. This gives you the massive speed performance that comes with Fusion ioDrive and the high availability that you’ve come to expect from traditional failover clusters.

Some Numbers

Starting with the reads, we can see that the disks were peaking at just over 12,000 IOPS while in a highly available configuration (which, by the way, is supported for any edition and version of SQL Server that supports failover Clustering). The average performance is much lower than the 12,000 IOPS because the workload is not the same as a constant physical IO workload. Some of the requests would have been serviced from the buffer pool, which would reduce the number of IOs. Therefore, being able to push 12,000 IOPS of reads with this workload is amazing.

During this same time period we can see that the writes on the disks weren’t idle. The SQL Server was pushing a peak of over 5,000 IOPS. This wasn’t a single burst of write IO either. Looking at Figure 2 below we can see that the average IO for the test run is over 1800 IOPs and the test run time period was over a two hour period.

Figure 2: Write IOPs during test run

Looking at the entire test, not just the two hour window, which was a 13.5 hour test, we can see that the Fusion ioDrive was being pushed even harder earlier in the test cycle to over 6,000 IOPs per second.

Figure 3: IOPs during the full test run

Now looking at just the number of IOPS isn’t the total story. Looking at the response time of the disks is critical to see just how well the storage is working. The response time during the entire test run tells a pretty solid story. The maximum response time for reads during the entire 11-hour test run was 0.007 seconds or 7 milliseconds while the maximum response time for writes during the test run was 0.001 seconds or 1 millisecond. The performance monitor data shown in Figure 4 shows the response time graph data with the reads shown in red and the writes shown in green. The scale has been reset to show the peaks and valleys of the response times.

While the reads looks very spikey those spikes are only peeking out at 7 ms with the bottom of the spikes being in the less than 1ms range. This can be confirmed by looking at the green line of the writes which maxes out at 1ms at its peak.

Figure 4: Response Time of Reads and Writes

Looking at disk numbers means that we need to take the buffer pool into account as well. The more data that lives in the buffer pool, the more data that we can process through the system and the less IO stress we put onto the disks. In the case of this test system during our test we can see that the Page Life Expectancy dropped down as low as 7 seconds which means that we were putting a good deal of stress on the disks at that point which can be seen in the PLE graph shown in Figure 5.

Figure 5: Page Life Expectancy

Comparing these numbers to the other published data on Fusion-io, everything looks right in line. But when looking through these numbers and graphs we have to remember that with this test system we have two servers, both with Fusion ioDrives running in a traditional active-passive cluster using SIOS DataKeeper software, giving our system the same level of high availability that we would have from a more traditional SAN-based solution, but with the high-speed performance of the Fusion ioDrives.

This configuration, in this case a single 3U unit, which contains two physical servers each with the own memory, CPU, Fusion ioDrive, DataKeeper software, etc., allows for true high availability. With this configuration, the application that uses this database can maintain the levels of uptime that are required for mission critical line of business applications while still providing the level of performance required by the application.

Comments

Plain text