A little while ago I was given the chance to test a brand new storage array that has come to market from Pure Storage. A lot of the time with tests like this, the vendor wants you to connect to their machines, play with some dummy data and see what sort of numbers you can get. That wasn't the case with Pure Storage. They offered me a unit to rack in my data center so that they could see the numbers the array was able to handle while running actual production workloads. Below are my findings from testing the Pure Storage array.
The environment the array was connected to plays a very important role in the testing that I did, as the array was connected to a pair of 4 Gig Cisco fiber channel switches. As you read through the numbers below you'll see some less than stellar numbers being reported by SQLIO. These numbers aren't an indication of a problem with the Pure array, but of the fact that we were pushing the ports on the 4 Gig switches to 100% while doing the testing. While the servers that were being used have 8 Gig FC ports, and the storage array has 8 Gig FC ports, the Cisco switches in the middle only have 4 Gig ports, so this presented a severe bottleneck which couldn't be overcome without upgrading the switches. This shows that even when the storage can handle really high end workloads, which you'll see as you continue to read below, if any component in the path can't handle the workload you'll have performance problems which you may end up blaming on the wrong part of the environment.
The first thing that I noticed with the unit was that some thought had actually been put into the packaging. Something which I thought was pretty cool, before we even got the unit racked, was that the unit included the correct sized screwdriver so that you didn't have to worry about stripping out the screws by using the wrong sized one.
Getting the unit racked was pretty easy. The unit that they sent over was a two shelf unit, with one shelf being the controller and the second shelf being the actual storage. According to the Pure Storage website they sent over an FA-310.
Once the system was up and running I started seeing some pretty decent numbers right away. The first thing that I did was to copy the backups onto one LUN on the array, then I did restores from that LUN into the SQL Server. The restores were running at about 180 Megs a second for the larger databases (according to the SQL Server).
You are not using all the space you think you are!
One thing which is really cool about the Pure Storage array is that it doesn't only store data, it also does pattern replacement, data deduplication and data compression, all inline. The biggest data set which I was able to get onto the array was some encrypted data, so this was about the worst case situation for the array. I was pretty shocked that it was not only able to do some data deduplication and compression, but that it was able to do a lot of it. I loaded up about 68 Gigs worth of mostly encrypted data onto the unit. The total data reduction that I was able to get was 2.4 to 1. The array showed a 5% size reduction due to pattern removal, a 16% size reduction via data deduplication and a 47% size reduction via data compression. The more data reduction the array can do the better. This means that you can store more data on the array, and you can get better performance from the array because there are fewer blocks to read from the array.
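For anyone wondering how 5%, 16% and 47% add up to 2.4 to 1, here is a minimal sketch of the arithmetic. It assumes that each stage shrinks whatever the previous stage left behind. That is my assumption about how the percentages combine, not something the array reports directly, but it does land on the same ratio.

```python
# Per-stage size reductions reported by the array (from the paragraph above).
stage_reductions = {
    "pattern removal": 0.05,
    "deduplication": 0.16,
    "compression": 0.47,
}

def overall_ratio(reductions):
    """Combined reduction ratio, ASSUMING each stage applies to the
    output of the stage before it (stages multiply, they don't add)."""
    remaining = 1.0
    for fraction in reductions:
        remaining *= (1.0 - fraction)
    return 1.0 / remaining

ratio = overall_ratio(stage_reductions.values())
logical_gb = 68  # amount of mostly encrypted data loaded onto the unit

print(f"Combined ratio: {ratio:.1f} to 1")
print(f"{logical_gb} Gigs logical -> about {logical_gb / ratio:.0f} Gigs on flash")
```

If the percentages were simply added together instead (68% total), the ratio would come out closer to 3.1 to 1, which doesn't match what the array showed.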
Some misleading numbers also showed up once the databases were restored. After I restored the database I grew out the transaction log to the size that it should have been. Because 99% of that file was all 0s at this point, the array showed a massive amount of data reduction. As time went on this number got a lot smaller, back down to a much more realistic number. This wasn't a failure of the array, just a side effect of the situation: the data on the disk was all 0s, and the array did exactly what it was supposed to do.
Numbers, Numbers, Numbers
So let's talk about some workload on the unit. To put these numbers into some perspective I've got 22 256 Gig flash drives in a single shelf which I'm connected to via four 4 Gig fiber channel cables. For all the tests that I'm doing I'm looking at 64k IO (as that's what SQL does most of the time). Now I didn't use SQL Server to push IO to the unit, because frankly every SQL Server workload I threw at the array didn't push the array hard enough to make it blink.
So looking at some raw SQLIO numbers using a 1000 Meg test file and 2 threads I was able to push ~6250 IOs per second to the array. And this was happening consistently every time I ran the test. There were some small deviations where I got numbers like 6245 and 6255, but everything was right around that same number. These same workload tests were giving me 390 Megs/second of throughput. These first tests were short, just 10 seconds, which simulates a bursty application. Making these numbers even more impressive is that these are sequential writes, not exactly the kind of thing that flash drives are known to be good at. Increasing the length of the test didn't do anything to hurt the system either. Moving up to 90 seconds the system still showed ~6250 IOs and 390 MBytes of throughput per second. Now if you've taken my Storage and Virtualization pre-con or class (via SSWUG) you'll know that the max amount of bandwidth that you can pump through a 4 Gig HBA at any one time is ~380 MBytes per second, so at this point I'm maxing out my 4 Gig HBAs.
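The IO and throughput numbers are really the same number looked at two ways, which is why they never move independently. As a rough sanity check, here is a small sketch that converts between them for 64k IO. The function names are just ones I made up for illustration, and it assumes a "Meg" here is 1024 KB.

```python
IO_SIZE_KB = 64     # IO size used for every SQLIO test in this post
KB_PER_MB = 1024    # assumption: a "Meg" here means 1024 KB

def throughput_mb_per_sec(iops, io_size_kb=IO_SIZE_KB):
    """Throughput implied by a given IO rate at a fixed IO size."""
    return iops * io_size_kb / KB_PER_MB

def iops_ceiling(link_mb_per_sec, io_size_kb=IO_SIZE_KB):
    """Most IOs per second a link of the given bandwidth can carry."""
    return link_mb_per_sec * KB_PER_MB / io_size_kb

print(throughput_mb_per_sec(6250))  # 390.625
print(iops_ceiling(380))            # 6080.0
```

So ~6250 IOs per second at 64k is ~390 MBytes per second, and a link that tops out around 380 to 390 MBytes per second can't carry much more than 6100 to 6250 of these IOs no matter how fast the array behind it is.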
Switching to a random workload with SQLIO didn't slow the array down at all. It's still handling the workload at ~6250 IOs and 390 MBytes per second. The latency histogram from SQLIO showed some pretty impressive numbers on top of this. No matter what test I ran with SQLIO the histogram looked basically like the one shown below.
Figure: SQLIO latency histogram
Making it harder
So now that we've pounded on the system with a file full of white space let's try again, but this time with a file full of data. For these next tests I took a database backup file which was compressed and full of encrypted data and used that file as the test file for SQLIO to run against.
Running my test against this database file which has actual data in it, the numbers look shockingly similar to what they were before: ~6250 IOs per second pushing about 390 MBytes per second to the disk. The latency histogram looked a little worse than before, with only 75% of the IO being responded to within 2ms and 14% of the IO being responded to within 3ms over the duration of the 90 second test window. Now because I like to make storage vendors feel a little pain, let's run the same test with 100 threads and see how the system responds. Normally for SQL Server to throw 100 IO threads at a storage solution you'll need a huge number of schedulers running, otherwise you won't be anywhere near this number of IO threads pounding the disks. With this massive number of threads the Pure Storage array was still able to handle the workload pretty well, with the same IO and throughput numbers as before. However the latency histogram tells us a much different story now. Where our requests before were being handled in 1-3ms, the latency has gone way up, with an average latency of 126ms and a maximum latency of 139ms. While these high latencies were being seen at the server running SQLIO, the storage array's metrics were reporting very minimal latency, typically in the 3-5ms range. After doing some digging into the metrics being reported by the array, the fiber channel switch and the test server, it became very clear very quickly that our bottleneck in this case wasn't the storage array at all, but instead the fiber channel switch, which only has 4 Gig FC ports in it.
Back into the real world
Now that we've seen that the overall environment has some limits on what it can handle, let's get back into the real world for a little while (all these tests were done using random workloads). 16 threads is a realistic number of write threads for the SQL Server to be working with. Running 16 threads the Pure Storage array was able to accept the IO and throughput that we have been seeing so far. And the latency, while a little higher than when running with just a couple of threads, is still well within the acceptable range, with 99% of the IO being handled between 17ms and 22ms.
Now that we've got the system under a little bit of stress with a realistic workload, let's see how it handles reads. Again throwing 16 threads at the system we see basically the same numbers as before: about 6250 IOs per second with throughput of around 390 MBytes per second and a nice grouping of latency between 18ms and 21ms.
Now real world servers can be a lot bigger than 16 threads, so let's ramp up a little bit and go to 64 write threads. Testing again, the IOs per second and throughput are exactly where they have been. The latency numbers are still looking pretty good. We now have an average latency of 81ms and a maximum latency of 94ms. Running the same 64 threads, but this time reading on those 64 threads, the latency numbers are right in line with the write numbers, with an average latency of 81ms and a maximum latency of 118ms. Again the switch itself has become the bottleneck here, which we can see in the screenshot below of the performance latency, which shows that the array was seeing latency of less than 5ms at all times.
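The way the host side latency climbs with the thread count while the IOs per second stay flat is what you'd expect from a saturated link: once the pipe is full, extra outstanding IOs just wait in line. A minimal sketch of that relationship (Little's Law) is below. The 8 outstanding IOs per thread is purely an assumption for illustration; it isn't a setting recorded in this post, it's just the value that happens to line up with the measured latencies.

```python
CAPPED_IOPS = 6250            # IO rate the 4 Gig path topped out at
OUTSTANDING_PER_THREAD = 8    # ASSUMED queue depth per thread, not from the test notes

def expected_latency_ms(threads,
                        outstanding_per_thread=OUTSTANDING_PER_THREAD,
                        iops=CAPPED_IOPS):
    """Little's Law: average latency = IOs in flight / completion rate."""
    in_flight = threads * outstanding_per_thread
    return in_flight / iops * 1000.0

for threads in (2, 16, 64, 100):
    print(f"{threads:>3} threads -> ~{expected_latency_ms(threads):.1f} ms average latency")
```

Under that assumption the model gives roughly 2.6ms, 20.5ms, 81.9ms and 128ms, which is close to the 1-3ms, 17-22ms, 81ms and 126ms seen at the host. The array's own 3-5ms never enters into it, which fits with the queueing happening in front of the switch rather than in the array.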
Reads and Writes Together
Obviously in the real world reads and writes need to happen at the same time. So for the final SQLIO test I ran 64 read threads and 64 write threads at the same time, for an hour. For this final set of tests using real world database files, all threads were using a random workload, a 10,000 MB data file and 64k IOs. This was done by running SQLIO twice at the same time, one instance doing writes and one doing reads.
The interesting thing here is that both the read and write threads showed ~6200 IOs and ~390 MBytes per second in each direction, which tells me that both HBAs in the server were in use. As for the latency, the reads were averaging 81ms with a maximum latency of 155ms and the writes were averaging 81ms with a maximum latency of 424ms.
I was able to improve these host side latency numbers by making a few changes to the machine that I was running the tests on. I started by changing the virtual machine's driver from the default LSI driver to the VMware Paravirtual driver. The second change that I made was to change the fiber channel paths for the RDM from "Most Recently Used" to "Round Robin". The third change that I made was to change the queue depth from the default of 32 to 128. This increased the IOs per second that were pushed to the array to ~6350 for reads and ~6350 for writes, which allows for ~400 MBytes per second for both reads and writes. The really impressive number is the host side latency, which dropped from a maximum of 424ms to 145ms, with the average dropping from 81ms to 79ms.
As Virtual Machine Storage
Another set of tests that I did was to connect the Pure Storage array to a set of VMware vSphere 5.0 hosts and do some Virtual Machine deployments. For this I had a single thinly provisioned template which I then deployed. As a part of the deployment the virtual machines were configured and rebooted. Each virtual machine took only 37 seconds (some took as long as 39 seconds) to deploy. The template was a 9 Gig template, so this was 9 Gigs of sequential reads along with 9 Gigs of sequential writes. In other words about the worst thing that you can have flash do, and the Pure Storage array was still able to do a fantastic job with it. I attribute a lot of this to the built-in data deduplication which was happening inline, so that the bulk of the 9 Gigs of data didn't actually need to be written to the storage array at all as it was duplicate data.
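To put the deployment times into the same units as the SQLIO numbers, here is a quick back-of-the-envelope sketch. It assumes the full 9 Gigs was actually read and written, and that a Gig is 1024 Megs. If the 37 seconds also covers the configuration and reboot then the real copy rate was higher than this, so treat the result as a floor rather than a measurement.

```python
TEMPLATE_GB = 9     # template size from the paragraph above
MB_PER_GB = 1024    # assumption: a "Gig" here means 1024 Megs

def effective_mb_per_sec(size_gb, seconds):
    """Average rate needed to move size_gb in the given wall-clock time."""
    return size_gb * MB_PER_GB / seconds

for seconds in (37, 39):
    rate = effective_mb_per_sec(TEMPLATE_GB, seconds)
    print(f"{seconds}s deploy: ~{rate:.0f} MBytes/sec read plus ~{rate:.0f} MBytes/sec written")
```

That works out to somewhere around 236 to 249 MBytes per second in each direction for a single deployment.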
If these tests prove anything, they show that in order to max out the performance of the Pure Storage array you'll need to push a LOT of IO at the unit very quickly, and have an 8 Gig fiber channel infrastructure in place to even think about maxing out the unit. Honestly, based on these numbers, even if I had access to an 8 Gig switch I don't think that I'd be able to max out the array. I hope that you found reading this as interesting as I did putting together these numbers.