The news reported by the redoubtable Ross Smith IV in the Exchange team blog last week that “the number one reason why our Premier customers open Exchange 2010 critical situations is because Mailbox databases dismount due to running out of disk space on the transaction log LUN…” caused me to ponder whether this is an example of Microsoft being a victim of its own success. Let me explain why.
Microsoft has been on a crusade to improve the I/O profile of Exchange Server ever since the storage required for major Exchange 2003 deployments proved to be both expensive to procure and difficult to deploy. Essentially, in those days the Storage Area Network (SAN) was king because Exchange was a bit of a pig in terms of the I/Os that it consumed.
The engineering team did a lot of tweaking to the database to improve matters in Exchange 2007 and succeeded in reducing the required I/Os by some 70 percent, provided, of course, that your storage was configured in the same way as Microsoft’s and your users generated the same kind of robotic transactions that test suites do. Even if you were different, the changes were good and reduced Exchange’s I/O profile to a point where SANs started to become less of a critical success factor and more of a nice-to-have for deployments.
Exchange 2007 was all about tweaking. Exchange 2010 delivered a complete overhaul. In car terms it’s kind of like Exchange 2007 was the minor tune-up whereas Exchange 2010 bored out the engine and added a supercharger to the database. The resulting improvement allowed Microsoft to begin its campaign to convince everyone that low-cost storage is now a viable platform for Exchange databases and JBOD became the key phrase. Of course, you could continue to run Exchange on a SAN but it really wasn’t the modern thing to do . . . at least, not in the eyes of the marketing folks.
Getting the Exchange database to a point where it can support low-end disks is a very good thing. For one thing, it’s extremely unlikely that Microsoft could offer the Office 365 plans at their current price point if the hardware economics were not right. Storage is a big contributor to hardware costs, so it follows that driving down those costs makes more aggressive pricing possible. And every cent counts in a world where Office 365 has to compete against rivals ranging from free consumer-style email to the likes of Google Apps.
But there’s a downside to cheap storage. Low-cost disks are cheap because they are manufactured to certain price and performance specifications. Higher cost enterprise-class disks have to meet completely different specifications for measures such as Mean Time Between Failures (MTBF). Low-cost disks have a distinct tendency to fail more often than enterprise-class disks. This is acceptable if you have the operational maturity and systems in place to detect and address the failures and if your applications are capable of resisting disk failures. The advent of the Database Availability Group (DAG) in Exchange 2010 allows databases to carry on through failure events provided that there’s a healthy database copy available to take over when bad things happen. Hence the advice to have at least three copies of mailbox databases within a DAG to ensure best availability.
Coming back to where we started, another piece of advice that began to circulate when Microsoft started to pump out the wisdom that Exchange 2010 was able to run on low-cost storage was that it was now acceptable to position both the database and its transaction logs on the same physical disk or LUN. After all, if you had sufficient database copies deployed, you really didn’t need to worry too much about protecting transaction logs in the same way as we did in earlier versions of Exchange. And it seems that people have taken this piece of advice to heart and merrily deployed transaction logs on the same disk that holds the database and its content indexes instead of isolating the logs on their own disk, as was the practice with previous releases. At least, that’s one of the reasons that comes to mind for exhausting disk space.
Of course, yet another line spun by some Microsoft speakers around the introduction of Exchange 2010 might be even more of a contributing factor. The thought went that native data protection in Exchange 2010 (aka the DAG) means that you don’t need to run daily backups any more. Or, if you were very extreme on the topic, you never need to run backups again. This is complete tosh.
One of the big clean-up functions that Exchange full backups have done since the product first appeared fifteen years ago is log truncation, which means that when a successful full backup is performed, Exchange can delete all of the transaction logs that it no longer requires (because they have been captured on the backup media). Log truncation essentially frees up disk space that can then be consumed by the transaction logs generated before the next full backup is performed. Take away the backups without putting something else in place to truncate the logs and those logs simply accumulate until the disk fills and the database dismounts.
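To put some rough numbers on the problem, here is a back-of-the-envelope sketch in Python of how long a log volume lasts when nothing truncates the logs. The function name, the 100 GB of free space, and the 20,000 logs per day are all hypothetical figures chosen for illustration, and the default of 1 MB per log file is an assumption that you should check against your own deployment. It is simple arithmetic rather than a sizing tool.

```python
def days_until_log_volume_full(free_gb, logs_per_day, log_size_mb=1.0):
    """Estimate how many days a volume lasts if logs are never truncated.

    free_gb      -- free space on the log volume, in GB (hypothetical input)
    logs_per_day -- transaction logs generated per day across the databases
                    that write to this volume (hypothetical input)
    log_size_mb  -- size of one log file in MB (assumed to be 1 MB here)
    """
    if free_gb < 0 or logs_per_day <= 0 or log_size_mb <= 0:
        raise ValueError("free_gb must be >= 0; rates and sizes must be > 0")
    # Daily growth of the log set, converted from MB to GB.
    daily_growth_gb = logs_per_day * log_size_mb / 1024
    return free_gb / daily_growth_gb


if __name__ == "__main__":
    # Illustrative figures only: 100 GB free, 20,000 logs generated per day.
    days = days_until_log_volume_full(free_gb=100, logs_per_day=20000)
    print(f"Roughly {days:.1f} days before the volume fills")
```

With these made-up inputs the volume lasts a little over five days, which is not long if the only thing that ever truncated your logs was a backup job that someone has since switched off. Bear in mind that databases sharing a disk with their logs have even less headroom because the database and content index grow at the same time.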
Transaction logs are very important to Exchange. They capture complete transactions in a secure manner to ensure that the transactions can be replayed into a database should this ever be necessary and they provide the mechanism for replication between member nodes in a DAG to allow database copies to be maintained. For these reasons they deserve tender loving care from Exchange administrators.
Even though the magic of Microsoft might lead you to believe that everything works beautifully all the time in Exchange 2010, no matter what kind of disks you use or backup regime you employ, you still need to pay attention to how logs are generated and managed thereafter. So take the blog post by the Exchange team as a tremendously valuable pointer to something that should be reviewed in the context of your own deployment with the aim of improving matters if possible in the coming week. You know your logs will appreciate the effort.