Danger Averted: A DBA Horror Story

Most technical articles explore new features of existing products or identify a problem and provide a solution.

This is not one of those articles.

Instead I want to tell you a story: a story of a situation that could have taken down an entire company and made the news cycle for a few days. In the end it’s also a story about understanding. Understanding the technologies you’re responsible for implementing; the requirements of your customers’ solutions; the value of training and when that training should take place.

Like many of the stories from our childhoods, it should start with those magical words that take you to another place and time:

Once Upon A Time I was working with a client who was in the process of migrating to Amazon Web Services (AWS). They had built out about a dozen EC2 boxes (EC2 stands for Elastic Compute Cloud) and had turned these over to me to pair into clusters, install Microsoft SQL Server, and overlay an Availability Group topology for high availability. This environment was to host financial and credit data for their entire company. It was critical that this data be secured not just from security intrusion but also from any risk or harm to its very existence.

That last line may sound a bit dramatic, but as you’ll soon see, losing that data was a likely outcome of how they had built their AWS virtual machines.

EC2 VMs can be spun up quickly and come with a single disk volume to start. If you’re reading this article then you are already aware that is not a viable option for a mission-critical (nay, company-critical) database server. The client was smart enough to understand this as well and had at least provisioned a secondary volume for each EC2 instance. This is where they wanted all the SQL Server database files – system and user – all log files, and all backups stored. Ultimately they’d add S3 and Glacier storage to the solution, which would allow for redundant off-server storage for their backups and archives, but for now it was just the EC2 instances hosting their future data and backups. While I tend to keep data, logs, system databases, tempdb, and backups separated – and none hosted on the same volume as the operating system and pagefile – this was the world I had to work within.

I quickly went to work installing Microsoft SQL Server and building the clustering and Availability Groups, using cleansed copies of their data as placeholders to get the AG solution built. Failover testing looked good and we were flagged as a GO for migration in three days.

There was a minor change that needed to be made to one of the SQL EC2 instances, a server also provisioned in AWS that hosted various monitoring databases for third-party tools. The server engineers needed to clone the EC2 server, which meant shutting down the EC2 instance in the AWS dashboard. As a precaution I took backups of all the databases and copied them to another SQL EC2 instance temporarily, then stopped the SQL Server service and set it to disabled. When they brought the server back up, SQL Server wouldn’t start for me. Odd, yes, but I’ve seen this before, and you don’t last long as a DBA if you don’t handle adversity in a calm and rational manner. Looking at the event logs I came to a conclusion I had not run across in my 20 years in data. Opening Windows Explorer confirmed what I was afraid of…

There was only one drive on the EC2 instance and it wasn’t the one hosting all the SQL data, logs, and backups.

You may start to wonder why shutting down a VM would cause an attached volume to disappear – not to mention a volume you were paying good money to have around for your most critical data. For that you need to know what the possible offerings are for storage in AWS.

AWS and Azure alike have multiple options for storage, and this story could have been about either cloud platform; since it actually happened on AWS, though, I’ll stick to that part of the truth of the matter. In AWS, the storage options relevant to an EC2-hosted database server include Elastic Block Store (EBS), persistent block storage that attaches directly to an EC2 instance and is popular and sufficient for relational and NoSQL databases, data warehousing, enterprise applications, Big Data processing, and backup and recovery. Another option is Amazon Simple Storage Service (S3), which provides “buckets” accessed via URLs that can be reached from any Internet location, provided security is properly provisioned for those attempting access. There is also Amazon Glacier, popular as an archive/repository file system that carries a 3-5 hour retrieval penalty but is extremely cheap. Of these, EBS is the only option for direct-attached database storage. There is, however, one more type of storage available for direct attachment to EC2 instances that isn’t listed on the AWS storage offerings page: Amazon EC2 Instance Store.

Amazon EC2 Instance Store is best explained by the official documentation introduction:

“An instance store provides temporary block-level storage for your instance. This storage is located on disks that are physically attached to the host computer. Instance store is ideal for temporary storage of information that changes frequently, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of web servers.”

The word “temporary” should send chills down your spine. When I found out that the attached volume for the monitoring server had disappeared on shutdown, my greatest fear was someone shutting down any of the SQL Server instances we had spent considerable time getting to the point of launch. Hours and weeks of work. The only saving grace was the fact that the company wasn’t yet running their entire data platform for finance on these servers.
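
Had we known to look, a five-minute check would have shown exactly which drives were at risk. A minimal sketch using the AWS Tools for PowerShell (the instance ID and region below are placeholders) lists the EBS volumes attached to an instance alongside the disks Windows sees; any disk with no matching EBS volume is ephemeral instance store:

```powershell
# Requires the AWS Tools for PowerShell and configured credentials.
# Instance ID and region are placeholders for illustration.
Import-Module AWS.Tools.EC2

$instanceId = 'i-0123456789abcdef0'

# Every EBS volume attached to the instance, with its device name and size.
Get-EC2Volume -Region 'us-east-1' `
    -Filter @{ Name = 'attachment.instance-id'; Values = $instanceId } |
    Select-Object VolumeId, Size, VolumeType,
        @{ Name = 'Device'; Expression = { $_.Attachments[0].Device } }

# Run on the instance itself: the disks Windows actually sees.
# A disk here with no matching EBS volume above is instance store and will
# not survive a stop/start of the EC2 instance.
Get-Disk | Select-Object Number, FriendlyName, SerialNumber, Size
```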

Now the question is: what did I do to fix the issue? The solution was quite simple, considering the entire volume was dedicated to just the database files and backups:

Step One was to provision the proper type of storage: EBS. This was performed by the customer’s server engineering team. Once that was done, I formatted the new volume and gave it a drive letter of T. (The existing drive we needed to swap out was the D drive.)
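
For the formatting piece, something along these lines works from an elevated PowerShell session. It is only a sketch: it assumes the new EBS volume is the only uninitialized (RAW) disk on the server, and it uses the 64 KB allocation unit size commonly recommended for SQL Server volumes.

```powershell
# Bring the new EBS volume online as T:.
# Assumes the freshly attached volume is the only RAW disk on the server.
$newDisk = Get-Disk | Where-Object PartitionStyle -eq 'RAW'

# Bring the disk online if Windows attached it in an offline state.
if ($newDisk.IsOffline) {
    Set-Disk -Number $newDisk.Number -IsOffline $false
}

Initialize-Disk -Number $newDisk.Number -PartitionStyle GPT

# The volume label is illustrative.
New-Partition -DiskNumber $newDisk.Number -UseMaximumSize -DriveLetter T |
    Format-Volume -FileSystem NTFS -AllocationUnitSize 65536 `
        -NewFileSystemLabel 'SQLData' -Confirm:$false
```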

Step Two involved ensuring the cluster underlying the Availability Group was stopped. This was a two-step process of stopping the cluster role associated with the AG and then stopping the cluster itself inside Failover Cluster Manager. We were doing this work simultaneously on both replicas in each AG, so keeping the cluster up was not necessary and could actually have led to issues.
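
The same two steps can be scripted with the FailoverClusters module if you prefer that to the GUI; the role name below is a placeholder for whatever the AG’s clustered role is actually called.

```powershell
Import-Module FailoverClusters

# Stop the clustered role that owns the Availability Group.
# 'AG-Finance' is a placeholder for the actual role name.
Stop-ClusterGroup -Name 'AG-Finance'

# Then stop the Cluster service across all nodes of the local cluster.
Stop-Cluster -Force
```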

Step Three shifted me from the cluster and Windows to SQL Server. It was now time to stop the SQL services and set them to disabled. Why disabled? Simply as insurance against anything unruly happening when we restarted the EC2 instance later on.
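
Scripted, that step looks roughly like this. The service names assume a default instance; a named instance would use MSSQL$InstanceName and SQLAgent$InstanceName instead.

```powershell
# Stop the SQL Server Agent and engine services and keep them from starting
# on their own until we explicitly bring them back.
$sqlServices = 'SQLSERVERAGENT', 'MSSQLSERVER'   # default-instance names

foreach ($service in $sqlServices) {
    Stop-Service -Name $service -Force
    Set-Service -Name $service -StartupType Disabled
}
```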

Step Four was a simple file copy of the entire D drive over to the T drive at the root level; I wanted both volumes to match in everything but drive letter.
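
A robocopy call along these lines handles the copy while preserving permissions and timestamps (the log path is just an example); with the SQL services down, nothing should be holding any of the files open.

```powershell
# Copy everything from the instance store volume (D:) to the new EBS volume (T:),
# including empty folders, NTFS permissions, and directory timestamps.
robocopy D:\ T:\ /E /COPYALL /DCOPY:T /R:2 /W:5 /LOG:C:\Temp\D-to-T-copy.log
```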

Step Five involved going into the Computer Management Console and then Disk Management to change the drive letter of the current D drive to any other letter that was not in use. Once the D mapping was freed up, we could change the drive mapping of T to D. Keep in mind that the entire time, the SQL Server services were all down. All the startup information pointed to the D volume, and we now had a new D volume that looked identical to the old one but sat on storage that persists through instance shutdowns at the AWS dashboard level. (The instance store drive does maintain state during a VM guest reboot, which I had performed multiple times over the course of configuring each EC2 instance.)
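
The drive letter shuffle itself is two lines with the Storage module; X here is simply a letter that was not in use in my hypothetical sketch.

```powershell
# Free up D: by moving the instance store volume to a temporary letter,
# then promote the new EBS volume from T: to D:.
Set-Partition -DriveLetter D -NewDriveLetter X
Set-Partition -DriveLetter T -NewDriveLetter D
```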

Now it was time for the test: bringing the SQL services back up from SQL Server Configuration Manager. We did this for the primary instance first and verified that the databases were accessible and no errors were in the log. I quickly followed suit with the SQL services on the secondary. Then it was a matter of reversing out of our work: I took care of the SQL services, re-enabling them and bringing them online, then brought the cluster back online, followed by the cluster resource for the AG. Everything looked fine, but we still needed one last test: an EC2 shutdown.
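
For the verification piece, a quick query against each replica beats eyeballing folders. Something like the following (using the SqlServer module, with a placeholder server name) confirms that every database is online and shows the synchronization state of the AG databases.

```powershell
# Assumes the SqlServer PowerShell module; the server name is a placeholder.
Import-Module SqlServer

# Every database should report ONLINE.
Invoke-Sqlcmd -ServerInstance 'SQLAG01' `
    -Query 'SELECT name, state_desc FROM sys.databases;'

# Synchronization state of each database in each Availability Group.
Invoke-Sqlcmd -ServerInstance 'SQLAG01' -Query @'
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups   AS ag ON ag.group_id   = drs.group_id;
'@
```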

I enlisted the server engineer to shut down the instance for the secondary after having made sure the AG was in asynchronous-commit mode and the SQL services on the secondary were stopped. This process actually takes a good 5-10 minutes, but when all was said and done the “old D” drive was gone and the new D drive remained. I fired up SQL Server and the AG became healthy and in sync rather quickly. We then repeated the process with the primary replica: I failed over to the replica we had just tested the restart on, and after the former primary came back online we did another failover to get us back to where we started.
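
For the curious, the AG changes around those restarts boil down to a few ALTER AVAILABILITY GROUP statements. The AG and server names below are placeholders, and keep in mind that a planned failover without data loss requires the target secondary to be back in synchronous-commit mode and fully synchronized first.

```powershell
# 1. Before the secondary's shutdown test: switch it to asynchronous commit
#    so the primary is not held up while the secondary is offline (run on the primary).
Invoke-Sqlcmd -ServerInstance 'SQLAG01' -Query @'
ALTER AVAILABILITY GROUP [AG_Finance]
MODIFY REPLICA ON 'SQLAG02'
WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
'@

# 2. Before failing over: return the replica to synchronous commit and wait for
#    it to report SYNCHRONIZED.
Invoke-Sqlcmd -ServerInstance 'SQLAG01' -Query @'
ALTER AVAILABILITY GROUP [AG_Finance]
MODIFY REPLICA ON 'SQLAG02'
WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
'@

# 3. Planned manual failover, run against the secondary that is taking over.
Invoke-Sqlcmd -ServerInstance 'SQLAG02' `
    -Query 'ALTER AVAILABILITY GROUP [AG_Finance] FAILOVER;'
```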

Ultimately we repeated this process on the other AGs and were able to keep to schedule. Amid all the success, and the catastrophe that could have been, a few things should be taken as lessons learned.

First of all: don’t implement technology you’re unfamiliar with. The customer had limited knowledge of the AWS platform and its offerings. They didn’t do their due diligence, didn’t reach out for architectural guidance through the support channels they were already paying for, and didn’t train the technical infrastructure staff charged with implementing (and initially architecting) the solution until after the environment was built. Training may seem expensive, but losing your entire financial footprint and not being able to perform any billing tasks while you rebuild your infrastructure from the ground up is even more expensive.

Secondly: double and triple check the work you’ve relied upon others to complete in advance of your own. From the outside these drives looked identical, but a review of the technical specifications, or even just five minutes in the AWS dashboard, would have exposed this issue well before it was discovered.

Finally: don’t be afraid to ask questions and seek out assistance when working in unfamiliar areas. None of us know everything. Success is a team effort.

In the end crisis was averted, the customer was happy, and I was paid; and dare I say it: they all lived happily ever after.
