Admitting that you have made a mistake is never easy. When it happens at work, it can be even more difficult to own up to because in some cases you may feel your career is on the line. But how you deal with the aftermath of a mistake as an IT pro can go a long way in showing your manager that you are dependable in a crisis and know how to use creative problem-solving under pressure.
While not all IT screw-ups go as viral as this one on Reddit, mistakes can happen at any time and to companies of any size. For a recent example, look at Cisco. Earlier this month, its engineering team made a configuration error on its Meraki object storage, leading to a loss of user data. The company had to work over the weekend to investigate what data could be recovered and what tools it could build to help customers identify the data that had been lost.
In this example, the issue was externally facing, so a PR strategy had to be devised. But what if the issue is more internally facing, like a company server that has been fried with no backups?
IT Pro asked experts to weigh in on what the best course of action is in the case that a server goes down, and the backups that were supposed to be in place are not working. What are the steps that an IT pro should take before, during, and after to recover from the situation and move on?
Before a similar scenario happens to you, it is important to know that there are many preventative steps that can be taken. But if it does happen, know that you are hardly the first person to deal with it, and you won’t be the last.
Understanding the Business Requirements of Backup
ClearSky Data CTO and co-founder Laz Vekiarides recalls a situation with one of his customers who was backing up data from a legacy system, only to realize that the data they needed to retrieve had actually been lost … five years earlier. The problem was that the data backup was not tested, so no one knew it wasn’t working until it was too late.
“They wanted to retrieve a piece of data that was backed up and they realized that their backups were fried,” he said.
Not checking that backups are working is one of the biggest mistakes Vekiarides sees companies make. He said that companies should have a discipline of doing a cursory check on systems now and then to make sure that they work and bring back data.
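A cursory check of this kind can be automated. The sketch below is one possible approach, not something Vekiarides described: it restores a backup into a scratch directory and compares SHA-256 checksums against the live files. The function names are hypothetical, and the copytree call is only a stand-in for whatever restore step a real backup tool provides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, backup_dir: Path) -> list[str]:
    """Restore backup_dir into a scratch directory and report mismatches.

    Returns one message per source file that is missing from the restored
    copy or whose contents differ. An empty list means the check passed.
    """
    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored"
        # Assumption: a plain directory copy stands in for the real restore.
        shutil.copytree(backup_dir, restored)
        for original in sorted(source_dir.rglob("*")):
            if not original.is_file():
                continue
            relative = original.relative_to(source_dir)
            candidate = restored / relative
            if not candidate.is_file():
                problems.append(f"missing: {relative}")
            elif sha256_of(candidate) != sha256_of(original):
                problems.append(f"corrupt: {relative}")
    return problems
```

Run on a schedule, a check like this would flag a backup that has quietly stopped working. Files that changed after the backup ran will also show up as mismatches, so it is best pointed at data that should be stable.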
Vekiarides has worked in technology for around 20 years, starting in networking before moving to data storage in 2002 at EqualLogic, where he ran the development team. When Dell acquired EqualLogic in 2007, he ran the software development organization. He founded ClearSky Data with CEO Ellen Rubin in 2015 to provide storage-as-a-service to customers storing hundreds of terabytes of data.
“The best practices in general involve periodic backups and you have to make sure that you adjust your backups either snapshots or physical copies of data … you need to make sure that they are done with the correct periodicity,” he said. “Each application is different. Each application has a particular window of time which is the amount of tolerable data loss, and it really depends on the business need.”
For example, he says, test and development data will require a different resiliency plan than a system of record for a point-of-sale system, which has a “very low tolerance for data loss.”
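The idea of a per-application window of tolerable data loss can be written down and checked. The following is a minimal sketch under stated assumptions: the application names and the window lengths are hypothetical values chosen for illustration, not figures from Vekiarides, and the function simply flags any application whose most recent backup is older than its window.

```python
from datetime import datetime, timedelta

# Hypothetical tolerances: how much recent data each application can
# afford to lose. Real values depend on the business need.
TOLERABLE_LOSS = {
    "point_of_sale": timedelta(minutes=5),
    "test_and_dev": timedelta(days=7),
}


def stale_backups(last_backup, now, tolerances=TOLERABLE_LOSS):
    """Return the applications whose newest backup is outside its window.

    last_backup maps an application name to the time its most recent
    backup finished. An application with no recorded backup counts as stale.
    """
    stale = []
    for app, window in tolerances.items():
        finished = last_backup.get(app)
        if finished is None or now - finished > window:
            stale.append(app)
    return sorted(stale)
```

With these example windows, a point-of-sale backup that is an hour old would be flagged, while a day-old copy of test and development data would not.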
A mismatch between backup and business requirements can be disastrous. Vekiarides said he has seen one example where researchers building enormous data sets with hundreds of terabytes of data are “throwing them on storage that is not backed up at all.”
“So if anything bad were to ever happen, they would lose a year’s worth of work,” he said.
Marty Puranik is CEO of Atlantic.Net, a cloud services and web hosting company based in Orlando. He agrees that testing backups is essential. It is advice that GitLab shared earlier this year among the lessons it learned after losing 300GB of customer data from its primary database server.
“The first thing is to test your backups so you don't end up in that situation. But, if something bad happens like this, the best course of action is to come clean and start working on the next steps. By doing this, you avoid the problems of trying to cover-up and get to a solution faster which is what management really wants,” Puranik said in an email.
‘Crowd-Sourcing Panic Mode’
Let’s say you thought your backups were running smoothly, but something has gone wrong and now you realize you can’t retrieve the data.
First, the experts agree: don’t panic, and definitely don’t try to ignore it.
“Eventually someone is going to notice that this data is missing, especially if it’s critical data for the company,” said John Martinez, Evident.io's VP of Customer Solutions. And worse than losing your job, the company could face legal action depending on the type of data that has been lost.
Martinez has worked at cloud security company Evident.io for the past three and a half years. Prior to that, he worked in sysadmin roles at Netflix and Adobe Cloud. He said he has seen scenarios in his career where backups disappeared and the data was long gone, but fortunately there are ways to recover.
The first question to ask yourself in this situation is: how important is that data? If it is absolutely critical, there are strategies that you can employ to help piece together some of the missing data, Martinez said. The first one is what he calls “crowd-sourcing panic mode.”
“We start talking to engineers that have been in the organization for a long time, they might have something squared away on their laptop, they might have an offline copy somewhere,” he said. “This is where you sort of go from ‘what are security best practices of having the data and the data retention’ to where ‘we’re not going to judge you based on you having a copy of this particular data on your laptop because you’re saving the company’s bacon.’”
“In all of the situations I’ve lived, even though it might be egg on the face for the person responsible for the backup policies, etc. it’s one where we sort of rally around and try to do what’s right for the business,” he said.
“One piece of advice that I’d have for an up and coming systems engineer that’s in the thick of it is to not panic, don’t worry about losing your job, but worry about doing what’s right and getting the data back,” Martinez said.
Atlantic.Net’s Puranik adds that keeping a level head and acting professionally throughout is critical.
“If you handle it as a professional, and get through the crisis it’s usually OK,” Puranik said. “In addition, you would want to follow up with a post-mortem explaining how it happened, and what you're doing, so it doesn't happen again. Most managers realize IT pros are human, but it’s important for IT professionals to act professionally, especially when things go wrong (as they always will, given enough time).”
Reverse Engineering, Using Cloud Tools
From a technical perspective, there are ways for IT pros to piece together missing data through reverse engineering, Martinez said.
In the case of data loss around a product, look at code repositories to get back to the best state you can so you can fill in the blanks, he said.
Cloud providers offer tools built around snapshots and multiple regions, which can be extremely useful when it comes to retrieving data. “I can have much greater insurance tools at my disposal with the cloud,” Martinez said.
Finally, make sure that you identify what happened so you can fix that business problem moving forward. Be sure to do your research and leverage the tools that exist to help you automate a lot of the processes around data backup -- just don’t forget to test them.