If something can go wrong, it will go wrong. Usually at about 4:50 p.m. on a Friday when you’ve got reservations for a meal at a nice restaurant with your partner at 7:00 p.m. Even more likely if you’ve organized a babysitter for the evening. Nothing attracts bad luck like the possibility of extreme inconvenience.
It’s at this time that the whole idea of the “Cloud” sounds awesome - because surely if you used the “Cloud” you wouldn’t have some storage array on the SAN in your data center trying to chew itself to pieces in some sort of bizarre late Friday afternoon suicide ritual. Well, it might happen - but that would be the “Cloud’s Problem” and not yours. Perhaps infrastructure outsourcing is just a more direct method of redirecting bad systems karma to another team of geeks.
I’m not sure how superstitious most systems administrators are, but I’m definitely one who assumes that if someone says “it can’t get any worse than this” then odds are that the universe is going to find a way to prove that statement incorrect.
Systems Administration is the art of operationalizing pessimism. You think up ways that stuff can go wrong and you then come up with workarounds. You back up data so that in the event that it becomes corrupted or the disk hosting it fails, you’ve got a workaround. You use clustered servers so that if one server fails spectacularly, you’ve got another server there to take the load. You use redundant networks so that if one switch or router decides to fry its internal electronics, you’ve got another one that will quietly keep the packets flowing.
But you don’t need to cluster everything and you don’t need redundant networks everywhere. In some places you’ll be fine with the downtime it takes to pull a spare bit of network hardware out of storage and replace it, rather than spending money so that each piece of network hardware has a failover. You don’t need to host every SQL Server database on a failover cluster. In a lot of situations, just using replication to another SQL box will be adequate.
One of the problems that systems administrators face is that as human beings, we really aren’t very good at assessing risk. That’s why we get all panicky about the possibility of sharks when we’re swimming at the surf beach in summer, but we don’t really worry about the drive down to the surf beach, even though, statistically, we’re more likely to come to harm on the drive than in the water.
In some ways a lot of profit in the IT security industry is based on the inability to assess risk. It’s easier to sell a solution to a scary problem than it is to sell a solution to a more prosaic one.
As systems administrators, we need to be rational about our pessimism. We only have so many resources and so much time, so we have to protect against the things that are likely to cause us problems, not the things that might, in theory, cause us problems. It isn’t the 1-in-a-million events that we need to deal with first (with apologies to Terry Pratchett’s Discworld probabilities), but the 1-in-a-thousand and 1-in-ten-thousand events. Figuring out the precise probabilities of certain events occurring is very difficult (even actuaries are making guesstimates) - but when you’re assessing risk, try to order your risks into “more likely” and “less likely” and deal with the “more likely” ones first.
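If it helps to see that ordering written down, here is a minimal Python sketch of the idea. Everything in it is an assumption made up for illustration: the `Risk` class, the `triage` function, the example risks and especially the likelihood numbers, which are rough guesses rather than measured values. Your own list and your own guesses will look different.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    # Rough guess at how likely the event is; not a measured probability.
    likelihood: float


def triage(risks, threshold):
    """Split risks into 'more likely' and 'less likely', most likely first."""
    ranked = sorted(risks, key=lambda r: r.likelihood, reverse=True)
    more_likely = [r for r in ranked if r.likelihood >= threshold]
    less_likely = [r for r in ranked if r.likelihood < threshold]
    return more_likely, less_likely


# Hypothetical risks with invented likelihoods, purely for illustration.
risks = [
    Risk("meteor strikes data center", 1e-6),
    Risk("disk failure on database server", 1e-3),
    Risk("switch fries its electronics", 1e-4),
]

more_likely, less_likely = triage(risks, threshold=1e-4)

for risk in more_likely:
    print(f"deal with first: {risk.name}")
for risk in less_likely:
    print(f"deal with later: {risk.name}")
```

The exact numbers matter far less than the ordering. Even a guess that is off by a fair margin will usually still put the disk failure ahead of the meteor.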
If you deal with the more likely risks first, you’re also more likely to be able to make that dinner at 7:00 p.m. on Friday night instead of spending it knee-deep in the guts of a server, finding ever more creative ways to use expletives to describe your precise views about the profession of systems administration.
Follow me on Twitter: @orinthomas