How a site outage is like your relationship with your manager

By Scot Herrick | Job Performance

Jan 05

Almost all web sites out there are hosted by some other company. Cube Rules relies on a hosting partner for the servers, software, network, and support so I don’t have to keep servers and an Internet connection here in my house. It is a good service, and it works most of the time. “Most of the time” being the operative words. Since I’ve had Cube Rules, I’m now on my third hosting vendor, having left the first two because I was not satisfied with their service or page load times. My third hosting vendor had a big hiccup on Sunday night. It cost me my site for most of Monday, and when the site came back up on Monday, the performance sucked.

The bad. That turns to ugly.

From the e-mail on what happened:

Two nights ago as most of you know (we sent out an email blast about it) there was a short spurt of downtime as a result of a failed drive in one of server9’s array. No problem, or so we suspected. One of the technicians on-site (3rd party) swapped the wrong drives in the array which quickly brought the system down temporarily until it was reverted. Again, normally a null issue (though some glitches expected, not life-expectancy reducing by any means).

To cut things short: However, as a result of the system being under heavy load later that afternoon it croaked. We attempted multiple types of revival, tweaks and modifications to bandage it up so it would survive peak hours & we could move users this evening. Our first, and only error throughout the entire ordeal was that we took far too long attempting to regain existing system viability instead of simply giving up 30 minutes in and starting a bare-metal restore to more powerful hardware.

That means things were so bad they threw up their hands and started over with a new server (bare-metal restore). Fortunately, it is a much better server.

I’ve worked in and around IT long enough to know that bad things happen from the strangest of places — like replacing the wrong hard drive in an array. All the processes in the world with all the methodology in the world won’t prevent all bad things from happening. You might get close, but you won’t be perfect. When “not quite perfect” is particularly bad, I end up with a site that isn’t available to my users for most of the day.

Turn your lemons into lemonade

It ends up like this:

We’ve stated time and time again that emergencies are going to occur, and there’s little we can do to stop them, but we can mitigate their effects & prepare for every angle upfront. During the past few months we’ve done nothing but concentrate on our back-end to prepare for the worst: And, in all honesty, we’ve done alright despite the circumstances, but there’s still significant amounts of room for improvement and we’ll continue to do so.

What this hosting company did right, despite the awful outage, was this:

  1. When the first outage occurred, an e-mail blast went out explaining there was a problem
  2. When things got worse, another e-mail went out to the affected sites explaining what the plan was
  3. When the move to the new server was happening, progress reports were e-mailed out
  4. When the outage was considered over, an e-mail was sent to let everyone know that all is well — so go check your sites
  5. And since e-mail accounts were blown up (mine included) when the sites were down, if you opened a support ticket, updates were added to the ticket as they happened.
  6. A last e-mail (from which I quote above) that provides a summary of the event, the steps taken to restore service, and the lessons learned.

They prepare for outages up front. They communicate proactively when there are problems. They formulate or use a mitigation strategy to minimize the problem. They are not afraid to make the decision that is right despite the big downtime to users. They offer a credit for the downtime and implement it. They have systems in place to make sure they have all options covered for communication. They learn from their mistakes.

Think this is about some vendor? No, it’s also about you.

Now, you might think that has nothing to do with you and your job. But it does. Think of me as owning this site, with a vested interest in having it work right, and my hosting company as one of my employees. Things are going to go wrong between us, no matter how great we are in our work. Would you rather have the employee who has thought these things through and communicates issues early and often? Or the employee who hopes you never find out there were issues in their work, leaving you to discover it without them telling you?

Dig up your own mud. Communicate the mud to your manager. You’ll be perceived as proactive and consultative, someone looking for help and support. If, of course, you have a good manager.

When something bad happens that your manager needs to know about, what do you do?

Photo by BobMical

  • Zack Pike says:

    Great correlation Scot! Too many times companies and employees alike want to hide their faults… When in reality if they just communicated their “mud” everyone would benefit. I actually went through a similar hosting situation several months ago… Only difference is that they didn’t communicate their mud to me. I had to find out my site was down on my own, call them, call them again, call them again, then make a big stink on Twitter to get them to do anything about it. They ended up making good in the end… But I often wonder if they just did that because I was tweeting my experience.

  • Anonymous says:

    I know that was a hugely frustrating experience for you, Scot. And I love the on-message lesson you’ve highlighted in it. Fabulous!
