Almost all web sites out there are hosted by some other company. Cube Rules relies on a hosting partner for the servers, software, network and support so I don’t have to run servers and an Internet connection here in my house. It is a good service and it works most of the time. “Most of the time” being the operative words. Since I’ve had Cube Rules, I’m now on my third hosting vendor because I was not satisfied with the service or page load times from the previous two. My third hosting vendor had a big hiccup on Sunday night that cost me my site for most of Monday — and when the site finally came back up, the performance was terrible.
From the e-mail on what happened:
Two nights ago as most of you know (we sent out an email blast about it) there was a short spurt of downtime as a result of a failed drive in one of server9’s array. No problem, or so we suspected. One of the technicians on-site (3rd party) swapped the wrong drives in the array which quickly brought the system down temporarily until it was reverted. Again, normally a null issue (though some glitches expected, not life-expectancy reducing by any means).
To cut things short: However, as a result of the system being under heavy load later that afternoon it croaked. We attempted multiple types of revival, tweaks and modifications to bandage it up so it would survive peak hours & we could move users this evening. Our first, and only error throughout the entire ordeal was that we took far too long attempting to regain existing system viability instead of simply giving up 30 minutes in and starting a bare-metal restore to more powerful hardware.
That means things were so bad they threw up their hands and started over with a new server (bare-metal restore). Fortunately, it is a much better server.
I’ve worked in and around IT long enough to know that bad things happen from the strangest of places — like replacing the wrong hard drive in an array. All the processes in the world with all the methodology in the world won’t prevent every bad thing from happening. You might get close, but you won’t be perfect. When “not quite perfect” is particularly bad, I end up with a site that isn’t available to my readers for most of the day.
It ends up like this:
We’ve stated time and time again that emergencies are going to occur, and there’s little we can do to stop them, but we can mitigate their effects & prepare for every angle upfront. During the past few months we’ve done nothing but concentrate on our back-end to prepare for the worst: And, in all honesty, we’ve done alright despite the circumstances, but there’s still significant amounts of room for improvement and we’ll continue to do so.
What this hosting company did right, despite the awful outage, was this:
They prepare for outages up front. They communicate proactively when there are problems. They formulate or use a mitigation strategy to minimize the problem. They are not afraid to make the decision that is right despite the big downtime to users. They offer a credit for the downtime and implement it. They have systems in place to make sure they have all options covered for communication. They learn from their mistakes.
Now, you might think that has nothing to do with you and your job. But it does. Think of me owning this site, with a vested interest in having it work right, and my hosting company as one of my employees. Things are going to go wrong between us, no matter how great we are in our work. Would you rather have the employee who has thought these things through and communicates issues early and often? Or would you want the employee who hopes you never find out there were issues in their work — and you discover them without being told?
Dig up your own mud. Share the mud with your manager. You’ll be perceived as proactive and consultative — someone looking for help and support. If, of course, you have a good manager.
When something bad happens that your manager needs to know, what do you do?