Incidents with a single service are probably the most common type of incident and usually get resolved fairly quickly. Examples here are a blown power supply that results in server failover (or not), bad GBICs that result in dropped packets, daemons that crashed, etc.
The incident usually starts with an email, a text message, a phone call from the NOC, or one of your clients popping their head into your office with their hair on fire. You need to get your head straight and quickly run through a simple process:
Get your head straight
First, stay calm. The worst thing you could do is cause a major outage, destroy some data, or make the existing problem worse in a panic. Simple problems can easily become large complicated problems after a few bad decisions made in haste. Take a breath before continuing. This is especially important with a page at 3AM or if a panicky client is in your office. Tell the client you’ll handle the problem and run through your normal procedure.
Don’t be a hero. Get someone else to run the incident if your judgment is impaired due to lack of sleep, alcohol, or medication.
Remember the prime directive – your job is to restore service as quickly as possible. You are not there to debug interesting problems with your service.
Solve the problem immediately if it’s a simple problem and you can do it in under a minute
It can take up to five minutes to get into a room, get the right people on the line, etc. If you can fix it in less than a minute, just fix it and send out an email after the fact.
For example, let’s say you get an alert from your performance monitoring system that some of your connections are timing out, followed by an alert that one of your webservers is running out of memory (but all other webservers look fine). Let’s say a large job on the webserver is creating an intermittent high load, but not enough for the load balancer to yank the machine on its own. By all means, pull this webserver out of the load balancer pool yourself so that new connections aren’t forwarded to it and service is restored quickly.
In doing so, you’ve resolved an intermittent problem with the service. You can then send out an advisory alert before debugging what went wrong or coming up with a better load balancer monitor (or you can go back to sleep).
The danger here is letting this one minute investigation turn into five or ten minutes. If you can’t fix it in a minute, get busy with the normal process.
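The load balancer example above can be sketched in a few lines of Python. This is only an illustration: `Pool` and `drain` are hypothetical names that model a pool in memory, not any real load balancer's API, and the `min_active` guard is an added assumption so that the quick fix can't empty the pool and make things worse.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Pool:
    """Hypothetical in-memory stand-in for a load balancer pool (not a vendor API)."""
    active: Set[str] = field(default_factory=set)
    drained: Set[str] = field(default_factory=set)

    def drain(self, host: str, min_active: int = 1) -> bool:
        """Stop forwarding new connections to host.

        Returns False, changing nothing, if the host is not in the pool
        or if removing it would leave fewer than min_active hosts.
        """
        if host not in self.active:
            return False
        if len(self.active) - 1 < min_active:
            return False
        self.active.remove(host)
        self.drained.add(host)
        return True


pool = Pool(active={"web1", "web2", "web3"})
if pool.drain("web2"):
    print("web2 drained; remaining:", sorted(pool.active))
```

In practice the drain would be whatever single command or button your load balancer provides. The point is that it is one quick, reversible action, which is what makes it a fair candidate for the one-minute fix.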
Do the normal process
Determine the severity, communicate an alert, and get backup.
I’m always stunned when I see someone take a call at 3AM and jump right into fixing the problem without assessing whether it can wait until morning. Determining the severity is a critical step in ensuring that you don’t turn a problem in the middle of the night into a much larger problem. If it can wait until business hours, when you’re awake and more people are around for support, by all means wait until the morning.
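One way to make the "can it wait until morning?" question explicit is a tiny triage function. The three inputs below (`customers_affected`, `data_at_risk`, `likely_to_worsen`) are assumptions chosen for illustration. Your own severity criteria will differ, so treat this as a sketch of the decision rather than a rule.

```python
from enum import Enum


class Action(Enum):
    FIX_NOW = "fix now"
    WAIT_UNTIL_MORNING = "wait until business hours"


def triage(customers_affected: bool, data_at_risk: bool, likely_to_worsen: bool) -> Action:
    """Decide whether a 3AM page needs work right now (hypothetical criteria)."""
    if customers_affected or data_at_risk or likely_to_worsen:
        return Action.FIX_NOW
    # Nothing urgent: leave it for when you're awake and have backup.
    return Action.WAIT_UNTIL_MORNING


# A failed power supply where failover worked and service is unaffected:
print(triage(customers_affected=False, data_at_risk=False, likely_to_worsen=False).value)
```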
You also want backup if it’s an extended event (more than five minutes). The extra person can take over communication duties while you resolve the problem. This two-person approach works well: first, you don’t forget to send regular updates, and second, you have a co-worker to bounce ideas off of. Just talking through a problem with a co-worker can often help identify the problem. Don’t be a hero. Get help.
Make sure you have a screen sharing room and phone conference reserved for your team. You can’t waste time passing around bad passwords or having multiple people play phone conference whack-a-mole during an outage. Think of this like a fire drill. Everyone should know how to exit the building and where to assemble.
A normal process means you’ve gone through this before. If you haven’t – practice with simulated events! Also make sure you have standard email templates on multiple servers or your laptop. It’s critical that your email template identifies the incident severity, a short description, and the time of your next update. This will prevent panicky people from bugging you with status requests.
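As a rough illustration of such a template, here is a small Python sketch that renders an update containing the three required pieces: severity, a short description, and the time of the next update. The class name, field names, severity label, and 30-minute default are all made up for the example. Use whatever your team has agreed on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentUpdate:
    """Hypothetical incident email template; all field names are illustrative."""
    severity: str
    service: str
    summary: str
    sent_at: datetime
    next_update_minutes: int = 30  # assumed default interval between updates

    def render(self) -> str:
        """Return the email text, including when the next update is due."""
        next_update = self.sent_at + timedelta(minutes=self.next_update_minutes)
        subject = f"[{self.severity}] {self.service}: {self.summary}"
        body = (
            f"Severity: {self.severity}\n"
            f"Description: {self.summary}\n"
            f"Next update: {next_update:%Y-%m-%d %H:%M}\n"
        )
        return f"Subject: {subject}\n\n{body}"


update = IncidentUpdate(
    severity="SEV2",
    service="web",
    summary="Intermittent connection timeouts",
    sent_at=datetime(2024, 1, 1, 3, 0),
)
print(update.render())
```

Stating the next update time up front is what keeps people from asking for status in between, so it is worth making that field impossible to forget.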
Identifying what is actually wrong is an art in itself and something I’ll try to address in a future post.