Friday, April 30, 2010

Tips for Handling Events, Incidents, Outages, and Maintenance

I get a lot of questions from new service teams about what they should do to prevent downtime, but very few people ask for advice on how to handle an incident. This is a bit like asking a boxer for the best way to avoid getting in the ring. It’s not a question of “if” you’re going to be in the ring but “when”. There’s an old saying: the more you bleed in the gym, the less you bleed in the ring. That definitely applies to incident management as well.

Having sat in on more war rooms than I’d like to remember, I thought it might be handy to write down some of the things that my team has found useful over the years. I think every service organization should have a standard approach towards three specific activities:

1.    Tips for Handling Service Incidents (just one service)
2.    Tips for Handling Service Outages (multiple services affected)
3.    Tips for Handling System Maintenance

I hope these posts help you with your handling of incidents, outages, and maintenance. Success here is mostly about being prepared, staying calm, communicating well, and practice, practice, practice. If you think your service is bullet-proof and you won’t need the practice – you’re wrong :-)

Tips for Handling Service Incidents

This is the first post related to “Tips for Handling Events, Incidents, Outages, and Maintenance”.

Incidents with a single service are probably the most common type of incident and usually get resolved fairly quickly. Examples here are a blown power supply that results in server failover (or not), bad GBICs that result in dropped packets, daemons that crashed, etc.

The incident usually starts with an email, a text message, a phone call from the NOC, or one of your clients popping their head in your office with their hair on fire. You need to get your head straight and quickly run through a simple process:

Tips for Handling Service Outages (multiple services affected)

This is the second post related to “Tips for Handling Events, Incidents, Outages, and Maintenance”.

You’re about to have an interesting day/night. Multiple business-critical services are offline or having intermittent problems. It affects revenue, your company-wide-outage processes have been started, and you’re the lucky person on the on-call roster to lead the outage.

This is a different beast than resolving a problem with a single service. Here, you’re going to have to coordinate across several services and try to get the entire system up and running as soon as possible. The biggest obstacles here are coordination, communication, and discipline. This requires lots of practice before you get good in the role.

Tips and Tricks for System Maintenance

This is the third and final post related to “Tips for Handling Events, Incidents, Outages, and Maintenance”.

System maintenance is a more relaxed version of what the previous posts covered, but there’s still a high probability that you might cause an incident during your maintenance. The difference is that you actually get to plan for it. System upgrades and maintenance can be very complicated and involve multiple teams doing many things in parallel. I’ve seen several maintenance periods where there wasn’t a clear plan and people were working on things when they weren’t asked to, causing all sorts of problems (imagine your network team killing the firewall before you diverted services to another cluster). This is never a pretty situation.

Monday, April 26, 2010

HBase Performance Post

The folks at posted a great overview of their latest HBase performance testing. There are some hard-earned performance testing nuggets in the article along with interesting random read/write and map-reduce stats. They also have a small peek at their hardware configuration (lots of spindles!).