Wednesday, May 5, 2010

Planning for SaaS Infrastructure Failure – part 1

There's nothing quite like a good Single Point of Failure (SPOF) during a holiday dinner. Most folks I talk to immediately think of network redundancy when I bring up this topic and tend to look at me strangely when I talk about other parts of their infrastructure. I thought I'd throw a short post together on how to plan for failure in your SaaS infrastructure but, as usual, it's turned into a much longer post than I intended, so I'm breaking it into three parts. Here we go with part 1.



Start with a meteor strike
Most people laugh when I say this, but you should literally start your failure analysis with "what happens if a meteor strike takes out the building or the power lines to our data center?" In other words, start with a full disaster recovery scenario and work your way all the way down the stack. If you think this is overkill, remember that two highly redundant data centers were both taken out by previously unknown single points of failure.

Before jumping into these examples, I want to point out that Rackspace and 365 Main are two very well respected companies, so these weren't fly-by-night operations that got taken out. Both providers also did an excellent job of publishing what happened on their websites after each event (see links below).

Rackspace Examples

Rackspace suffered two outages, in June and July of 2009, at their DFW facility. In the June outage, a power interruption triggered a switch to generator power. So far, so good: the backup systems kicked in. Unfortunately, the generators then failed.

Battery power only lasts around 10-15 minutes in most of the data centers I've worked at, so that's all the time you have, depending on the power draw (and the state of your UPS batteries), to get the generators running. Not much room for error there.
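
To make that runtime math concrete, here's a minimal sketch. The numbers are purely illustrative assumptions, not figures from any particular facility:

```python
# Sketch: rough UPS runtime from usable battery energy and critical load.
# Both numbers below are illustrative assumptions.

usable_battery_kwh = 100    # usable energy in the UPS battery string
load_kw = 400               # critical load being carried

runtime_minutes = usable_battery_kwh / load_kw * 60
print(f"~{runtime_minutes:.0f} minutes on battery")   # prints "~15 minutes on battery"

# Double the load (or halve the usable capacity as the batteries age)
# and you're down to roughly 7 minutes to get the generators online.
```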

In the July 2009 outage, Rackspace again switched to generator power due to a power interruption. Unfortunately, a problem with a "bus duct" prevented a UPS from operating properly, and some customers lost power to their servers for about 20 minutes. They also suffered a loss of network connectivity due to the power disruption.

In 2007, there was also the infamous truck crash, where a traffic accident damaged a nearby utility transformer and knocked out power to Rackspace's Dallas facility. The company switched over to generator power, but two chillers failed to start back up. Servers had to be shut down to prevent damage from overheating.

365 Main Example

365 Main is a major data center smack in the middle of San Francisco. This is probably starting to sound familiar -- a power interruption caused a fail-over to generators, and three of the ten generators failed to start. The outage took down Yelp, Craigslist, Technorati, LiveJournal, TypePad, and many others. In this case, the failure was due to a bad setting in the Detroit Diesel Electronic Controller (DDEC) for the generators. The setting "was not allowing the component to correctly reset its memory. Erroneous data left in the DDEC's memory subsequently caused misfiring or engine start failures on the next diesel engine call to start."

That's right. The backup mechanism for a power failure is a diesel generator, and that generator depends on an electronic component, presumably with software running on it. I'm fairly ignorant of these diesel generators, so let's assume the DDEC has a redundant controller. Unfortunately, it would appear that the controller has a SPOF in a software setting -- or possibly a SPOF in the person who configured that setting.

I need a backup to my backup, which needs a backup...

There's a pattern here. In all four of these incidents, a primary system failed and there was a problem with the backup system or a system that the backup system relied on. More importantly, the systems that failed were regularly tested. Diesel generators, UPS units, chillers, and CRAC (Computer Room Air Conditioning) units are rigorously maintained and tested at every data center that I've worked in and yet these failures still happened.

Clearly, nodding your head when the colo salesperson tells you that you're getting into a tier-4 data center isn't enough due diligence on your part. You need to understand a whole lot more about your data center infrastructure, design your physical site properly, and prepare for failure. The facility may be tier-4, but you can very easily wire all of your critical systems to the same UPS and create a SPOF. You may even wire up all of your racks with fully independent power paths but then deploy your software across servers that share the same PDU.
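
To make that last failure mode concrete, here's a minimal sketch of the kind of placement check I mean, in Python with a made-up inventory format (substitute whatever your asset database actually exports): flag any service whose servers all hang off a single PDU.

```python
# Sketch: flag services whose servers all share a single PDU.
# The inventory data below is hypothetical example data.

server_pdu = {
    "web01": "pdu-a", "web02": "pdu-a",   # oops: both web servers share pdu-a
    "db01": "pdu-a", "db02": "pdu-b",
}

service_servers = {
    "web": ["web01", "web02"],
    "db": ["db01", "db02"],
}

def pdu_spofs(server_pdu, service_servers):
    """Return {service: pdu} for services whose every server is on one PDU."""
    spofs = {}
    for service, servers in service_servers.items():
        pdus = {server_pdu[s] for s in servers}
        if len(pdus) == 1:
            spofs[service] = pdus.pop()
    return spofs

for service, pdu in pdu_spofs(server_pdu, service_servers).items():
    print(f"SPOF: every '{service}' server is fed from {pdu}")
```

The same check works one level up for UPS feeds or utility circuits if your inventory tracks them.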

Now, a lot of you are probably thinking that you're just out of luck if the building has a power failure and you run into a double or triple failure, but I don't believe that's the case unless the building has completely lost power or cooling. In all of the failure cases above, note that only a portion of the data center was down at any one time. You can avoid being down if you plan for failure across your entire SaaS infrastructure (facility, hardware, and software).

For example, if you assume that a PDU will fail (and include it in your failure analysis plan), you will verify that each power drop to your rack comes from a separate UPS and that your service will stay up if a UPS battery bank explodes (yes, I've had this happen). You also won't assume that the receptacles to each rack are labeled correctly; you'll do a circuit breaker test. You won't even trust that the circuit breaker is rated appropriately, and you'll do a power burn-in test before putting the rack into service to make sure you can hit 80% of the circuit without tripping a breaker.
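
The burn-in arithmetic is simple but worth writing down. A quick sketch, assuming a 30 A breaker on a 208 V feed and the usual 80% continuous-load target (swap in your actual circuit specs):

```python
# Sketch: how much load a rack circuit should carry during a burn-in test.
# Breaker rating and voltage are example values, not recommendations.

breaker_amps = 30      # breaker rating on the rack circuit
voltage = 208          # single-phase 208 V feed (example)
derating = 0.80        # target: 80% of the breaker rating

target_amps = breaker_amps * derating
target_watts = target_amps * voltage

print(f"Burn-in target: {target_amps:.1f} A (~{target_watts:.0f} W)")
# Burn-in target: 24.0 A (~4992 W)
# If the breaker trips below this load, it's mis-rated or mislabeled --
# better to find out before the rack goes into service.
```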

Why do I torture colo vendors with these tests, and why do we monitor everything, including the temperature and humidity of every rack and the temperature of every hard disk in the data center?

Because I've answered the phone for all of the following: a circuit breaker that failed early, a CRAC unit with an airflow problem, rack power partially lost after someone flipped the wrong circuit breaker because a circuit was labeled incorrectly, PDU batteries that exploded during a building utility breaker replacement, fuses that blew when pushed into a not-so-forgiving rack PDU, redundant server power supplies that started smoking after failing to fail over, and an electrician who accidentally turned off both circuit breakers to a rack on the launch day of a major product (for the record, those circuits were labeled correctly; it was simply a mistake on his part during maintenance).

This doesn't even include environmental problems like CRAC unit failures, or network problems like an ISP black-holing all of your traffic between data centers (which has happened multiple times now). The truth is, almost every data center experiences these small events a few times a year, but most SaaS application owners are either too small to be impacted by them or aren't monitoring well enough to even notice.
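
As one example of the "monitor everything" point, here's a minimal sketch of a per-disk temperature check. It shells out to smartctl (from the smartmontools package) and parses the SMART temperature attribute; the device list and the 45 C alert threshold are placeholder assumptions you'd replace with your own.

```python
# Sketch: alert when any disk's SMART temperature crosses a threshold.
# Requires smartmontools and enough privilege to read SMART data.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]    # placeholder device list
THRESHOLD_C = 45                      # placeholder alert threshold

def disk_temp_celsius(device):
    """Return the SMART-reported disk temperature, or None if not found."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # 194 Temperature_Celsius 0x0022 110 099 000 Old_age Always - 37 (...)
        if len(fields) >= 10 and fields[1] in ("Temperature_Celsius",
                                               "Airflow_Temperature_Cel"):
            return int(fields[9])
    return None

for dev in DEVICES:
    temp = disk_temp_celsius(dev)
    if temp is not None and temp >= THRESHOLD_C:
        print(f"ALERT: {dev} is at {temp} C (threshold {THRESHOLD_C} C)")
```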

Once you start getting into over 20 racks of gear or a few thousand hard disks, you’re going to start seeing regular hardware failures and you need to be prepared.
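
To put a rough number on "regular", here's a back-of-the-envelope sketch. The ~2% annualized failure rate is an assumed round number for illustration, not a figure from this post or any particular vendor:

```python
# Sketch: expected disk failures per year/week for a mid-sized fleet.
# The AFR below is an assumed round number, not a measured value.

disks = 3000        # "a few thousand hard disks"
afr = 0.02          # assumed ~2% annualized failure rate

failures_per_year = disks * afr
failures_per_week = failures_per_year / 52

print(f"~{failures_per_year:.0f} failed disks/year, ~{failures_per_week:.1f}/week")
# ~60 failed disks/year, ~1.2/week -- a routine event, not a surprise.
```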

Now the good news

The good news is that you can protect yourself from many of these problems. In part 2 of this post, I'll cover the categories of failures (electrical, environmental, network, hardware, and software) and the components of each category, and I'll provide a mitigation step for each component. In part 3, I'll see if I can put together a checklist for reference.
