
Wednesday, May 5, 2010

Planning for SaaS Infrastructure Failure – part 1

There's nothing quite like a good Single Point of Failure (SPOF) during a holiday dinner. Most folks I talk to immediately think of network redundancy when I bring up this topic and tend to look at me strangely when I mention other parts of their infrastructure. I thought I'd throw together a short post on how to plan for failure in your SaaS infrastructure but, as usual, it's turned into a much longer post than I intended, so I'm breaking it into three parts. Here we go with part 1.

Friday, April 30, 2010

Tips for Handling Events, Incidents, Outages, and Maintenance

I get a lot of questions from new service teams about what they should do to prevent downtime, but very few people ask for advice on how to handle an incident. This is a bit like asking a boxer for the best way to avoid getting in the ring. It’s not a question of “if” you’re going to be in the ring but “when”. There’s an old saying – the more you bleed in the gym, the less you bleed in the ring – and that definitely applies to incident management as well.

Having sat in on more war rooms than I’d like to remember, I thought it might be handy to write down some of the things that my team has found useful over the years. I think every service organization should have a standard approach towards three specific activities:

1. Tips for Handling Service Incidents (just one service)
2. Tips for Handling Service Outages (multiple services affected)
3. Tips for Handling System Maintenance

I hope these posts help you with your handling of incidents, outages, and maintenance. Success here is mostly about being prepared, being calm, good communication, and practice, practice, practice. If you think your service is bullet-proof and you won’t need the practice – you’re wrong :-)

Tips for Handling Service Incidents

This is the first post related to “Tips for Handling Events, Incidents, Outages, and Maintenance”.

Incidents with a single service are probably the most common type of incident and usually get resolved fairly quickly. Examples here are a blown power supply that results in server failover (or not), bad GBICs that result in dropped packets, daemons that crash, etc.

The incident usually starts with an email, a text message, a phone call from the NOC, or one of your clients popping their head into your office with their hair on fire. You need to get your head straight and quickly run through a simple process:

Tips for Handling Service Outages (multiple services affected)

This is the second post related to “Tips for Handling Events, Incidents, Outages, and Maintenance”.

You’re about to have an interesting day/night. Multiple business-critical services are offline or having intermittent problems. It affects revenue, your company-wide-outage processes have been started, and you’re the lucky person on the on-call roster to lead the outage.

This is a different beast than resolving a problem with a single service. Here, you’re going to have to coordinate across several services and try to get the entire system up and running as soon as possible. The biggest obstacles are coordination, communication, and discipline. This requires lots of practice before you’re good in the role.


Tips and Tricks for System Maintenance

This is the third and final post related to “Tips for Handling Events, Incidents, Outages, and Maintenance”.

System Maintenance is a more relaxed version of the previous posts, but there’s still a high probability that you’ll cause an incident during your maintenance. The difference is that you actually get to plan for it. System upgrades and maintenance can be very complicated and involve multiple teams doing many things in parallel. I’ve seen several maintenance periods where there wasn’t a clear plan and people were working on things they weren’t asked to, causing all sorts of problems (imagine your network team killing the firewall before you’ve diverted services to another cluster). That’s never a pretty situation.

Monday, April 26, 2010

HBase Performance Post

The folks at hstack.org posted a great overview of their latest HBase performance testing. There are some hard-earned performance testing nuggets in the article along with interesting random read/write and map-reduce stats. They also give a small peek at their hardware configuration (lots of spindles!).

Thursday, April 15, 2010

SSD Primer and SandForce Controller

If you're new to SSDs or just want a fantastic in-depth review of the technology along with real-world testing, check out this anandtech.com article. If you're looking for a more recent analysis of SSD products, check out this March article, also from anandtech.com.

The SandForce controller has been getting a lot of press lately, following on the heels of the Indilinx controller. Here's a review of a new Corsair SSD with the SandForce controller.

The SandForce controller takes a different approach than other SSD controllers and has introduced deduplication and the RAID concept to NAND cells. They call their RAID approach RAISE (Redundant Array of Independent Silicon Elements). Check out more here.

nosql:eu conference (April 20-22) looks very interesting

I'm really looking forward to some of the slides and posts from the upcoming nosql conference in London.

The agenda is a drool-fest for any scalable KVP/metadata geek. I like how most of the talks are centered on real-world use of various technologies (Cassandra, CouchDB, Dynamo, HBase, MongoDB, Neo4j, Riak, etc.).

Here's a copy of the agenda:

Tuesday April 20
08.30  - Registration, Coffee and Mingle
09.30  - The Guardian's use of NoSQL - Matthew Wall, The Guardian
10.30  - Coffee break and mingle
10.50  - An overview of NoSQL - Alex Popescu, MyNoSQL
11.50  - Lunch break and mingle
13.00  - Key-value stores and Riak - Bryan Fink, Basho
14.00  - Coffee break and mingle
14.20  - Document-oriented databases and MongoDB - Mathias Stearn, 10gen
15.20  - Coffee break and mingle
15.40  - Column-oriented databases and Cassandra - Jonathan Ellis, Rackspace
16.40  - Coffee break and mingle
17.00  - Graph databases and Neo4j - Emil Eifrem, Neo Technology
18.00  - Evening party with loads of beer and mingle

Wednesday April 21
08.30  - Coffee and mingle
09.30  - On the Birth of Dynamo - Werner Vogels, Amazon
10.30  - Coffee break and mingle
10.50  - Twitter's use of Cassandra, Pig and HBase - Kevin Weil, Twitter
11.50  - Lunch break and mingle
13.00  - CouchDB at the BBC - Enda Farrell, BBC
14.00  - Coffee break and mingle
14.20  - Why Big Enterprises are Interested in NoSQL - Jon Moore, Comcast
15.20  - Coffee break and mingle
15.40  - Memory as the New Disk: Why Redis Rocks - Tim Lossen, Wooga
15.55  - Tokyo Cabinet, Tokyo Tyrant and Kyoto Cabinet - Makoto Inoue
16.10  - Thomas Kuhn Predicted the Fate of the Relational Database - Neil Robbins
16.25  - Notes from the field: NoSQL tools in Production - Matthew Ford
16.40  - Coffee break and mingle
17.00  - Panel debate - Moderated by James Governor, RedMonk

Thursday April 22
08.30  - Registration, Coffee and Mingle
09.00  - Morning workshops - Choose between:
 - MongoDB - Mathias Stearn, 10gen
 - Riak - Bryan Fink, Basho
12.30  - Lunch break and mingle
13.30  - Afternoon workshops - Choose between:
 - Redis - Simon Willison, The Guardian
 - Neo4j - Emil Eifrem, Neo Technology
17.00  - Thank you and see you next year!

Tuesday, April 13, 2010

NoSQL interview with the hstack.org folks

Another interesting interview with the hstack.org folks, this time from a NoSQL blog.

Friday, April 9, 2010

Great post on HBase and Adobe

Cosmin Lehene wrote a great 2-part post about his team's experience with HBase. Here are links to part 1 and part 2. I hear one of his partners-in-crime, Andrei, is working on another interesting post on performance testing related to this work.

FastFlow Parallel Programming Framework

I’ve been looking into Intel’s Threading Building Blocks during the early morning hours here in Bucharest (jet-lag) and ran across an interesting library that provides non-blocking, lock-free, wait-free synchronization mechanisms.

Check out this tutorial page with small code snippets and some sample pipelines/farms:

http://calvados.di.unipi.it/dokuwiki/doku.php?id=ffnamespace:usermanual

Here are some background links:

http://en.wikipedia.org/wiki/Fastflow_%28Computer_Science%29
http://calvados.di.unipi.it/dokuwiki/doku.php?id=ffnamespace:about

From the fastflow page:

“FastFlow is a parallel programming framework for multi-core platforms based upon non-blocking lock-free/fence-free synchronization mechanisms. The framework is composed of a stack of layers that progressively abstracts out the programming of shared-memory parallel applications. The goal of the stack is twofold: to ease the development of applications and make them very fast and scalable. FastFlow is particularly targeted to the development of streaming applications.”

From Wikipedia:

“Fastflow is implemented as a template library that offers a set of low-level mechanisms to support low-latency and high-bandwidth data flows in a network of threads running on a cache-coherent multi-core.[1] On these architectures, the key performance issues concern memory fences, which are required to keep the various caches coherent. Fastflow provides the programmer with two basic mechanisms: efficient communication channels and a memory allocator. Communication channels, as typical is in streaming applications, are unidirectional and asynchronous. They are implemented via lock-free (and memory fence-free) Multiple-Producer-Multiple-Consumer (MPMC) queues. The memory allocator is built on top of these queues, thus taking advantage of their efficiency.”
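FastFlow itself is a C++ template library, but the pipeline-of-nodes idea it describes is easy to sketch. Here's a rough Python analogue of a three-stage streaming pipeline (producer, worker, collector) connected by queues. Note this is only an illustration of the structure, not of FastFlow's performance model: Python's queue.Queue uses locks, unlike FastFlow's lock-free, fence-free channels.

```python
# Rough sketch of a FastFlow-style streaming pipeline: a producer streams
# items through a worker stage to a collector, each stage on its own thread.
# FastFlow connects stages with lock-free queues; queue.Queue locks instead.
import queue
import threading

STOP = object()  # end-of-stream marker, analogous to FastFlow's EOS

def producer(out_q, n):
    for i in range(n):
        out_q.put(i)
    out_q.put(STOP)

def worker(in_q, out_q):
    while True:
        item = in_q.get()
        if item is STOP:
            out_q.put(STOP)  # propagate end-of-stream downstream
            return
        out_q.put(item * item)  # the per-item "service" step: square it

def collector(in_q, results):
    while True:
        item = in_q.get()
        if item is STOP:
            return
        results.append(item)

def run_pipeline(n):
    q1, q2, results = queue.Queue(), queue.Queue(), []
    stages = [
        threading.Thread(target=producer, args=(q1, n)),
        threading.Thread(target=worker, args=(q1, q2)),
        threading.Thread(target=collector, args=(q2, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results

print(run_pipeline(5))  # [0, 1, 4, 9, 16]
```

Because each channel is FIFO and there's a single worker, ordering is preserved end to end; FastFlow's farm pattern adds multiple workers between an emitter and a collector for parallelism.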

AMD 12-core Opteron versus 6-core Xeon

I'd like to have seen a larger set of tests thrown at this one, but you have to love all the auto-enthusiast references in this anandtech.com review of the new 12-core Opteron versus the newer 6-core Xeon.

That's two 6-core Istanbul chips bolted together. It reminds me a bit of the Pentium D, but with a much larger cache coherency problem (imagine how much of a problem this is going to be as we keep adding cores to chips).

New WD Velociraptor VR200M

WD has released their next-generation VelociRaptor (10K RPM, 2.5" disk). It has a new 6Gbps interface and 600GB of space. There's an interesting review comparing this disk against a couple of non-enterprise SSDs here.

SuperMicro 24 core motherboard

Speaking of 24-core motherboards with loads of RAM, I ran across this new SuperMicro motherboard the other day while doing some research. It's truly terrifying how many cores and how much RAM you can toss into one box now.

Assuming one core is dedicated to a Dom0, you could have 23 VMs, each with a dedicated core and over 8GB of RAM if you install all 192GB of RAM.
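The back-of-the-envelope math behind that claim, assuming one core is reserved for the Dom0:

```python
# VM sizing for a 24-core box with 192GB of RAM, one core reserved for Dom0.
cores, ram_gb, dom0_cores = 24, 192, 1
guest_vms = cores - dom0_cores          # 23 single-core guest VMs
ram_per_vm = ram_gb / guest_vms         # ~8.3GB of RAM per VM
print(guest_vms, round(ram_per_vm, 1))  # 23 8.3
```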

Here are some specs from the link above:
  1. Quad Intel® 64-bit Xeon® MP Support 1066 MHz FSB
  2. Intel® 7300 (Clarksboro) Chipset
  3. Up to 192GB DDR2 ECC FB-DIMM (Fully Buffered DIMM)
  4. Intel® 82575EB Dual-port Gigabit Ethernet Controller
  5. LSI 1068e Dual Channel 8-Port SAS Controller
  6. 6x SATA (3 Gbps) Ports via ESB2 Controller
  7. 1x PCI-e x8 (in x16 slot), 1x PCI-e x8 (in x8 slot), 1x PCI-e x4 (in x8 slot) & 1x 64-bit 133MHz PCI-X
  8. ATI ES1000 Graphics with 32MB video memory
  9. IPMI 2.0 (SIMSO) Slot 


OCZ PCI-e SSD with field-replaceable MLC NAND

OCZ is ready to mass produce its PCI-e SSDs with field-replaceable MLC NAND flash modules.

This makes the MLC versus SLC debate a bit moot if you can just replace the NAND when it wears out, like a bad disk. Did I mention that it has 8 separate Indilinx controllers, up to 2TB of space, and peak transfer rates of 1.4GB/s for both reads and writes (that’s gigabytes, not gigabits)? I can’t imagine what will happen with a SandForce controller version of one of these monsters.

This is some seriously interesting temporary storage for a virtualization cluster that needs fast DAS. With 2TB, you could carve out roughly 87 gigabytes for each of 23 VMs on a 24-core virtualization box. That’s mighty interesting.
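The same VM-sizing math as the SuperMicro post above, applied to the 2TB card (using 1TB = 1000GB):

```python
# Carving a 2TB PCI-e SSD across 23 VMs (one core of a 24-core
# virtualization box reserved for Dom0, as in the earlier post).
total_gb = 2000
vms = 23
gb_per_vm = total_gb / vms  # ~86.96GB, i.e. the ~87GB figure above
print(round(gb_per_vm))  # 87
```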