This is the third and final post related to “Tips for Handling Events, Incidents, Outages, and Maintenance”.
System maintenance is a more relaxed version of the situations covered in the previous posts, but there’s still a high probability that you will cause an incident during your maintenance. The difference is that you actually get to plan for it. System upgrades and maintenance can be very complicated and involve multiple teams doing many things in parallel. I’ve seen several maintenance periods where there wasn’t a clear plan and people were working on things before they were asked to, causing all sorts of problems (imagine your network team killing the firewall before you diverted services to another cluster). This is never a pretty situation.
As in the previous parts of this post, I highly recommend a second person and an ATC-like communication structure here to save time. I’ve probably taken this to an extreme, but I’ve gone with the flying/pilot/co-pilot metaphor:
Develop a flight-plan
Write down exactly what the goal of the maintenance is, what you’re going to do, and when you’re going to do it (down to the five-minute slot in which each task will occur). This helps uncover any gaps in your maintenance (like forgetting to let existing connections finish before restarting a server).
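As a rough illustration, a flight plan can be written down as data rather than prose, which makes gaps and overlaps easier to spot. This is only a sketch: the `Step` fields, the team names, and the times are all made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Step:
    owner: str         # team or person performing the step (hypothetical names)
    action: str        # what they are going to do
    offset_min: int    # minutes after the start of the maintenance window
    duration_min: int  # expected duration in minutes


def schedule(start, steps):
    """Return (begin, end, step) tuples ordered by start time."""
    return [
        (start + timedelta(minutes=s.offset_min),
         start + timedelta(minutes=s.offset_min + s.duration_min),
         s)
        for s in sorted(steps, key=lambda s: s.offset_min)
    ]


def find_overlaps(plan):
    """Return (earlier, later) action pairs where consecutive steps overlap."""
    return [
        (a.action, b.action)
        for (_, a_end, a), (b_start, _, b) in zip(plan, plan[1:])
        if b_start < a_end
    ]


# Hypothetical plan for a window starting at 02:00
steps = [
    Step("app", "Drain existing connections", 0, 10),
    Step("network", "Divert traffic to cluster B", 10, 5),
    Step("storage", "Cut over to the new cluster", 15, 20),
]
plan = schedule(datetime(2024, 1, 1, 2, 0), steps)
for begin, end, step in plan:
    print(f"{begin:%H:%M}-{end:%H:%M}  {step.owner:<8} {step.action}")
```

An overlap is not necessarily a mistake if you intend to run steps in parallel, but it should be a decision you made rather than one you discover during the maintenance.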
Plan for Bingo
“Bingo” is the point in a flight where you have just enough fuel to get back to the airport. Pick a time by which your maintenance has to be done or you roll back. This is mostly for disruptive maintenance, but it’s also useful as part of a big upgrade. If you haven’t gotten it right by that time, you have to ask whether it’s better to roll back or whether it’s too dangerous to roll back. The co-pilot role is responsible for enforcing “bingo”.
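One way to pick the bingo time, assuming you know when the maintenance window closes and roughly how long a rollback takes, is to subtract the rollback estimate from the end of the window. The function names and numbers below are illustrative only.

```python
from datetime import datetime, timedelta


def bingo_time(window_end, rollback_minutes):
    """Latest moment a rollback can start and still finish inside the window."""
    return window_end - timedelta(minutes=rollback_minutes)


def past_bingo(now, window_end, rollback_minutes):
    """True once the co-pilot should call for the roll-back decision."""
    return now >= bingo_time(window_end, rollback_minutes)


# Hypothetical: the window closes at 04:00 and a rollback takes about 45 minutes
window_end = datetime(2024, 1, 1, 4, 0)
print(bingo_time(window_end, 45))
```

In practice you would probably pad the rollback estimate, since rollbacks rarely go faster than planned.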
Do a Preflight
Review the maintenance at least one day prior. Everyone should be clear on the order of events and their role.
Make sure you’re all at the same airport
In this case, make sure you document which data center and which cluster you’re working on. Don’t assume. Have everyone confirm that they are in a shell in the correct data center and cluster.
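The confirmation itself can be as simple as comparing what each participant reports against what the plan says. A minimal sketch, with invented participant, data center, and cluster names:

```python
def confirm_location(expected, reported):
    """Return the participants whose reported location differs from the plan."""
    return sorted(name for name, location in reported.items() if location != expected)


# Hypothetical values: (data center, cluster) as written in the flight plan
expected = ("dc-east", "cluster-b")
reported = {
    "pilot": ("dc-east", "cluster-b"),
    "storage": ("dc-east", "cluster-a"),  # wrong cluster
    "network": ("dc-east", "cluster-b"),
}

mismatches = confirm_location(expected, reported)
if mismatches:
    print("Do not start. Wrong location:", ", ".join(mismatches))
```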
Start the clock and the announcements so the co-pilot can track progress against the plan and send updates.
Read each line of the flight-plan before asking someone to do something. People can jump ahead and accidentally do a step out of order. For complicated maintenance, you need to coordinate each activity.
Consider a single control point if your maintenance is complicated with lots of people and coordination. In this model, only one person has the airplane at a time. This prevents multiple people from doing maintenance that interferes with each other. For example:
Pilot: “Storage, please cut over to the new cluster. You have control.”
Storage Team: “Cutting over to the new cluster. I have control.”
[The pilot then does the normal five-minute updates as with an outage, or refers to the expected time on the flight plan and asks the storage team for an update if that time expires]
[Later, when the storage person is done]
Storage Team: “Storage cutover complete. Returning control.”
Pilot: “I have control. Network team, ...”
This is overkill for simple upgrades but, with complicated upgrades, I’ve seen this reduce confusion and prevent problems. In the example above, nobody is allowed to do any other type of maintenance while the storage team has control (think of it as a critical section for maintenance).
On the other hand, sometimes you just don’t have this luxury due to time constraints and you need to run things in parallel. There’s nothing wrong with running things in parallel as long as you’ve planned appropriately and understand the risks.
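For the single-control-point model, the “critical section” idea can be sketched as a token that only one party holds at a time. This is an illustration of the handoff protocol, not a real tool; the class and method names are made up.

```python
class ControlToken:
    """Tracks who currently 'has the airplane' during maintenance."""

    def __init__(self, pilot):
        self.pilot = pilot
        self.holder = pilot
        self.log = []

    def hand_to(self, team):
        # The pilot can only hand over control that the pilot currently holds
        if self.holder != self.pilot:
            raise RuntimeError(f"{self.holder} still has control")
        self.holder = team
        self.log.append(f"{team}: I have control")

    def give_back(self, team):
        # Only the current holder can return control to the pilot
        if self.holder != team:
            raise RuntimeError(f"{team} does not have control")
        self.holder = self.pilot
        self.log.append(f"{team}: returning control")

    def may_act(self, team):
        """Check this before making any change."""
        return self.holder == team


token = ControlToken("pilot")
token.hand_to("storage")
print(token.may_act("storage"), token.may_act("network"))
token.give_back("storage")
token.hand_to("network")
print("\n".join(token.log))
```

If you run steps in parallel instead, the same log of who is doing what is still worth keeping, even without the exclusivity check.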
Follow a Landing Checklist
Always follow the appropriate checklist for restarting a service. This can often get skipped after maintenance if you’ve had an interesting upgrade. But, there’s nothing worse than forgetting to turn off a “maintenance” page and getting woken up hours later to find out that you caused multiple hours of downtime by missing this simple check.
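A landing checklist is easy to turn into something that refuses to be skipped. The sketch below runs every check even after one fails and reports what is still outstanding. The `state` dictionary stands in for real probes (an HTTP request, a service status query), which you would have to supply yourself.

```python
def run_checklist(checks):
    """Run every check, even after a failure, and return the names that failed."""
    failed = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that cannot run counts as a failure
        if not ok:
            failed.append(name)
    return failed


# Hypothetical stand-in for real probes against your systems
state = {"service_running": True, "maintenance_page": True}

checks = [
    ("service is running", lambda: state["service_running"]),
    ("maintenance page is off", lambda: not state["maintenance_page"]),
]

failed = run_checklist(checks)
if failed:
    print("Not landed yet:", ", ".join(failed))
```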
Finally, if your flight-plan is complicated, automate it! Run scripts instead of doing many manual steps.