There’s an interesting article on Information Week about Netflix’s Chaos Monkey, a bit of code they recently made available to the public. Chaos Monkey is a piece of software that deliberately breaks stuff so that engineers can see what happens when things break.
It’s not a simulation or a test run on a duplicate of the real thing. There’s no reset button. They have introduced a gremlin into their machine (between the hours of 9am and 3pm) that tests their systems to the breaking point. These guys are taking one of the most critical components of their service delivery and messing with it so they can make their business better. They’re searching for problems they wouldn’t know to look for by introducing an unnecessary random element.

This takes a rather radical point of view. And balls. Brassy ones. You need to be ok with failure. You need to recognise, accept and anticipate that your work is flawed, and that those flaws are going to be exposed in ways you can’t anticipate. Most important, you need to be ready to FIX them. Not analyse them, or do a post-mortem, or an accountability diagnosis. Something is broken and needs to be repaired. Now. Worry about the who and why later.
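The core idea is simple enough to sketch: during working hours, pick a random victim from the live fleet and kill it. Here’s a minimal illustration in Python. The instance names, the `pick_victim` helper and the 9am–3pm window are assumptions for the sake of the example, not Netflix’s actual implementation (which talks to real cloud infrastructure, not a list of strings).

```python
import random
from datetime import time

# Hypothetical instance IDs; a real chaos tool would query the
# cloud provider's API for the live fleet instead.
FLEET = ["web-1", "web-2", "web-3", "api-1", "api-2"]

# The monkey only runs during working hours, so engineers are
# around to respond when something falls over.
WINDOW_START = time(9, 0)
WINDOW_END = time(15, 0)


def in_window(now):
    """Return True if `now` falls inside the chaos window."""
    return WINDOW_START <= now <= WINDOW_END


def pick_victim(fleet, now, rng=random):
    """Choose one random instance to terminate, or None outside the window."""
    if not fleet or not in_window(now):
        return None
    return rng.choice(fleet)


# Mid-morning: one unlucky instance gets picked at random.
victim = pick_victim(FLEET, time(10, 30))

# After hours: nothing is touched.
assert pick_victim(FLEET, time(17, 0)) is None
```

The point of the window is the whole philosophy in miniature: you want the failure to be random, but you want people on hand and ready to fix it when it happens.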
Think about fire. It’s scary, and unpredictable, and fast. We have a plan for when the situation arises, detection systems, prevention devices, and drills so everyone is familiar with what to do when the time comes. But we don’t set fires, disable alarms and sabotage equipment. That would be crazily irresponsible.
But maybe we should. Because it isn’t really about testing our software, our plans or our equipment. It’s about people, and equipping them to handle a crippling problem they could not anticipate on their own. And that’s not crazy at all. That’s reality.