Okay, so when I looked at Google News this morning, this story was in the “For you…based on your interests” section: Gremlin Brings Chaos Monkey Testing to Spinnaker CD Platform. Now, I’m a respecter of DevOps, but I’m not really a practitioner of DevOps. So I think this is cool, but the coolest thing about it is that my amazing wife invented Chaos Monkey. And now DevOps people all over the place are using it and are excited about it, and it’s really quite valuable. This, right here, is another reason that tech needs women.
Have I told the story of Chaos Monkey before? Maybe, but I’m old so I get to repeat myself. Remember, Netflix moved to the Amazon cloud because it had experienced some severe, headline-making service outages. It was an economic decision, sure, but not of the “AWS is cheaper than a data center” kind; rather, it was of the “We need to be super resilient to hardware failure” kind. When Netflix started moving from a data center to the Amazon cloud, all of engineering received the mandate: move your systems to the cloud. The website needed to be operational and customers needed to be able to perform all their normal operations even if the data center (the source of truth) went away for a few days. I worked on the movie metadata team, and our brief essentially boiled down to all the movie data (title, synopsis, rating, cast and crew, etc.) and the inventory data (available formats and inventory levels).
There was this big meeting of people from all the engineering teams wherein we received this mandate and we were given a timeline as well as a set of conditions for the cloud. This was new territory for all of us. At the time, AWS had really only just started up as a public service and none of us really understood what the cloud environment meant. We were told:
- individual servers are transient
- spinning up a new server takes a couple of minutes but is otherwise cheap
- any server can be terminated at any time for no reason
This was a lot for me to absorb, given that I had lots of interaction with an Oracle database that was specifically neither transient nor cheap. That night I went home and talked about it with my wife. I said that there was some concern that people would build their systems and assume that they were resilient by virtue of being in this cloud thing, but that there would be some hidden dependency that would still cause problems. She said that what we needed was a monkey running around and randomly unplugging stuff; if the systems were designed properly, it wouldn’t make any difference.
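For the curious, here is a toy sketch of that idea in Python. To be clear, this is not Netflix's actual Chaos Monkey and it touches no real cloud API; the `Cluster` class, the `chaos_monkey` function, and the server names are all made up for illustration. It just models the three conditions above: servers are transient, replacements are cheap, and anything can be terminated at any time.

```python
import random


class Cluster:
    """A toy pool of interchangeable servers (every name here is hypothetical)."""

    def __init__(self, size):
        self.target_size = size
        self.next_id = 0
        self.servers = set()
        self.heal()

    def heal(self):
        # Replacements are cheap: top the pool back up to its target size.
        while len(self.servers) < self.target_size:
            self.servers.add(f"server-{self.next_id}")
            self.next_id += 1

    def terminate(self, name):
        # Any server can disappear at any time, for no reason.
        self.servers.discard(name)

    def handle_request(self, rng):
        # A request succeeds as long as some server is left to answer it.
        if not self.servers:
            raise RuntimeError("no servers available")
        return rng.choice(sorted(self.servers))


def chaos_monkey(cluster, rng):
    """Unplug one randomly chosen server and return its name."""
    victim = rng.choice(sorted(cluster.servers))
    cluster.terminate(victim)
    return victim


def run_experiment(rounds=10, size=3, seed=0):
    """Repeatedly kill a server, check that requests still work, then heal."""
    rng = random.Random(seed)
    cluster = Cluster(size)
    killed = []
    for _ in range(rounds):
        killed.append(chaos_monkey(cluster, rng))
        cluster.handle_request(rng)  # should still succeed with a server down
        cluster.heal()
    return killed, cluster
```

With three servers, `run_experiment()` shrugs off every kill. With `size=1`, the very first kill raises `RuntimeError`, which is the kind of hidden single point of failure the monkey is supposed to flush out.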
At the next big cross-team meeting, this question of demonstrating resiliency came up again and I said, yeah, we need a chaos monkey with a gun, randomly putting bullets in servers. Everyone loved it. (This right here shows how Lise is a better person than I am. She suggested a playful imp and I framed it as a frightening existential threat.)
When I retired from Netflix in 2011, one of the guys gave me a Chaos Monkey. It’s the lovely parting gift I treasure most, because it reminds me that the really good ideas and the really important developments always involve Lise.
I had *not* heard that story – it’s awesome! I want to see a picture of the Chaos Monkey!