What do we test before prod? We do our known unknowns -- does it work? (unit tests). does it fail in ways I can predict?— Liz Fong-Jones (方禮真) (@lizthegrey) June 13, 2019
We need to test our unknown unknowns in production with ✨observability✨. and experiment upon them with chaos engineering! #VelocityConf
I got inspired by Liz’s tweet recently, and I am writing this post as a reminder for everybody. “Test in prod” is a slogan, a trademark. It doesn’t explain all the concepts behind it, just as “things go better with Coke” hides the why and the how. Slogans are great as quick reminders of more articulated ideas. They are useful because a single sentence lets you recall the deeper content stored in your brain.
“You” do not test unknown unknowns in production, mainly because you do not know your unknowns. In production, you as a developer validate three kinds of things:
What “test in prod” means is that somebody, a random customer, human or not, will randomly trigger an unknown action that causes an issue. It doesn’t even need to be triggered by anyone; it can be an environmental issue. What Twitch calls “the refresh storm” is an excellent example. When a broadcaster has a connectivity issue, all the viewers start refreshing the page over and over, hoping to solve the problem. As a side effect, the Twitch infrastructure can suffer from a high number of requests. This is a non-Twitch problem that becomes a Twitch problem.
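To make the refresh storm a bit more concrete, here is a minimal sketch in Python of how one could notice such a spike: count the requests per channel inside a sliding time window and flag the channel when the count goes over a limit. This is not how Twitch does it; the class name `RefreshStormDetector`, the window size, and the threshold are all assumptions made up for illustration.

```python
from collections import deque


class RefreshStormDetector:
    """Hypothetical helper: flags a channel when the number of requests
    seen in the last `window_s` seconds is greater than `threshold`."""

    def __init__(self, window_s: float, threshold: int):
        self.window_s = window_s
        self.threshold = threshold
        self._hits = {}  # channel name -> deque of request timestamps

    def record(self, channel: str, now: float) -> bool:
        """Record one request at time `now` (seconds) and return True
        if the channel currently looks like it is in a refresh storm."""
        hits = self._hits.setdefault(channel, deque())
        hits.append(now)
        # Drop timestamps that fell out of the sliding window.
        while hits and now - hits[0] > self.window_s:
            hits.popleft()
        return len(hits) > self.threshold


# Illustrative values only: 3 requests in 10 seconds is the made-up limit.
detector = RefreshStormDetector(window_s=10.0, threshold=3)
for t in (0.0, 1.0, 2.0):
    detector.record("some-broadcaster", now=t)
storm = detector.record("some-broadcaster", now=3.0)  # fourth request in the window
```

A real system would feed this from request logs or metrics and tune the limit against normal traffic, but the idea stays the same: the signal comes from production, not from a test suite.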
We need to learn and adopt the tools and the mindset that will help us improve how fast we can track, record, fix, and learn from an issue. All the questions that matter come up in production, and as a consequence, we need to stay focused on it. I think a lot of people already test in prod in some way.
When your laptop starts but then restarts by itself after a while, you have a problem. You look around, and you notice that the fan doesn’t run anymore. It is a pretty simple issue to detect and solve: you hear that the fan doesn’t make any noise, so you replace it.
I am sorry! Everybody got distracted by distributed systems, containers, and the cloud. If you know how to design a fault-tolerant application, 90% of our failures are partial failures! They are a disaster to figure out, understand, and fix! Only a subset of the system may break, for a subset of customers, while the same part works correctly for another subgroup, and you need to figure out why! You should also be able to proactively message the affected subgroup of customers to say “I am sorry! Shit happens, we are working on it”.
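As a rough illustration of what "figuring out which subgroup is broken" can look like, here is a small Python sketch that groups request events by one attribute, computes the error rate per group, and lists the customers you would want to message. The event fields (`customer_id`, `region`, `status`) and both function names are hypothetical; your own telemetry will have a different shape.

```python
from collections import defaultdict


def error_rate_by_segment(events, key):
    """Return {segment value: share of requests that failed with a 5xx status}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        segment = event[key]
        totals[segment] += 1
        if event["status"] >= 500:
            errors[segment] += 1
    return {segment: errors[segment] / totals[segment] for segment in totals}


def affected_customers(events, key, segment):
    """Return the sorted customer ids that saw a 5xx status in the given segment."""
    return sorted(
        {e["customer_id"] for e in events
         if e[key] == segment and e["status"] >= 500}
    )


# Made-up events, only to show the shape of the data.
events = [
    {"customer_id": "c1", "region": "eu", "status": 200},
    {"customer_id": "c2", "region": "eu", "status": 503},
    {"customer_id": "c3", "region": "us", "status": 200},
    {"customer_id": "c4", "region": "us", "status": 200},
]

rates = error_rate_by_segment(events, "region")
worst = max(rates, key=rates.get)  # segment with the highest error rate
print(worst, rates[worst], affected_customers(events, "region", worst))
# prints: eu 0.5 ['c2']
```

The interesting part is that the attribute to group by is usually not known in advance. Being able to slice the same events by region, client version, plan, or anything else is what lets you find the subgroup that fails while everybody else is fine.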
“Test in prod” means all the things I wrote and probably way more! It is reasonable to say that nobody can do anything to stop “test in prod” from happening, so have fun!