Blameless postmortems: Creating an honest and open culture
By: Vincent Oberle, Tech Lead at TransferWise
If you asked me to name one of the most important pieces of our engineering culture, one that has been with us from the time we were 10 developers until now, when we’re over 400, I’d probably answer our open culture and blameless postmortems. It’s safe to say that during my 7 years here, I’ve seen plenty of times when things have gone wrong.
When you build something like TransferWise, a global product built at such a fast pace, “screw-ups” (or incidents, as we now call them) are bound to happen. While screw-ups are normal, it’s crucial to learn from them to avoid repeating them. To help us learn from our mistakes, we write what we call postmortems: deep-dive reflections on exactly what happened, how we fixed it and how we can avoid the same thing happening in the future. In this post I want to share a bit more about how this works.
What’s an incident at TransferWise? There are plenty of variants. The most obvious one is introducing a bug in production that requires an emergency release, or a rollback to the previous release. Sometimes incidents are caused by a problem at one of our partners. Some incidents have a direct impact on our customers, while others affect only customer support, for example. An incident may also be unrelated to any code, such as a regulatory change that forces us to close a currency. The list of possible incidents is endless.
After an incident, we share a postmortem with the rest of the team. In it we explain:
- What was the impact (e.g. number of customers impacted);
- The timeline of what happened;
- What was the root cause;
- How we detected and fixed it, and how long it took;
- And which steps we will take to avoid such problems in the future.
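To make the list above concrete, here is a minimal sketch of how such a write-up could be modelled in Python. It is purely illustrative: the class names, field names and helper methods are assumptions made for this example rather than an actual template, and the sample incident is made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TimelineEntry:
    at: datetime
    note: str


@dataclass
class Postmortem:
    # Hypothetical fields mirroring the five points listed above.
    title: str
    impact: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    root_cause: str = ""
    detection_and_fix: str = ""
    action_items: list[str] = field(default_factory=list)

    def time_to_resolve(self) -> Optional[timedelta]:
        """Span between the earliest and latest timeline entries, if there are at least two."""
        if len(self.timeline) < 2:
            return None
        times = [entry.at for entry in self.timeline]
        return max(times) - min(times)

    def missing_sections(self) -> list[str]:
        """Names of the sections that are still empty."""
        sections = {
            "impact": self.impact,
            "timeline": self.timeline,
            "root_cause": self.root_cause,
            "detection_and_fix": self.detection_and_fix,
            "action_items": self.action_items,
        }
        return [name for name, value in sections.items() if not value]


# Made-up example data, for illustration only.
draft = Postmortem(
    title="Example incident",
    impact="Example: some customers could not complete a transfer",
    timeline=[
        TimelineEntry(datetime(2020, 1, 1, 10, 0), "Bug released to production"),
        TimelineEntry(datetime(2020, 1, 1, 10, 20), "Alert fired"),
        TimelineEntry(datetime(2020, 1, 1, 10, 45), "Rollback completed"),
    ],
    root_cause="Example: missing validation",
)
print(draft.time_to_resolve())   # 0:45:00
print(draft.missing_sections())  # ['detection_and_fix', 'action_items']
```

A check like `missing_sections` is one way to nudge authors to fill in every part before sharing, including the action points.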
Depending on the type of incident, we share the postmortem write-up with the relevant teams (or sometimes even the whole company!) so they can all learn from the case too. This helps us stay transparent with one another.
Postmortems help us to learn from our mistakes
These postmortems serve a very important purpose: we want the whole team to learn and improve, and not make the same mistakes again.
For this to work, it’s essential for the postmortems to be blameless. Screw-ups are expected to happen. In fact, if they don’t, it means we’re not taking enough risks and are moving too slowly. We can’t afford that.
The key thing is that nobody here will blame you for an incident. It’s actually almost the opposite: A well-written postmortem with good action points will get you praise.
Postmortems also help us create an open environment where we’re honest with one another. If the postmortem doesn’t get to the point, you’ll be challenged. For example, sometimes there are “how we will prevent it from happening again” actions that the team will obviously never carry out, because they would take too much time for too little benefit. We must be honest with ourselves: If an incident cannot be fully prevented from happening again, this should be acknowledged.
In addition to encouraging people to take risks, the blameless aspect is a must-have for the transparency that our autonomous team culture requires. When people are afraid of being blamed, they end up hiding problems, at the risk of creating even bigger ones. This then creates a toxic culture.
I think the focus we put these days on monitoring and alerting (or, more generally, what we call observability) also comes from all these years of diligently writing postmortems. They kept reminding us of the importance of being alerted to issues early, and made us invest hugely in observability.
In addition to all these benefits, I personally find writing a postmortem to have a cathartic effect: It’s natural to feel bad when something has gone wrong, and the postmortem helps to get those emotions out and move on.
And for engineers, they’re great writing practice. You need to succinctly explain a complex situation to people who don’t have all the context you have. It’s a nice opportunity to train those all-important writing skills.
It’s things like blameless postmortems that make TransferWise a great place to work for old-timers and new-joiners alike.
P.S. Interested in working with us? We’re hiring! Check out our open Engineering roles here.