We’ve now had two outages in six days. The first incident was a small one and our fault: our API went down for Marketo users sending us live test leads. Because it sat between the web team, product, and customer success, no one owned it and no one had a monitor on it. The second incident was our identity vendor’s major outage, which took our login capabilities down with it.
This has brought our Incident (or Live Site Incident) and Postmortem process into sharp focus. Our CMO, Adam, wrote an awesome post about blameless postmortem cultures.
It has also made it crystal clear to me that communicating an outage, and receiving that communication, both run contrary to human nature. The plain truth is that #failshappen every day. At the pace at which we are innovating – as Stack Moxie or as the business owners of a techstack – we are destined to miss something. The best we can hope to do is catch it early, communicate it quickly, and learn from it.
Build Your Own Live Site Incident Process
As a part of our dedicated onboarding process for new large and enterprise customers, we do a session that walks them through building their own Live Site Incident process. When you start testing and monitoring, you find more things wrong. Most of them are so minor they don’t warrant communication. But sometimes, customers or prospects are impacted.
When something big goes wrong, everyone will have a different reaction in the stress of the moment. And very few people will respond as you (or they) would predict. Denial and finger-pointing are incredibly common knee-jerk reactions – and that is totally natural.
Having a playbook is critical – kind of like muscle memory for #fails. The trick is having a procedure that removes blame, denial, and finger-pointing, and that drives to root cause, communication, and remediation. (FYI – if someone is ducking blame or pointing fingers, getting to root cause is impossible.)
Report Your Incident to Find a Resolution
The impact of reporting isn’t obvious, and it isn’t small. It scales with how many people are affected. Across a small team, the cost of an unreported incident is minor: if two people find the error independently, it is probably just a few hours wasted. Across a company, if an outage affects multiple teams, the cost is greater and so is the loss of trust. If it is across your entire customer base, the impact is substantial. During the Auth0 outage yesterday, there was so much confusion that Auth0’s status page went down. For the first two hours, they said they were “monitoring intermittent errors” – and users took to Twitter to complain of gaslighting.
That is exactly how your stakeholders feel – whether internal or external – when there is something broken and no one acts or takes responsibility. Rapid and consistent incident identification and communication – even without rapid resolution – solves those problems.
There is an incredible benefit to publishing a process, following it, and iterating on it. Especially in a corporate environment, eventually someone will take issue with something that was executed in a crisis. With a published process, the conversation turns from pointing fingers to making edits to the published approach. “Why wasn’t I told earlier?” is answered simply with “Here are the notification levels we published eight months ago. Would you like to be switched to the distribution list that is notified more frequently instead?”
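To make "published notification levels" concrete, here is a minimal sketch of how they could be written down as data instead of tribal knowledge. Everything in it is an assumption for illustration: the three scopes, the distribution list names (`lsi-team`, `lsi-company`, `lsi-exec`), and the update intervals are hypothetical, not our actual template.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Scope(IntEnum):
    """How far the incident reaches (hypothetical levels)."""
    TEAM = 1        # one small team is affected
    COMPANY = 2     # multiple internal teams are affected
    CUSTOMERS = 3   # customers or prospects are affected


@dataclass(frozen=True)
class NotificationLevel:
    scope: Scope
    distribution_lists: tuple[str, ...]  # hypothetical list names
    update_every_minutes: int            # hypothetical update cadence


# The "published" table: wider scope means more lists and faster updates.
PUBLISHED_LEVELS = {
    Scope.TEAM: NotificationLevel(Scope.TEAM, ("lsi-team",), 240),
    Scope.COMPANY: NotificationLevel(
        Scope.COMPANY, ("lsi-team", "lsi-company"), 60
    ),
    Scope.CUSTOMERS: NotificationLevel(
        Scope.CUSTOMERS, ("lsi-team", "lsi-company", "lsi-exec"), 30
    ),
}


def recipients_for(scope: Scope) -> tuple[str, ...]:
    """Return the distribution lists to notify for an incident of this scope."""
    return PUBLISHED_LEVELS[scope].distribution_lists
```

With something like this on file, moving a stakeholder to a more frequently notified list is a one-line edit to the table, not an argument about who should have known what.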
3 Tips for Better Internal Communication
1. Fast is better than pretty or perfect. The marketer in me hates this. But that’s why our template is an Excel spreadsheet that is designed to be screenshotted and sent out. The template in our email tool has an iPhone-style disclaimer at the bottom – **Please excuse any typos or spelling mistakes. These notifications are sent as quickly as possible to keep you informed.**
2. Start small. We start from one of two LSI templates. One is simple. Anyone can get value out of it. The second is a lot heavier. It’s based on the template I built when working at Microsoft Office building a “Center of Excellence” for marketing systems. Get something committed and “published.” With each incident, you can make incremental improvements.
3. Every failure is an opportunity. Especially for managers early in their career, remember that calm consistency in the face of a challenge always gets noticed. Even if your leadership doesn’t buy into the process or invest in building the template ahead of time, do it for yourself and have it on hand. As the role gets bigger, and the audience wider, invest in executive-level communication techniques for when you lead a retrospective or meet with executives for cross-functional support. We’ve got a great template for communicating live site incidents here.