It’s the thing everyone hates to hear the most — the system went down.
In our case, one of our APIs went offline for several hours overnight last week, affecting a percentage of our clients. Once we realized the service was offline, we fixed it within 22 minutes.
Luckily we had a plan already in place and the team leapt into action. We sent a notification to our user community about the outage, updated the website, and shared a blog post about the experience. It was great to know that we already had a plan for what to do, but to make this a learning moment for the team, we needed to conclude the experience with a post-mortem review.
A lot has been written about creating blameless postmortems, including a blog post from Stack Moxie’s CEO. One of the challenges with traditional post-mortems is that, at the end of the day, someone did something wrong and someone fixed it. So a review of what happened ends up feeling like finger pointing, not learning.
To create a blameless meeting, start by removing the personal elements from the explanation of why a failure happened. It wasn’t “Jim didn’t update the APIs properly.” It was “The API URLs were improperly configured in the DNS server, and no test validated the configuration.”
By abstracting the individual from the action, we can focus on the desired outcome rather than causing a defensive posture.
Our meeting started with a reminder about being blameless: what it means to not point fingers and to focus on the facts. The moderator should set this tone at the start of every postmortem to refresh everyone’s memory. Strive to not make failures personal, even if they were your own shortcomings. Don’t say “I failed”; discuss the failure itself instead.
Our moderator set the tone with the three key topics to be discussed.
- What happened
- What went right
- What to do to keep it from happening again
In our case, the challenge boiled down to a lack of coordination with an external vendor. It was an easy fix: plan a meeting to roll out critical changes rather than handling them asynchronously. We were proud of what had gone right, most notably our proactive and immediate communication and notification to our customer community.
But the team didn’t stop with just what had happened and what went right. We spent the majority of the postmortem brainstorming ways to prevent the failure from happening again.
Our largest takeaway was a shared definition of “what is done”. When you launch a new feature or marketing campaign, when is it truly done? We changed our internal definition of done to be “launched a monitor for ongoing testing”. We agreed that nothing is actually complete until the system proactively notifies us of any service outages.
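To make that definition of done concrete, here is a minimal sketch of what a "monitor for ongoing testing" could look like. It is not our production monitoring: the names `check_endpoint`, `run_monitor`, and `notify`, the URLs, and the rule that any non-2xx response counts as an outage are all assumptions made for illustration.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CheckResult:
    url: str
    healthy: bool
    detail: str


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is still useful.
        return err.code


def check_endpoint(
    url: str, fetch_status: Callable[[str], int] = http_status
) -> CheckResult:
    """Check one endpoint (assumption: only 2xx responses count as healthy)."""
    try:
        status = fetch_status(url)
    except OSError as err:
        # DNS failures, refused connections, and timeouts are OSError subclasses.
        return CheckResult(url, False, f"unreachable: {err}")
    return CheckResult(url, 200 <= status < 300, f"HTTP {status}")


def run_monitor(
    urls: Iterable[str],
    notify: Callable[[str], None],
    fetch_status: Callable[[str], int] = http_status,
) -> List[CheckResult]:
    """Check every URL once and send a notification for each failure."""
    failures = []
    for url in urls:
        result = check_endpoint(url, fetch_status)
        if not result.healthy:
            notify(f"{result.url} is failing ({result.detail})")
            failures.append(result)
    return failures
```

A scheduler (cron, for example) could call `run_monitor(["https://api.example.com/health"], notify=print)` every few minutes, with `print` swapped for whatever paging or chat hook the team already uses. Passing `fetch_status` as a parameter keeps the check testable without touching the network.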
We spent a few minutes discussing a newly created application monitoring status page. By the end of the meeting we had launched a new status page so that anyone could see any future outages and global latency times. It will also be the place that we house our post-mortem recaps so that everyone can transparently see what happened and how it was solved.
We finished up by documenting the postmortem notes in our incident and remediation plan so that we are continuously learning. We felt good as the meeting ended and we went on with the rest of our day. No more outages here.
Until the next day.
This time the outage was outside of our control. A popular identity vendor’s user authentication service went down. But this time we were ready. When the fire alarm went off, everyone stopped what they were doing and jumped on a coordination & triage call.
We had a notification sent to our customers lightning fast, in under 45 minutes, much faster than the vendor communicated with us. We let them know that their monitoring was still actively running and that they were just temporarily unable to log in to the platform. We followed up when services were restored to let them know all issues had been resolved. The vendor never communicated with us at all.
The postmortem for that outage is next Tuesday. Let Shea know if you’d like to attend or receive the remediation notes.
Is your incident and remediation plan up to date and accurate? If not, download ours to ensure that when an outage occurs you’ve got a plan in place for what to do. #failshappen. How you react separates the good from the great.
Everyone can see Stack Moxie’s application status by going here.