Most people have heard the saying, “The shoemaker’s children go barefoot.” It certainly holds true that people often don’t practice what they preach. Everyone has heard the saying. Everyone has seen it happen. Yet nobody thinks it will happen to them. It will.
Most of you know me as the Director of Partnerships for Stack Moxie, but we are a startup and, at the moment, we are all wearing many different hats. In addition to running our partner program, I am also building our project management operations function, which includes incident escalation and management.
I got the call at 2:21 pm. Our API went down. Of course these things happen to every company, but we give our clients the tools to prepare for when this happens. By 2:43, engineering had everything up and running. Problem solved… 22 minutes down isn’t ideal, but I was proud of our engineering team for fixing it so quickly. Except… the API went down at 7:42 pm THE DAY BEFORE, so we had actually been down for about 19 hours. Details, details.
For those of you who are familiar with Stack Moxie, I know you are seeing the irony that a simple test, THE TEST THAT WE PROVIDE TO OUR CUSTOMERS FOR THIS VERY REASON, could have alerted us within a minute of our API going down.
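To make "a simple test" concrete, here is a minimal sketch of an uptime probe. It is not how Stack Moxie's product works; it is a generic illustration using only the Python standard library, and the function name `api_is_up`, the `opener` parameter, and the example URL are all assumptions made for this sketch.

```python
import urllib.error
import urllib.request


def api_is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx status within `timeout` seconds.

    `opener` is a hypothetical injection point so the probe can be exercised
    without touching the network; by default it is urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError, OSError):
        # Covers HTTP error statuses, refused connections, DNS failures,
        # and timeouts: all of them mean "treat the API as down".
        return False


if __name__ == "__main__":
    # Hypothetical health endpoint; replace with a real one.
    if not api_is_up("https://api.example.com/health"):
        print("ALERT: API is not responding")
```

Run from a scheduler once a minute, a probe like this would have flagged the outage the evening it started instead of the next afternoon.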
Lucky for me, we had a remediation plan and a communications plan ready to go, but it had been six months since the last incident. We knew exactly what needed to be done, but none of it was automated. Who owned each task? Should we notify all of our clients or just those that were affected?
After a brief conference call, we sent an incident report to the customers who were affected. Even on a technical team used to this incident reporting process, it was crazy to see that almost everyone hesitated to publish all the gory details. It doesn’t feel good for anyone to admit we left our customers unmonitored overnight.
Rather than focusing on WHO didn’t set up tests or who didn’t fine-tune our remediation plan, we concentrated on WHAT needs to be done in the future. The first order of business today is to get tests set up, decide how often to run them, and decide who should be notified when things fail. Basically, it is time to onboard Stack Moxie with Stack Moxie monitoring.
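Those three decisions (which tests, how often, who gets notified) fit in a very small data structure. The sketch below is a hypothetical illustration, not Stack Moxie's configuration format: `Monitor`, `run_due`, the `checks` mapping, and the `send` callback are all names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Monitor:
    name: str                      # which test to run
    interval_seconds: int          # how often to run it
    notify: List[str] = field(default_factory=list)  # who hears about failures
    last_run: Optional[float] = None

    def is_due(self, now: float) -> bool:
        """A monitor is due if it has never run or its interval has elapsed."""
        return self.last_run is None or now - self.last_run >= self.interval_seconds


def run_due(
    monitors: List[Monitor],
    checks: Dict[str, Callable[[], bool]],
    now: float,
    send: Callable[[str, str], None],
) -> List[str]:
    """Run every due monitor and notify its recipients about each failure.

    `checks` maps a monitor name to a function returning True when healthy.
    Returns the names of the monitors that failed on this pass.
    """
    failed = []
    for monitor in monitors:
        if not monitor.is_due(now):
            continue
        monitor.last_run = now
        if not checks[monitor.name]():
            failed.append(monitor.name)
            for recipient in monitor.notify:
                send(recipient, f"{monitor.name} failed")
    return failed
```

Writing the owner list next to the test itself answers the "who owned each task?" question before the next incident rather than during it.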
Failures WILL HAPPEN AGAIN. But not this fail. We’ve got the testing and monitoring in place now. We also took extra steps to automate our incident resolution process (we use Teamwork for project management, and they’ve got a cool template function we are using). And we got the right logo and name on the incident report :)
Take it from me: if “monitoring your tech stack” is one of the MANY things on your to-do list, at least get your biggest form or landing page monitored today. I mean, it is free. You can avoid that moment when you start investigating and uncover that the problem has been going on for months and impacting real people. No one needs that fire drill.
Now, back to fine-tuning that remediation plan…