Creating a Company Incident and Remediation Plan

No company wants to spend its time thinking about failures. They’re a pain to deal with, and each one represents a potential loss in revenue. So many organizations fix issues as they come up and move on as quickly as possible. While this may be the way to address failures with the least short-term stress, it ultimately leads to repeated (and avoidable) mistakes that slow you down and make your processes messier.

Instead, it’s crucial for companies to establish an incident and remediation plan that clearly defines action items to follow in the event of a breakage. This plan will help your entire company jump into action when something critical goes down and keep downtime to a minimum. The investment of effort and time into an incident and remediation plan pays off when a failure inevitably happens, and you’re able to get it fixed without major losses to your revenue or other team operations. 

Designing an Incident Plan: Priority Status

When it comes to documenting incidents within your organization, categorizing the priority status of each type of failure is an important first step that will inform the rest of your incident and remediation plan. Prioritization is generally based on the level of impact a failure has, as well as the volume of different failures that your company typically sees. The higher the priority status, the more urgent an incident should be to those responsible for a resolution.

  • P0 issues are the most urgent, signaling a system outage (e.g., your website has gone down or your system sync is failing). P0 status failures indicate that something critical has been impacted, and those issues take priority when deciding which bug to resolve first.
  • P1 issues indicate high-value impact (e.g., a Contact Us form goes down, required fields on a form are missing, or your lead routing is broken).
  • P2 issues refer to failures that impact end-users (e.g., a content download goes down, or an email has broken links).
  • P3 issues signal a performance impact (e.g., pixel outages or operational monitoring issues).

The exact definitions of each priority status and what failures fit where will depend on your organization, but the point is to establish a hierarchy that clarifies which bug to handle first in the event that multiple issues are on the table simultaneously. 
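One way to make this hierarchy concrete is to encode it so that open incidents can be sorted by urgency. The sketch below is only an illustration: the `Priority` enum, the `triage` helper, and the sample incidents are hypothetical names, and your own definitions of each level may differ.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Lower value means more urgent, mirroring the P0-P3 labels."""
    P0 = 0  # system outage
    P1 = 1  # high-value impact
    P2 = 2  # end-user impact
    P3 = 3  # performance impact


# Hypothetical open incidents as (description, priority) pairs.
open_incidents = [
    ("Email contains broken links", Priority.P2),
    ("Website is down", Priority.P0),
    ("Pixel outage", Priority.P3),
    ("Lead routing is broken", Priority.P1),
]


def triage(incidents):
    """Return the incidents ordered from most to least urgent."""
    return sorted(incidents, key=lambda item: item[1])


for description, priority in triage(open_incidents):
    print(priority.name, description)
```

With several issues on the table at once, the sorted list gives responders an unambiguous order to work through.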

Plan of Action When Something Goes Wrong

Now that you’ve determined the levels of priority status for failures in your company, these tiers can be the foundation for action items when something breaks. Deciding in advance when and whom to notify about an outage makes it easier to reach the right people without creating unnecessary chaos that could delay a fix.

  • For high-priority P0 issues, notify teams on every failure. All teams should be alerted so that everyone is ready to assist as needed, and response should be immediate to get major failures under control as soon as possible.
  • In general, P1 and P2 issues should trigger notifications only when a scheduled outcome changes, and alerts should be reserved for the teams directly related to or affected by the incident. This allows affected departments to stay in the loop and change course as needed, without pulling in other teams that do not have a role in the event.
  • Incidents categorized as P3 or lower should be addressed in weekly reports that document events not deemed critical by your organization. These reports should be accessible to anyone within the company but should most directly be relayed to the teams immediately related to the issue. 
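The notification rules above can be sketched as a small routing function. This is a minimal illustration under stated assumptions: the function name `notification_plan`, the channel and audience labels, and the `outcome_changed` flag are all hypothetical, and priority levels are passed as integers (0 for P0, 1 for P1, and so on).

```python
def notification_plan(level, outcome_changed=False):
    """Return a (channel, audience) pair, or None when no alert is due.

    level: integer priority level, where 0 means P0, 1 means P1, etc.
    outcome_changed: True if a scheduled outcome changed since the last check.
    """
    if level < 0:
        raise ValueError("priority level must be 0 or greater")
    if level == 0:
        # P0: alert everyone on every failure.
        return ("immediate_alert", "all_teams")
    if level in (1, 2):
        # P1/P2: alert only the affected teams, and only on a change.
        return ("alert", "affected_teams") if outcome_changed else None
    # P3 or lower: roll the incident into the weekly report.
    return ("weekly_report", "affected_teams")


print(notification_plan(0))
print(notification_plan(1))
print(notification_plan(2, outcome_changed=True))
print(notification_plan(3))
```

Keeping the rules in one place like this makes it easier to review who gets paged for what, and to adjust the policy as your definitions evolve.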

Regardless of the priority status, every incident category should have an assigned escalation owner. This individual or small group will be responsible for overseeing the failure and ensuring it gets resolved in a timely manner. SLAs are also a crucial component of your incident plan, aligning everyone on an acceptable timeline for issues to be identified, acted on, and fixed. For P0 failures, this will likely be immediate. P1 and P2 failures may sit somewhere in the 2 to 24 hour range, while P3 and lower-priority bugs are more likely to have a 3 to 7 day window. Setting these windows clarifies how urgent your plan of action is and gives the teams and users affected by an incident realistic expectations for when a solution will be found.
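As a rough sketch, SLA windows can be stored in a lookup table and used to compute a resolution deadline for each incident. The specific durations below are placeholder assumptions picked from within the ranges mentioned above (the 15-minute P0 value in particular is only a stand-in for "immediate"), and the names `SLA_WINDOWS`, `resolution_deadline`, and `is_breached` are hypothetical.

```python
from datetime import datetime, timedelta

# Placeholder SLA windows; substitute the values your organization agrees on.
SLA_WINDOWS = {
    "P0": timedelta(minutes=15),  # assumed stand-in for "immediate"
    "P1": timedelta(hours=2),
    "P2": timedelta(hours=24),
    "P3": timedelta(days=7),
}


def resolution_deadline(priority, detected_at):
    """Return the time by which the incident should be resolved."""
    return detected_at + SLA_WINDOWS[priority]


def is_breached(priority, detected_at, now):
    """Return True if the SLA window has already elapsed."""
    return now > resolution_deadline(priority, detected_at)


detected = datetime(2024, 1, 1, 9, 0)
print(resolution_deadline("P1", detected))  # 2024-01-01 11:00:00
print(is_breached("P1", detected, datetime(2024, 1, 1, 12, 0)))  # True
```

An escalation owner could run a check like `is_breached` on open incidents to see which ones need attention first.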

Remediation Plans Are Crucial

While it’s great to have an incident plan that helps you make sense of events as they’re happening, a remediation plan to follow it is imperative. Even if most of your failures are low impact, recurrences add up and drain your teams’ time and resources. While the incident plan focuses on how you handle a bug, the remediation plan should let everyone know what steps were taken to address it and prevent it from happening again. The remediation plan should be viewed as an equal component and be required as part of any bug tracking methodology, not an optional add-on.

Most critically, a remediation plan should highlight the risk level of an incident (its priority status), a description of the failure and its impact, and a mitigation plan outlining what will be done moving forward to prevent a recurrence. Mitigation can range from setting up a new monitor that alerts you when something larger is at risk of failing to reworking your entire process to be more user-friendly if you find that it’s causing a high volume of incidents.
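To keep remediation plans consistent, it can help to capture them in a fixed structure that flags missing sections. The following is a minimal sketch, not a prescribed template: the `RemediationPlan` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RemediationPlan:
    """Illustrative record of the sections a remediation plan should cover."""
    priority: str          # risk level, e.g. "P1"
    description: str       # what failed
    impact: str            # who or what was affected
    mitigation_steps: list = field(default_factory=list)  # how to prevent it

    def missing_sections(self):
        """Return the names of any sections that are still empty."""
        missing = []
        if not self.priority.strip():
            missing.append("priority")
        if not self.description.strip():
            missing.append("description")
        if not self.impact.strip():
            missing.append("impact")
        if not self.mitigation_steps:
            missing.append("mitigation_steps")
        return missing


# Hypothetical example based on a broken lead routing incident.
plan = RemediationPlan(
    priority="P1",
    description="Lead routing stopped assigning new leads.",
    impact="New leads were not reaching the sales team.",
)
print(plan.missing_sections())  # ['mitigation_steps']

plan.mitigation_steps.append("Add a monitor that alerts on unassigned leads.")
print(plan.missing_sections())  # []
```

Treating a plan as incomplete until `missing_sections` returns an empty list is one way to enforce that remediation is a required step rather than an optional add-on.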