How To Improve Alert Response Times
Response times are part of the critical path to a robust production system and one of the key factors in an efficient support operation. Slow response times result in downtime, which costs time, money, and often customers too.
There are a number of methods that, when combined, help you improve response times. Here we’ll take a look at some top tips for improving response times through an intelligent approach to notifications.
Writing Effective Alerts – Things to Consider
Our first method to reduce response times is to generate well-written alerts. If your alert notifications give your support team a clear understanding of an issue, the team can prioritize important alerts and focus on them. Here are my top tips for writing alert notifications that your teams will respond to:
Tip #1 Clear and present danger: When creating your alert notification template, make sure you write a very clear description so that your team, even bleary-eyed in the middle of the night, can understand the importance level.
Tip #2 In summary…: Set up a focused summary of all the possible cause-based rules. It needs to be precise and brief: just enough to let the reader skim it and spot whether the cause is already a known issue.
Tip #3 Do the work for them: If you know that certain alerts will require certain information to resolve the issue, add that information to the body of the alert for reference. Links to your favorite knowledge management system or runbook will be highly appreciated by the on-call person.
Tip #4 No such thing as too much info: Add in additional information such as internal Wiki links and other ticket responses that are pertinent to the notification.
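The four tips above can be sketched as a small notification template. This is a minimal illustration, not a prescribed format: the class name, the field names, and the example URL are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class AlertNotification:
    """Hypothetical alert payload; every field name here is illustrative."""
    severity: str                 # Tip #1: importance visible at a glance
    description: str              # Tip #1: plain-language statement of the problem
    cause_summary: str            # Tip #2: brief summary of the cause-based rules
    resolution_notes: str = ""    # Tip #3: information needed to resolve the issue
    links: list[str] = field(default_factory=list)  # Tips #3 and #4: runbooks, wiki pages, tickets


def render_alert(alert: AlertNotification) -> str:
    """Render the alert as plain text, most important information first."""
    lines = [
        f"[{alert.severity.upper()}] {alert.description}",
        f"Summary: {alert.cause_summary}",
    ]
    if alert.resolution_notes:
        lines.append(f"To resolve: {alert.resolution_notes}")
    for link in alert.links:
        lines.append(f"See also: {link}")
    return "\n".join(lines)


# Example usage with made-up values.
example = AlertNotification(
    severity="critical",
    description="Checkout API error rate above 5%",
    cause_summary="Matches rule: upstream payment gateway timeouts",
    links=["https://wiki.example.com/runbooks/checkout"],
)
print(render_alert(example))
```

Putting the severity and description on the first line means the part a tired reader sees first is the part that tells them how urgently to act.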
How to Use Policies to Improve Alert Response Times
Preparing notifications using pre-completed information is one way to improve response times. Another is to apply policy settings to notification generation. The top four tips for the types of policies that help cut response times are:
Tip #1 Be negative: Allow your teams to negatively acknowledge alerts, that is, to explicitly decline an alert they cannot handle. This creates a domino effect and moves the alert on to the next person.
Tip #2 Be a thoughtful fighter: The fighter pilot John Boyd developed a framework for thinking known as OODA, or ‘observe, orient, decide, and act’. It is a process-based method of thinking about a problem and can be applied effectively by your on-call engineers.
Tip #3 Aggregation for information: Create incident threads from related incoming incidents by aggregating them into a single thread. This makes notifications clearer and easier to relate to one another.
Tip #4 Silence of the logs: Silence redundant alarms and log them.
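Three of these policies (negative acknowledgement, aggregation, and silencing) lend themselves to a short sketch. The code below is one possible shape under stated assumptions: alerts are plain dictionaries with hypothetical `service` and `check` keys, and the responder names are invented. It does not reflect any particular alerting product.

```python
from collections import defaultdict


class EscalationChain:
    """Walks an on-call rota; a negative acknowledgement moves to the next person (Tip #1)."""

    def __init__(self, responders):
        if not responders:
            raise ValueError("at least one responder is required")
        self.responders = list(responders)
        self.position = 0

    @property
    def current(self):
        return self.responders[self.position]

    def nack(self):
        """Current responder declines; return the next one, or None if the rota is exhausted."""
        if self.position + 1 >= len(self.responders):
            return None
        self.position += 1
        return self.current


def aggregate(alerts, key="service"):
    """Group related alerts into one thread per key (Tip #3)."""
    threads = defaultdict(list)
    for alert in alerts:
        threads[alert[key]].append(alert)
    return dict(threads)


def silence_redundant(alerts, log):
    """Keep the first alert per (service, check); append repeats to log instead (Tip #4)."""
    seen = set()
    kept = []
    for alert in alerts:
        fingerprint = (alert["service"], alert["check"])
        if fingerprint in seen:
            log.append(alert)
        else:
            seen.add(fingerprint)
            kept.append(alert)
    return kept
```

In this sketch, silencing runs before aggregation, so each thread contains only distinct alerts while the repeats remain available in the log for later review.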
Protocols and Response Time Optimization
Clear alert outlines and sound policies improve response times, and choosing the right channel protocol to pass those alerts through improves them further. The channels used for communicating alerts vary in immediacy. The channel protocol choices, in increasing order of responsiveness (and annoyance), are:
Dashboard
Email
Chat (Slack/HipChat etc.)
SMS/Page
Phone call
The protocol you choose during configuration will affect the response time. Good testing helps you balance getting the right level of response and response time against the burden and annoyance of false alarms. Test against the history of the alert and start with a more passive protocol for communicating it. If the alert ‘behaves’ and doesn’t generate false positives, it can move up the protocol ladder to the next channel level.
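The idea of promoting an alert up the protocol ladder once its history looks clean can be sketched as follows. The channel names mirror the list above; the function name and both thresholds (`min_samples`, `max_false_positive_rate`) are illustrative assumptions, not recommended values.

```python
# Channels from least to most intrusive, mirroring the list above.
CHANNEL_LADDER = ["dashboard", "email", "chat", "sms", "phone"]


def next_channel(current, history, min_samples=20, max_false_positive_rate=0.05):
    """Promote an alert one rung if its firing history shows that it 'behaves'.

    history is a list of booleans, one per past firing, where True means
    the firing turned out to be a false positive.
    """
    index = CHANNEL_LADDER.index(current)
    if len(history) < min_samples:
        return current  # not enough evidence yet, stay on the passive channel
    false_positive_rate = sum(history) / len(history)
    if false_positive_rate > max_false_positive_rate:
        return current  # too noisy to promote
    # Move up one rung, stopping at the top of the ladder.
    return CHANNEL_LADDER[min(index + 1, len(CHANNEL_LADDER) - 1)]
```

An alert that has fired twenty times on the dashboard with no false positives would move to email under these thresholds, while one with a quarter of its firings marked false would stay where it is.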
Alert Taxonomy
Taxonomy is all about classification, and classification is all about making things easier to understand, allowing us to see patterns and shared characteristics.
Alerts can be classified into several areas:
Severity levels
Alert states
Alert notification criticality
A miscellaneous sub-set that classifies unactionable alerts
A coherent alert taxonomy gives you a basis for applying policies and protocols, so that your overall alert system is well designed and effective, with response times that optimize system uptime. It helps ensure that teams respond to the right alert at the right time.
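The four classification areas listed above can be modelled as a handful of enumerations. This is only a sketch: the article names the categories but not their members, so every value below (and the `should_notify` policy) is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Severity(IntEnum):
    """Severity levels; ordered so they can be compared. Members are assumed."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


class AlertState(Enum):
    """Alert states; members are assumed."""
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


class Criticality(Enum):
    """Alert notification criticality; members are assumed."""
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class ClassifiedAlert:
    name: str
    severity: Severity
    state: AlertState
    criticality: Criticality
    actionable: bool = True  # False marks the miscellaneous unactionable subset


def should_notify(alert):
    """Example policy: notify only for open, actionable alerts at WARNING or above."""
    return (
        alert.actionable
        and alert.state is AlertState.OPEN
        and alert.severity >= Severity.WARNING
    )
```

Once alerts carry these labels, policies and channel choices can be written against the labels rather than against individual alerts.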