With Clarity Comes Focus: How to Reduce SRE Team’s Cognitive Load

With the success of the Google SRE book, more and more teams are transitioning from a traditional sysadmin/"ops team" model for managing production incidents towards the SRE ideal - an empowered, engineering-focused team that spends at least 50% of its time on creative work that moves the organization towards more automated and sustainable infrastructure. But cutting the time your team currently spends fire-fighting and doing other traditional "ops" tasks down to half isn't easy - it requires bold long-term vision, strategic planning, and the right tools for your team.

With this challenge in mind, a natural inclination for SRE leaders is to go searching for data - how rough is the situation right now? How do their teams measure up on common SRE KPIs, like mean time to understand (MTTU) and mean time to respond (MTTR) to incidents? How frequently are their team members woken up at night with an ops problem?

Luckily, this information is probably recorded in teams’ existing incident management platforms - but busy SRE teams don’t have a ton of time to comb through it. This is why many AIOps and incident management tools have introduced complex analytics features, happily churning out histograms and pie charts for curious team members to peruse.

Although data visualization is important, in production systems with lots of noisy alerts, it may not be as helpful as we’d hope: Underlying issues can hide in “good” data.

Let’s use mean time to acknowledge as an example. A decrease in MTTA, which is easily determined from the per-incident statistics from your platform, could look like improvement for an SRE team. But maybe the reason for this drop is that some of your systems have started producing some flapping alerts, leading the team to just ack as soon as they’re paged - the SRE equivalent of subconsciously hitting the snooze button or mass-deleting emails. In order to find this truth, you’ll have to dig deeper into the data or talk to a team member.
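To make this concrete, here is a minimal sketch (the field names, timestamps and alert names are invented for illustration) of how you might compute MTTA from exported incident records and then break it out per alert - which is where a flood of reflexively-acked flapping alerts would show up even while the overall number looks healthy:

```python
from datetime import datetime
from statistics import mean

# Hypothetical per-incident records, as they might be exported from an
# incident management platform. Field names are illustrative only.
incidents = [
    {"triggered": datetime(2023, 5, 1, 2, 0),  "acked": datetime(2023, 5, 1, 2, 1),  "alert": "disk_flap"},
    {"triggered": datetime(2023, 5, 1, 2, 30), "acked": datetime(2023, 5, 1, 2, 31), "alert": "disk_flap"},
    {"triggered": datetime(2023, 5, 1, 9, 0),  "acked": datetime(2023, 5, 1, 9, 25), "alert": "api_5xx"},
]

def mtta(records):
    """Mean time to acknowledge, in minutes."""
    return mean((r["acked"] - r["triggered"]).total_seconds() / 60 for r in records)

print(f"Overall MTTA: {mtta(incidents):.1f} min")

# Break MTTA out per alert name: a flood of flapping alerts that are acked
# reflexively drags the overall number down while the MTTA for "real"
# incidents stays flat or gets worse.
for name in {r["alert"] for r in incidents}:
    subset = [r for r in incidents if r["alert"] == name]
    print(f"  {name}: {mtta(subset):.1f} min over {len(subset)} incidents")
```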

It’s hard to maintain a single source of truth

Most modern production systems include a bevy of tools that all have their own definitions of “incidents” or “alerts,” making it tricky to understand analytics that include data points across multiple sources. For example, let’s look at a simple metric - the number of incidents a team sees per week. Does this include:

  • Every alert that triggers in a monitoring system?

  • Every incident a team member is paged for?

  • Repeats of the same alert/incident?

  • Alerts that resurface after a snooze period?

  • Multiple incidents that occur at the same time for one underlying issue?

  • Only incidents that actually require human action?

Chances are, analytics for the “number of incidents” are going to mean different things in each tool, and any one of those numbers may or may not be the most meaningful choice for a team.
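As a small illustration of how much the definition matters, here is a sketch that counts the same (invented) week of alerts four different ways; the fields "fingerprint", "paged" and "actionable" are assumptions standing in for whatever your tools actually record:

```python
# Hypothetical raw alert stream; "fingerprint" identifies repeats of the
# same alert and "actionable" marks alerts that required a human to act.
alerts = [
    {"fingerprint": "disk_full/db-1", "paged": True,  "actionable": True},
    {"fingerprint": "disk_full/db-1", "paged": True,  "actionable": True},   # repeat
    {"fingerprint": "cpu_high/web-3", "paged": False, "actionable": False},  # info only
    {"fingerprint": "api_latency",    "paged": True,  "actionable": True},
]

counts = {
    "every alert that triggered": len(alerts),
    "alerts someone was paged for": sum(a["paged"] for a in alerts),
    "unique alerts (repeats collapsed)": len({a["fingerprint"] for a in alerts}),
    "alerts requiring human action": len({a["fingerprint"] for a in alerts if a["actionable"]}),
}

for definition, n in counts.items():
    print(f"{definition}: {n}")
# The same week of data yields 4, 3, 3 and 2 "incidents" depending on
# which definition a given tool happens to use.
```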

Your team has the most valuable perspective

One of the foundational principles of Kaizen, the Japanese philosophy of continuous improvement, is that team members at all levels should be involved in the evolution of a process. This is because even with the fanciest analytics platform in the world, the people doing the day-to-day work still have the greatest insight into the hangups and struggles of the process. Using incident analytics to establish a baseline is a great idea, but the more nuanced steps toward SRE greatness don’t require fancier graphs - they just require talking to the team.

A clearer picture

Built-in tools to help you visualize your data can be useful, but in a noisy production environment, an overload of graphs and charts will actually make life harder. An underlying reason for the potential issues with an analytics-first approach to SRE process improvement is that most production systems still have lots of noisy alerts.

Instead of looking for new ways to visualize the noise, what if there was a way to create greater clarity at the alert level first?

Through layers of machine learning-driven filters and correlation logic, SignifAI helps consolidate and streamline incidents across all of your platforms so your team only gets paged for what matters.

With this clarity and context, you’ll be able to answer questions about your team’s progress and production environment health much more quickly, and spend less time pushing through the noise.

DevOps Tools - Something is Still Wrong...

Over the last 10 years, companies have invested heavily in automating specific portions of their DevOps pipeline - everything from build and test to log management, performance monitoring and incident management. Yet the common refrain we hear is that despite being able to ship more code, faster and with fewer defects, their system and application availability has not moved dramatically in a positive direction. Why is that?

More often than not, this is because the engineers who are more “Ops” than “Dev” in an organization (those actually responsible for running code in production) are constantly “fighting fires.”

Why are teams “fighting fires” instead of working on tasks that could dramatically improve uptime? They are hampered by the inefficiencies introduced by the very DevOps tools that are supposed to help them. A lack of integration, visibility and correlation between those tools forces engineers to waste time on tasks that could be automated - if only the tooling made it possible.

More specifically, here are a few examples from a typical incident management workflow where there are plenty of opportunities to automate:

  • Dealing with “alert noise”.

  • Searching for and correlating the relevant logs, events and metrics in order to create impactful solutions based on informed root cause analysis.

  • Searching for known solutions that you know apply to the current issue but just can’t locate efficiently.

  • Analyzing data, patterns and behaviors to understand what sort of preventive fixes could be implemented today that would dramatically improve uptime tomorrow.

Alert noise

First, “what is alert noise?” Alert noise is that thing you experience when you see your Slack channel or phone clogged with informational, test, duplicate, unprioritized and irrelevant alerts.

To easily understand how “alert noise” affects system and application availability, answer this simple question:

What percentage of the alerts do you receive on a daily basis that you end up immediately dismissing or spend some time debating whether or not to ignore?

Now, do the math. If an SRE on staff spends even 20 mins a day acknowledging, prioritizing, researching and ultimately dismissing alerts, that’s at least 87 hours a year wasted. Multiply that by the number of SREs on staff and ask yourself, “What could an engineer be working on during those 87 hours if it wasn’t spent ignoring or dismissing alerts?” The answer is likely, “Working on the things we know will dramatically improve our uptime, but we simply don’t have the time to work on because we are constantly fighting fires.”
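The arithmetic behind that figure is easy to reproduce; the team size below is a made-up example you can replace with your own numbers:

```python
# Back-of-the-envelope math behind the "87 hours" figure; the inputs are
# assumptions you can swap for your own numbers.
minutes_per_day = 20
workdays_per_year = 261          # roughly a year of weekdays
sres_on_staff = 5                # hypothetical team size

hours_per_sre = minutes_per_day * workdays_per_year / 60
print(f"Hours lost per SRE per year: {hours_per_sre:.0f}")           # ~87
print(f"Hours lost across the team:  {hours_per_sre * sres_on_staff:.0f}")
```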

What can be done to reduce alert noise?

One option, which a few incident management tools recommend, is to manually analyze your alert history and then create and maintain an exhaustive rules engine. This is definitely not a “set it and forget it” solution - it requires constant tweaks to account for changes in your environment if it is to keep reducing alert noise.
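For a sense of what that looks like in practice, here is a minimal sketch of the manual-rules approach; the patterns and actions are purely illustrative, and every new flapper or test environment means another hand-written entry to maintain:

```python
# A minimal sketch of the manual-rules approach: every pattern you want
# to suppress or downgrade has to be written (and maintained) by hand.
import fnmatch

SUPPRESSION_RULES = [
    {"match": "test-*",           "action": "drop"},      # test environments
    {"match": "disk_usage_warn*", "action": "dedupe"},    # known flapper
    {"match": "backup_job_info*", "action": "ticket"},    # informational only
]

def apply_rules(alert_name):
    for rule in SUPPRESSION_RULES:
        if fnmatch.fnmatch(alert_name, rule["match"]):
            return rule["action"]
    return "page"  # default: wake someone up

print(apply_rules("test-db-latency"))   # drop
print(apply_rules("api_error_rate"))    # page
```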

Another option is to leverage some sort of AIOps platform that uses machine learning and AI to analyze your alerts historically and in real time, automatically correlating, rolling up and prioritizing your alerts and events. Because the algorithms employed constantly adapt to changes in your incident workflow, there’s no need to constantly tune a rules engine.
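To illustrate the kind of roll-up such a platform performs (real systems rely on learned models rather than a single fixed rule), here is a crude sketch that groups alerts hitting the same service within a short time window; the alert fields and the five-minute window are assumptions:

```python
from datetime import datetime, timedelta

# Toy alert stream; in practice these would arrive from your monitoring tools.
alerts = [
    {"time": datetime(2023, 5, 1, 2, 0, 5),  "service": "checkout", "text": "high latency"},
    {"time": datetime(2023, 5, 1, 2, 0, 40), "service": "checkout", "text": "5xx spike"},
    {"time": datetime(2023, 5, 1, 2, 1, 10), "service": "checkout", "text": "queue depth growing"},
    {"time": datetime(2023, 5, 1, 6, 30, 0), "service": "billing",  "text": "cert expiring"},
]

def correlate(stream, window=timedelta(minutes=5)):
    """Group alerts that hit the same service within a short window."""
    groups = []
    for alert in sorted(stream, key=lambda a: a["time"]):
        for group in groups:
            if (group["service"] == alert["service"]
                    and alert["time"] - group["last_seen"] <= window):
                group["alerts"].append(alert)
                group["last_seen"] = alert["time"]
                break
        else:
            groups.append({"service": alert["service"],
                           "last_seen": alert["time"],
                           "alerts": [alert]})
    return groups

for g in correlate(alerts):
    print(f"{g['service']}: {len(g['alerts'])} alerts rolled into one incident")
```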

At the end of the day, your team is already attempting to reduce alert noise in one way or another, so why not let a machine do it more efficiently and more accurately so you can focus on more important issues?

The problem with root cause analysis

The next area of the incident management workflow that is ripe for automation is the time spent searching for and correlating all the relevant logs, events and metrics that pertain to an alert. This process is vital to informed root cause analysis. Why? Because ideally you want to put impactful, permanent solutions into production rather than temporary fixes, and doing that requires full context and as much visibility as possible into all the data relevant to the alert. Engineers are generally not satisfied with “bounce the server and hope the problem goes away” solutions and would rather invest the time to do proper root cause analysis.

However, when you are busy fighting fires, the luxury of performing detailed root cause analysis isn’t always possible. As tempting as it is to just restart a service or redeploy a container with additional resources, what if you could easily determine that the true cause of an issue was a misconfiguration in Ansible or poor test coverage in Jenkins? These types of solutions reach back further into the DevOps toolchain to address problems that will sooner or later manifest themselves again.

AIOps platforms make it easy to get all the relevant data (whether logs, events or metrics) inside the alert itself, so you don’t have to context switch between tools, download files, and cut and paste data into spreadsheets or a doc for analysis.
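Conceptually, the enrichment step looks something like the sketch below; the query functions here are stubs standing in for whatever log and metrics backends you actually run, and the lookback window is an assumption:

```python
# A sketch of automatic enrichment: when an alert fires, pull the logs and
# metrics for the same service and time window into the alert payload so
# nobody has to gather them by hand.
from datetime import datetime, timedelta

def query_logs(service, start, end):
    return [f"{service} ERROR connection pool exhausted"]   # stubbed result

def query_metrics(service, start, end):
    return {"p99_latency_ms": 2300, "error_rate": 0.07}     # stubbed result

def enrich(alert, lookback=timedelta(minutes=10)):
    start = alert["time"] - lookback
    alert["context"] = {
        "logs": query_logs(alert["service"], start, alert["time"]),
        "metrics": query_metrics(alert["service"], start, alert["time"]),
    }
    return alert

alert = {"service": "checkout", "time": datetime.now(), "text": "high latency"}
print(enrich(alert)["context"])
```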

So far we’ve managed to save time by reducing alert noise and automating the process of getting all the relevant data into the alert itself - what’s next? How about saving the time spent searching for solutions you know you’ve implemented in the past that could resolve the alert you are actively triaging?

Avoiding “reinventing the wheel”

One of the things all engineers are stereotypically bad at is documentation. SREs are no different when it comes to documenting the solutions they’ve deployed. Even when solutions are documented, they are often in a format that doesn’t lend itself to being easily indexed, searched and surfaced for reuse at a later date.

Ask yourself:

“How often have I wasted time hunting down solutions I know exist but are ‘hidden’ in a Slack conversation, email thread, Wiki, runbook or, worse, someone else’s head?”

Here is an example of how SignifAI makes it easy to document solutions using a wizard-driven knowledge base creation system.

[Image: SignifAI’s wizard-driven knowledge base creation flow]

SignifAI automatically correlates solutions from the past or industry best practices with alerts happening in real-time and surfaces them with a probability score.

[Image: Correlated solutions surfaced against a live alert with a probability score]

Machine learning is only going to be as good as the data set it has to work with (think logs, events and metrics vs. just metrics) and how much supervised learning can be applied to the recommendations the algorithms produce. SignifAI makes it easy for users to vote solutions up or down and provide feedback on anything it recommends. Ultimately, this means that the more you interact with the recommendation engine, the more accurate its recommended solutions become. More accuracy means less time spent looking for solutions you know exist.
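To show the shape of that feedback loop (this is a toy ranker, not SignifAI’s actual algorithm; the solutions, base scores and vote weighting are all invented), consider a sketch where engineer votes nudge which solutions surface first:

```python
# A toy recommendation ranker with a feedback loop: candidate solutions get a
# base relevance score, and up/down votes from engineers shift future rankings.
solutions = {
    "restart_worker_pool": {"base_score": 0.60, "votes": 0},
    "fix_ansible_ulimits": {"base_score": 0.55, "votes": 0},
    "increase_jvm_heap":   {"base_score": 0.40, "votes": 0},
}

def vote(name, up=True):
    solutions[name]["votes"] += 1 if up else -1

def ranked():
    # Each vote shifts the effective score by a small, fixed amount.
    return sorted(solutions.items(),
                  key=lambda kv: kv[1]["base_score"] + 0.05 * kv[1]["votes"],
                  reverse=True)

vote("fix_ansible_ulimits", up=True)
vote("fix_ansible_ulimits", up=True)
vote("restart_worker_pool", up=False)

for name, s in ranked():
    print(name, round(s["base_score"] + 0.05 * s["votes"], 2))
```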

Become proactive vs reactive in your approach to uptime

So, at this point, if you’ve succeeded in reducing alert noise, the time spent gathering data and the time spent hunting down fixes… what are you going to do with all this extra time you’ve gained?

“If we had the time to rearchitect our application, 80% of our problems would go away.”

“If we had the time to break up our monolithic service into microservices, we could minimize the impact of problems when they occur.”

When engineers are given additional time in their day, they will inevitably want to work on creative and interesting problems that have a far greater chance of actually improving uptime than the status quo of “fighting fires” ever will.

I believe that AI and machine learning are the perfect technologies for automating the tasks your team takes no pleasure in doing and that, at the end of the day, are not a good use of their time.

How To Improve Alert Response Times

Response times are part of the critical path to a robust production system and are one of the key factors in an efficient support operation. Slow response times result in downtime, which means lost time, money, and often customers too.

There are a number of methods that, when combined, help you improve response times. Here we’ll take a look at some top tips for improving response times through an intelligent approach to notifications.

Writing Effective Alerts – Things to Consider

Our first method for reducing response times is to generate well-written alerts. If your alert notifications give your support team a clear understanding of an issue, the team can prioritize important alerts and focus on them. Here are my top tips for writing alert notifications that your teams will respond to (a sketch that pulls them together follows the tips):

Tip #1 Clear and present danger: When creating your alert notification template, make sure you write a very clear description so that your team, even bleary-eyed in the middle of the night, can understand the importance level.

Tip #2 In summary…: Set up a focused summary of the possible causes based on your alerting rules. It needs to be precise and brief – enough to let the reader skim it and spot whether the cause is already a known issue.

Tip #3 Do the work for them: If you know that certain alerts will require certain information to resolve the issue, add that information into the body of the alert for reference. Links to your favorite knowledge management system or runbook will be highly appreciated by the on-call person.

Tip #4 No such thing as too much info: Include additional information such as internal Wiki links and other ticket responses that are pertinent to the notification.
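Pulling these tips together, here is a minimal sketch of what such a notification template might look like; all of the fields, URLs and ticket identifiers are hypothetical, so adapt them to your own alerting and knowledge base tools:

```python
# A minimal alert notification template along the lines of the tips above.
ALERT_TEMPLATE = """\
[{severity}] {service}: {summary}

Probable causes: {probable_causes}
Known issue?     {known_issue}

What to check first:
{first_steps}

Runbook:   {runbook_url}
Dashboard: {dashboard_url}
Related tickets: {related_tickets}
"""

notification = ALERT_TEMPLATE.format(
    severity="SEV-2",
    service="checkout-api",
    summary="p99 latency above 2s for 10 minutes",
    probable_causes="connection pool exhaustion, slow downstream payments API",
    known_issue="similar incident on 2023-04-12 (INC-1042)",
    first_steps="- check pool utilization graph\n- check payments API status page",
    runbook_url="https://wiki.example.com/runbooks/checkout-latency",
    dashboard_url="https://grafana.example.com/d/checkout",
    related_tickets="INC-1042, INC-987",
)
print(notification)
```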

How to Use Policies to Improve Alert Response Times

Preparing notifications with pre-completed information is one way to improve response times. Another is to apply policy settings to notification generation. Here are the top four tips for the types of policies that help cut response times (a sketch of two of them follows the list):

Tip #1 Be negative: Allow your teams to negatively acknowledge alerts. This creates a domino effect and moves the alert onto the next person.

Tip #2 Be a thoughtful fighter: The fighter pilot John Boyd developed a framework for decision-making known as OODA - ‘observe, orient, decide and act’. It is a process-based method of thinking through a problem and can be applied effectively by your on-call engineers.

Tip #3 Aggregation for information: Create incident threads from related incoming incidents by aggregating them into a single thread. It makes notifications clearer and easier to relate to one another.

Tip #4 Silence of the logs: Silence redundant alarms and log them.
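Here is a rough sketch of how the first and third of these policies might fit together - negative acknowledgement ("nack") passes the alert to the next person on the schedule, and related incidents are aggregated into one thread before anyone is paged. The rotation, the paging stub and the incident data are all hypothetical:

```python
from collections import defaultdict

ON_CALL_ROTATION = ["alice", "bob", "carol"]

def page(responder, alert):
    # Stand-in for a real paging integration; returns "ack" or "nack".
    print(f"paging {responder} for {alert['thread']}")
    return "nack" if responder == "alice" else "ack"

def escalate(alert):
    """Page responders in order until someone positively acknowledges."""
    for responder in ON_CALL_ROTATION:
        if page(responder, alert) == "ack":
            return responder
    return None  # nobody acked: raise to an incident commander

# Aggregate related incidents into a single thread before paging.
threads = defaultdict(list)
for incident in [{"service": "checkout", "text": "5xx spike"},
                 {"service": "checkout", "text": "latency"},
                 {"service": "billing",  "text": "cert expiring"}]:
    threads[incident["service"]].append(incident)

for service, members in threads.items():
    owner = escalate({"thread": service, "incidents": members})
    print(f"{service}: {len(members)} incidents, owned by {owner}")
```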

Protocols and Response Time Optimization

Optimizing alert notifications to improve response times using clear outlines and sound policies can be augmented by choosing the right channel protocol to pass those alerts through. The types of channels used for communicating alerts vary in the immediacy of the method. Channel protocol choices in order of responsiveness (and annoyance) are:

  • Dashboard

  • Email

  • Chat (Slack/HipChat etc.)

  • SMS/Page

  • Phone call

Which channel you choose as your protocol during configuration will impact the response time. The balance between getting the right level of response in the right amount of time and avoiding the burden and annoyance of false alarms can be struck with good testing. Test against the alert’s history and start off with a more passive channel for communicating that alert – if it ‘behaves’ and doesn’t generate false positives, it can move up the protocol ladder to the next channel level.
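One way to encode that "promotion up the ladder" is to key the channel off the alert’s historical precision - how often it turned out to be a real problem. The thresholds and history below are illustrative assumptions, not recommendations:

```python
# Promote an alert to a noisier channel only after it has proven itself.
CHANNELS = ["dashboard", "email", "chat", "sms", "phone"]

def choose_channel(history, base="dashboard"):
    """history: list of booleans, True = the alert turned out to be real."""
    if not history:
        return base                      # new alert: start passive
    precision = sum(history) / len(history)
    if precision > 0.9 and len(history) >= 10:
        return "phone"
    if precision > 0.75:
        return "sms"
    if precision > 0.5:
        return "chat"
    return "email"

print(choose_channel([]))                         # dashboard: no history yet
print(choose_channel([True, False, True, True]))  # chat: 75% precision
print(choose_channel([True] * 12))                # phone: consistently real
```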

Alert Taxonomy

Taxonomy is all about classification, and classification is all about making things easier to understand, allowing us to see patterns and shared characteristics.

Alerts can be classified into several areas:

  1. Severity levels

  2. Alert states

  3. Alert notification criticality

  4. A miscellaneous sub-set that classifies unactionable alerts

Having a coherent alert taxonomy gives you a tool for applying policies and protocols, making your overall alert system well designed and effective, with response times that optimize system uptime. This ensures teams are responding to the right alert, at the right time.
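One way to make the taxonomy concrete is to model each classification axis explicitly; the specific levels and names below are examples rather than a standard, so substitute whatever your organization already uses:

```python
# Enums for each classification axis, plus a flag for unactionable alerts.
from enum import Enum
from dataclasses import dataclass

class Severity(Enum):
    SEV1 = 1   # customer-facing outage
    SEV2 = 2   # degraded service
    SEV3 = 3   # no immediate customer impact

class AlertState(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"

class Criticality(Enum):
    PAGE = "page"          # interrupt someone now
    NOTIFY = "notify"      # chat/email, handle during work hours
    LOG_ONLY = "log_only"  # record, don't notify

@dataclass
class ClassifiedAlert:
    name: str
    severity: Severity
    state: AlertState
    criticality: Criticality
    actionable: bool = True   # False = the "miscellaneous" unactionable bucket

alert = ClassifiedAlert("checkout p99 latency", Severity.SEV2,
                        AlertState.TRIGGERED, Criticality.PAGE)
print(alert)
```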