Guy Fighel


DevOps Tools - Something is Still Wrong...

Over the last 10 years, companies have invested heavily in automating specific portions of their DevOps pipeline: everything from build and test to log management, performance monitoring and incident management. Yet the common refrain we hear is that despite being able to ship more code, faster and with fewer defects, their system and application availability has not moved dramatically in a positive direction. Why is that?

More often than not, this is because the engineers who are more “Ops” than “Dev” in an organization (those actually responsible for running code in production) are constantly “fighting fires.”

Why are teams “fighting fires” vs working on tasks that could dramatically improve uptime? They are hampered by the inefficiencies introduced by the very DevOps tools that are supposed to help them. A lack of integration, visibility and correlation between the tools forces engineers to waste time on tasks that could be automated, if only the tooling made it possible.

More specifically, here are a few examples from a typical incident management workflow where there are plenty of opportunities to automate…

  • Dealing with “alert noise”.

  • Searching for and correlating the relevant logs, events and metrics in order to create impactful solutions based on informed root cause analysis.

  • Searching for solutions you know are applicable to the current issue but just can’t locate efficiently.

  • Analyzing data, patterns and behaviors to understand what sort of preventive fixes could be implemented today that would dramatically improve uptime tomorrow.

Alert noise

First, “what is alert noise?” Alert noise is that thing you experience when you see your Slack channel or phone clogged with informational, test, duplicate, unprioritized and irrelevant alerts.

To easily understand how “alert noise” affects system and application availability answer this simple question:

What percentage of the alerts do you receive on a daily basis that you end up immediately dismissing or spend some time debating whether or not to ignore?

Now, do the math. If an SRE on staff spends even 20 mins a day acknowledging, prioritizing, researching and ultimately dismissing alerts, that’s at least 87 hours a year wasted. Multiply that by the number of SREs on staff and ask yourself, “What could an engineer be working on during those 87 hours if it wasn’t spent ignoring or dismissing alerts?” The answer is likely, “Working on the things we know will dramatically improve our uptime, but we simply don’t have the time to work on because we are constantly fighting fires.”
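As a quick back-of-the-envelope check (assuming roughly 260 working days a year and a hypothetical team of five SREs), the arithmetic looks like this:

    # Back-of-the-envelope estimate of time lost to alert noise.
    # Assumes ~260 working days per year and a hypothetical team of 5 SREs.
    MINUTES_PER_DAY = 20
    WORKING_DAYS_PER_YEAR = 260
    SRE_COUNT = 5

    hours_per_sre = MINUTES_PER_DAY * WORKING_DAYS_PER_YEAR / 60  # ~87 hours
    team_hours = hours_per_sre * SRE_COUNT                        # ~433 hours

    print(f"{hours_per_sre:.0f} hours per SRE, {team_hours:.0f} hours across the team")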

What can be done to reduce alert noise?

One option, which a few incident management tools recommend, is to manually analyze your alert history and then create and maintain an exhaustive rules engine. This is definitely not a “set it and forget it” solution; it requires constant tweaks to account for changes in your environment in order to keep reducing alert noise.
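To see why this approach is high-maintenance, here is a minimal sketch of what a hand-written rules engine might look like; the rule definitions and alert fields below are hypothetical, not taken from any particular tool:

    # Minimal sketch of a hand-maintained alert-suppression rules engine.
    # Every rule is hand-written and must be updated whenever hosts,
    # services or alert text change -- which is exactly the problem.
    RULES = [
        {"field": "severity", "equals": "info", "action": "drop"},
        {"field": "summary", "contains": "test", "action": "drop"},
        {"field": "host", "contains": "staging", "action": "deprioritize"},
    ]

    def apply_rules(alert: dict) -> str:
        """Return 'drop', 'deprioritize' or 'keep' for a single alert."""
        for rule in RULES:
            value = str(alert.get(rule["field"], "")).lower()
            if "equals" in rule and value == rule["equals"]:
                return rule["action"]
            if "contains" in rule and rule["contains"] in value:
                return rule["action"]
        return "keep"

    # Example: an informational alert from a staging host gets dropped.
    print(apply_rules({"severity": "info", "host": "staging-01", "summary": "disk check"}))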

Another option is to leverage an AIOps platform that uses machine learning and AI to analyze your alerts, both historically and in real time, and automatically correlate, roll up and prioritize your alerts and events. Because the algorithms constantly adapt to changes in your incident workflow, there’s no need to constantly retune a rules engine.
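The exact algorithms vary from platform to platform and are far richer than anything shown here, but the core idea of correlation and roll-up can be illustrated with a simple stand-in: group alerts that arrive within a short window and have similar text, then surface each group as a single incident. The window size and similarity threshold below are arbitrary illustrative values:

    # Illustrative stand-in for automatic alert correlation: group alerts
    # that arrive close together in time with similar summaries, then roll
    # each group up into one incident. Real AIOps platforms use far richer
    # signals (topology, metrics, history); this only sketches the idea.
    from difflib import SequenceMatcher

    WINDOW_SECONDS = 300        # arbitrary roll-up window
    SIMILARITY_THRESHOLD = 0.6  # arbitrary text-similarity cutoff

    def correlate(alerts):
        """alerts: list of dicts with 'ts' (epoch seconds) and 'summary'."""
        groups = []
        for alert in sorted(alerts, key=lambda a: a["ts"]):
            for group in groups:
                anchor = group[0]
                close_in_time = alert["ts"] - anchor["ts"] <= WINDOW_SECONDS
                similar = SequenceMatcher(
                    None, alert["summary"], anchor["summary"]).ratio() >= SIMILARITY_THRESHOLD
                if close_in_time and similar:
                    group.append(alert)
                    break
            else:
                groups.append([alert])
        return groups

    alerts = [
        {"ts": 0, "summary": "api-1 high latency"},
        {"ts": 30, "summary": "api-2 high latency"},
        {"ts": 900, "summary": "disk full on db-1"},
    ]
    print(len(correlate(alerts)), "incidents from", len(alerts), "alerts")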

At the end of the day, your team is already attempting to reduce alert noise in one way or another, so why not let a machine do it more efficiently and more accurately so you can focus on more important issues?

The problem with root cause analysis

The next area ripe for automation in the incident management workflow is the time spent searching for and correlating all the relevant logs, events and metrics that pertain to an alert. This process is vital for conducting informed root cause analysis. Why? Because ideally you want to put impactful, permanent solutions into production vs temporary fixes. Doing so requires full context and as much visibility as possible into all the relevant data that pertains to the alert. Engineers are generally not satisfied with “bounce the server and hope the problem goes away” solutions and would rather invest the time to do proper root cause analysis.

However, when you are busy fighting fires, the luxury of performing detailed root cause analysis isn’t always possible. As tempting as it is to just restart a service or redeploy a container with additional resources, what if you could easily determine that the true cause of an issue was a misconfiguration in Ansible or poor test coverage in Jenkins? These types of solutions reach back further into the DevOps toolchain to address problems that will sooner or later manifest themselves again.

AIOps platforms make it easy to get all the relevant data (whether it be logs, events or metrics) inside the alert itself, so you don’t have to context switch between tools, download files and cut and paste data into spreadsheets or a doc for analysis.
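As a rough illustration of the idea (the data sources here are stubbed in-memory lists, not any particular vendor’s API), enrichment is essentially a join between the alert and the logs and metrics that share its host and time window:

    # Rough illustration of alert enrichment: attach the logs and metrics
    # that share the alert's host and fall inside its time window, so the
    # responder sees the context in one place. The in-memory data sources
    # below are stand-ins, not a real vendor API.
    WINDOW_SECONDS = 600  # arbitrary context window around the alert

    LOGS = [
        {"ts": 95, "host": "web-1", "line": "OOMKilled: worker process"},
        {"ts": 50, "host": "db-1", "line": "slow query: 4.2s"},
    ]
    METRICS = [
        {"ts": 98, "host": "web-1", "name": "memory_used_pct", "value": 97.5},
    ]

    def enrich(alert):
        """Return the alert with related logs and metrics attached."""
        def in_window(item):
            return (item["host"] == alert["host"]
                    and abs(item["ts"] - alert["ts"]) <= WINDOW_SECONDS)
        alert["related_logs"] = [l for l in LOGS if in_window(l)]
        alert["related_metrics"] = [m for m in METRICS if in_window(m)]
        return alert

    print(enrich({"ts": 100, "host": "web-1", "summary": "service restart loop"}))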

So far we’ve managed to save time by reducing alert noise and automating the process of getting all the relevant data into the alert itself. What’s next? How about saving the time spent searching for solutions you know you’ve implemented in the past that could be used to solve the alert you are actively triaging?

Avoiding “reinventing the wheel”

One of the things all engineers are stereotypically bad at is documentation. SREs are no different when it comes to documenting the solutions they’ve deployed. Even when solutions are documented, they are often in a format that doesn’t lend itself to being easily indexed, searched and surfaced for reuse at a later date.

Ask yourself:

“How often have I wasted time hunting down solutions I know exist but are ‘hidden’ in a Slack conversation, email thread, Wiki, runbook or, worse, someone else’s head?”

Here is an example where SignifAI makes it easy to document solutions using a wizard-driven knowledge base creation system.

SignifAI automatically correlates solutions from the past or industry best practices with alerts happening in real-time and surfaces them with a probability score.

Machine learning is only going to be as good as the data set it has to work with (think logs, events and metrics vs just metrics) and how much supervised learning can be applied to the recommendations the algorithms produce. SignifAI makes it easy for users to vote up or down and provide feedback on any solution it recommends. Ultimately, this means that the more you interact with the recommendation engine, the greater the accuracy of its recommendations in the future. More accuracy means less time spent looking for solutions you know exist.
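As a loose sketch of how such a feedback loop might work in principle (this is not SignifAI’s actual algorithm, just a simple vote-weighted ranking), up and down votes can be folded back into the score that decides how prominently a known solution is surfaced next time:

    # Loose sketch of a vote-weighted recommendation ranking; not any
    # vendor's actual algorithm. Each up or down vote nudges the score
    # that determines how prominently a known solution is surfaced.
    solutions = {
        "restart-worker-pool": {"base_score": 0.70, "up": 0, "down": 0},
        "raise-memory-limit": {"base_score": 0.55, "up": 0, "down": 0},
    }

    def record_vote(name, is_up):
        solutions[name]["up" if is_up else "down"] += 1

    def ranked():
        """Order solutions by base score adjusted by accumulated feedback."""
        def score(entry):
            votes = entry["up"] + entry["down"]
            feedback = (entry["up"] - entry["down"]) / votes if votes else 0.0
            return entry["base_score"] + 0.2 * feedback  # 0.2 is an arbitrary weight
        return sorted(solutions, key=lambda name: score(solutions[name]), reverse=True)

    record_vote("raise-memory-limit", is_up=True)
    record_vote("restart-worker-pool", is_up=False)
    print(ranked())  # feedback promotes the up-voted solution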

Become proactive vs reactive in your approach to uptime

So, at this point, if you’ve succeeded in reducing alert noise, the time spent gathering data and the time spent hunting down fixes…what are you going to do with all the extra time you’ve gained?

“If we had the time to rearchitect our application, 80% of our problems would go away.”

“If we had the time to break up our monolithic service into microservices, we could minimize the impact of problems when they occur.”

When engineers are given additional time in their day, they will inevitably want to work on creative and interesting problems that have a far greater chance of actually improving uptime than the status quo of “fighting fires” ever will.

I believe that AI and machine learning are the perfect technologies to help automate the tasks your team takes no pleasure in doing and that, at the end of the day, are not a good use of their time.
