AI & Machine Learning Concepts for DevOps and SREs – Part 1

This is part one of a two-part post in which I’ll break down the most important concepts around AI, machine learning, anomaly detection and predictive analytics from the perspective of DevOps engineers and Site Reliability Engineers (SREs). I’ll also explore how these ideas and technologies can be used to improve day-to-day operations: everything from managing alert noise and correlating events, metrics and log data, to anomaly detection and predictive insights.

What is “Artificial Intelligence?”

Artificial intelligence is the name given to programs written to solve problems (often very difficult ones) that humans can already solve. The goal of many researchers and programmers in this field is to create programs that can arrive at a problem’s solution autonomously, often without supervision, using methods or logic that may differ from what a human would employ.

Why does AI matter to DevOps and SREs?

For DevOps teams, AI can take over many of the tasks that keep them from working on complex problems that require a high degree of creativity. These are precisely the types of problems that machines won’t be able to solve for a very long time. Normally, the biggest blocker that keeps teams from working on these “big problems” is a simple lack of time. One of the best ways for a team to reclaim time is to automate as many routine tasks as possible. Automation that is fast, consistent and able to adapt to new data can dramatically shift how a DevOps team spends its time, away from reactive tasks and toward predictive ones. For example:

Reduce alert noise: AI can be used to reduce alert noise by correlating and aggregating alerts from multiple monitoring systems into high-level, actionable alerts, delivered in a natural language format with immediate access to the underlying data that signaled the problem. Currently, most teams live with overflowing inboxes and messaging apps, and rely on manual processes or homegrown systems to help prioritize the issues that matter most.

Root cause analysis & correlation: As any SRE who has conducted root cause analysis can tell you, correlating monitoring data that includes events, metrics and logs can prove very challenging. Why? Because a lot of that data is incompatible and siloed. More often than not, teams have limited access to analysis tools that handle all three data types in a single view and perform the necessary correlations, at least in a way that makes root cause analysis easier, not harder. AI has the capacity to correlate these incompatible data sets in real time and associate them with the events that are creating incidents (which usually produce alerts). The result is both a high-level view of the problem and a simple drill-down into the underlying monitoring data, regardless of its format.

Predictive insights: The ability to notify DevOps teams of potential problems, and suggest solutions, before those problems cause outages is likely one of the biggest benefits AI can deliver. The accuracy of these insights will, however, rely on ongoing access to monitoring data as well as expert input from the DevOps team. Often, the solution to a problem requires a high degree of creativity that no amount of data correlation can arrive at. It is therefore important to ask any DevOps tooling vendor that claims to have AI capabilities whether its product can use human expertise as part of its machine learning training data, and whether it is proactive about doing so (for example, does the solution actively interact with the user? Does it actively crawl for data to find the unknown?).
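
To make the alert-noise example above a little more concrete, here is a minimal sketch in Python of grouping alerts from several monitoring systems into a smaller number of higher-level alerts. Everything in it is an assumption made for illustration: the Alert fields, the rule "same service, within five minutes of the previous alert" and the function names are hypothetical. A real product would use far richer correlation logic than a fixed time window.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    source: str       # hypothetical name of the monitoring system
    service: str      # service the alert refers to
    timestamp: float  # seconds since some reference point
    message: str


def group_alerts(alerts: List[Alert], window_seconds: float = 300) -> List[List[Alert]]:
    """Group alerts for the same service when each one arrives within
    window_seconds of the previous alert in that service's latest group."""
    groups: List[List[Alert]] = []
    latest_group: Dict[str, List[Alert]] = {}  # service -> most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = latest_group.get(alert.service)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group[alert.service] = group
    return groups


def summarize(group: List[Alert]) -> str:
    """Produce one human-readable line for a group of related alerts."""
    sources = sorted({a.source for a in group})
    return f"{group[0].service}: {len(group)} alerts from {', '.join(sources)}"


if __name__ == "__main__":
    raw = [
        Alert("metrics", "db", 0, "CPU above 90%"),
        Alert("logs", "db", 60, "slow query warnings"),
        Alert("metrics", "web", 100, "5xx rate rising"),
        Alert("metrics", "db", 1000, "disk nearly full"),
    ]
    for g in group_alerts(raw):
        print(summarize(g))
```

Four raw alerts become three grouped ones here. The point is only that related alerts collapse into a single line with the underlying alerts still attached to it.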

What is the difference between “weak” and “strong AI?”

As the name implies, weak AI, or “narrow AI,” is focused on solving very narrow problems or use cases. Examples of weak AI include robots on a manufacturing floor or “virtual assistants” like Amazon’s Alexa/Echo or Apple’s Siri, which use voice recognition to retrieve the results of searches or perform basic tasks like playing music, voicing a calendar reminder or telling you what the weather is in San Francisco. In a nutshell, if the AI cannot learn to perform a task it was not originally programmed to carry out, it is most definitely weak AI.

Strong AI, on the other hand, can be characterised as AI that has the ability to reason, solve problems, make judgements, strategize, learn new things, interface with humans in a natural way and exhibit other traits most commonly thought of as quintessentially “human.”

Why does this difference matter to DevOps and SREs?

It is not hard to imagine how weak AI can immediately help a DevOps team. In fact, we discussed a few of these benefits at the beginning of the post. Strong AI, however, takes a little imagination. Having a fully virtualized team member that can come up with a solution to a technical problem, ask you over Slack how your weekend was, and then get upset when you recount the plot of its favorite sitcom before it has had a chance to see it, seems far-fetched. At least for now!

What is “machine learning?”

Machine learning is the practical application of AI in the form of a set of algorithms or programs. The “learning” aspect relies on training data and time: the more relevant data you feed into the program, the longer it can evaluate that data, and the more sophisticated the algorithms it employs, the more the machine can “learn.” An example of machine learning could be a program that is constantly fed stock market data, makes predictions based on algorithms, evaluates those predictions against real-world outcomes and then adjusts its data processing in an effort to get closer to an accurate prediction of the stock market’s future performance.
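
The predict, evaluate, adjust loop described above can be sketched in a few lines of Python. This is a toy, not a stock market model: it assumes a single input value, a straight-line relationship, and a plain gradient-descent update. The class name OnlineLinearModel, the train helper and the learning rate are all hypothetical choices made for illustration.

```python
class OnlineLinearModel:
    """Toy learner: predicts y = weight * x + bias and nudges both
    parameters after seeing how wrong each prediction was."""

    def __init__(self, learning_rate: float = 0.05):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.weight * x + self.bias

    def update(self, x: float, actual: float) -> float:
        """Compare the prediction with the real outcome and adjust.
        Returns the error made before the adjustment."""
        error = self.predict(x) - actual
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error
        return error


def train(model: OnlineLinearModel, samples, passes: int = 500) -> None:
    """Feed the same (x, actual) samples to the model repeatedly."""
    for _ in range(passes):
        for x, actual in samples:
            model.update(x, actual)


if __name__ == "__main__":
    # Synthetic data generated from y = 2x + 1 (an assumption for the demo).
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
    model = OnlineLinearModel()
    print("before training:", model.predict(4.0))
    train(model, data)
    print("after training: ", model.predict(4.0))
```

Before training the model predicts 0 for every input. After enough passes over the data, its prediction for x = 4 moves close to 9, the value the underlying rule would give.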

Why does machine learning matter to DevOps and SREs?

Machine learning helps DevOps teams by reducing alert noise, correlating data for root cause analysis and producing predictive insights. Another example is anomaly detection. Although the algorithms and mathematics for detecting anomalies have been around for a very long time, only recently have monitoring vendors begun to incorporate these algorithms into their tools to help SREs spot outliers and react to them. Spotting these anomalies early is important, as they may signal (either correctly or falsely) a problem with a subsystem.
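
As an illustration of how simple some of these long-standing techniques are, here is a minimal sketch of one classic approach: flag a data point when it sits more than a chosen number of standard deviations away from the mean of the points just before it (a rolling z-score). The window size, the threshold of 3 and the function name are assumptions made for this example, and this is not a description of how any particular monitoring tool works.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def find_anomalies(
    values: Sequence[float], window: int = 10, threshold: float = 3.0
) -> List[Tuple[int, float]]:
    """Return (index, value) pairs whose z-score, measured against the
    preceding `window` points, exceeds `threshold` in absolute value."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history has no spread to compare against,
            # so this sketch skips the point instead of dividing by zero.
            continue
        z_score = (values[i] - mu) / sigma
        if abs(z_score) > threshold:
            anomalies.append((i, values[i]))
    return anomalies


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds, with one obvious spike.
    latency_ms = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 50, 11, 10]
    print(find_anomalies(latency_ms))
```

Note that the spike inflates the standard deviation for the points that follow it, so a detector this simple can miss a second anomaly that arrives shortly after the first. That is one of the ways a detector can signal falsely or fail to signal at all.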

What is “machine intelligence?”

Machine intelligence is an umbrella term that unifies artificial intelligence and machine learning.

Why does machine intelligence matter to DevOps and SREs?

Machine intelligence offers very real and tangible benefits to DevOps teams by combining statistical algorithms, classification, regression, Bayesian statistical modeling and other machine learning techniques with the power of a true AI model such as an expert system. It is this pairing of a solid AI engine that accepts reinforcement from the SRE with machine learning algorithms and multiple mathematical approaches that is so powerful and relevant to SREs. Simply applying machine learning algorithms to monitoring data and calling that AI is not accurate.

What is “deep learning?”

Deep learning is a very specific genre of AI that relies on neural network design. The “neurons,” or more specifically the nodes, in these networks are arranged in layers, with each layer building on the output of the one before it. This layering lets a network learn patterns that a single-layer system cannot represent. “Neural” is used to describe this architecture because it is loosely modeled on how the brain processes information.
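
To show what layering buys you, here is a minimal sketch of a two-layer network written in plain Python. The weights are hand-picked for illustration, not learned, and the function names are hypothetical. The network computes XOR ("one input or the other, but not both"), a well-known example of a pattern that a single layer of this kind cannot represent but that two layers handle easily.

```python
from typing import Callable, List


def relu(x: float) -> float:
    """Common activation function: pass positives through, clamp negatives to 0."""
    return max(0.0, x)


def identity(x: float) -> float:
    return x


def dense(
    inputs: List[float],
    weights: List[List[float]],
    biases: List[float],
    activation: Callable[[float], float],
) -> List[float]:
    """One layer: each node takes a weighted sum of all inputs, adds its
    bias and applies the activation function."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Hand-picked (not trained) parameters, chosen so the network computes XOR.
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_BIASES = [0.0, -1.0]
OUTPUT_WEIGHTS = [[1.0, -2.0]]
OUTPUT_BIASES = [0.0]


def forward(x1: float, x2: float) -> float:
    """Run the inputs through the hidden layer, then the output layer."""
    hidden = dense([x1, x2], HIDDEN_WEIGHTS, HIDDEN_BIASES, relu)
    return dense(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, identity)[0]


if __name__ == "__main__":
    for a in (0.0, 1.0):
        for b in (0.0, 1.0):
            print(a, b, "->", forward(a, b))
```

In a real deep learning system the weights are found by training on data, and there are many more layers and nodes, but the flow of values from one layer into the next is the same idea.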

Why does deep learning matter to DevOps and SREs?

For DevOps teams, deep learning can play a powerful role in processing data, identifying patterns or anomalies, and correlating data in real time. Because the neural networks that deep learning depends on are often only as powerful as the training data sets they have access to, the ability to process events, metrics, logs, chat logs, runbooks and so on makes the correlations and predictions that much more accurate and powerful.