AI & Machine Learning Concepts for DevOps and SREs – Part 2

This is part two of a two-part series where I break down the most important concepts around AI, machine learning, anomaly detection and predictive analytics from the perspective of DevOps and Site Reliability Engineers (SREs). I’ll also explore how these ideas and technologies can be used to improve day-to-day operations: everything from managing alert noise and correlating events, metrics and log data, to anomaly detection and predictive insights. This time we’ll dive into more detailed concepts, building on the ideas in Part One.

What are “predictive analytics?”

The goal of “predictive analytics” is to analyze historical and recent data in order to identify patterns that can inform predictions about future behaviors, outcomes, events or performance.
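As a minimal illustration of the idea (not a description of any particular product’s implementation), the sketch below fits a simple linear trend to historical disk-usage samples and extrapolates it to estimate when a threshold will be crossed. All numbers and names are hypothetical.

```python
# Minimal illustration of predictive analytics: fit a linear trend to
# historical disk-usage samples and extrapolate to estimate when the
# disk will fill. All names and numbers here are hypothetical.
import numpy as np

# hours since the start of monitoring -> disk usage in percent (hypothetical history)
hours = np.array([0, 6, 12, 18, 24, 30, 36], dtype=float)
usage = np.array([62, 64, 67, 69, 72, 74, 77], dtype=float)

slope, intercept = np.polyfit(hours, usage, 1)   # least-squares linear fit

# Predict usage 24 hours ahead and estimate when we cross 90%.
future = hours[-1] + 24
predicted = slope * future + intercept
hours_to_90 = (90 - usage[-1]) / slope

print(f"Predicted usage in 24h: {predicted:.1f}%")
print(f"Estimated hours until 90% full: {hours_to_90:.1f}")
```

A real system would use far richer models than a straight line, but the principle is the same: learn a pattern from the past and project it forward.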

Why do predictive analytics matter to DevOps and SREs?

Predictive analytics, or what we at SignifAI refer to as “Predictive Insights,” is especially valuable to DevOps teams. These predictions are powerful because they correlate incompatible data formats, including data you are actively monitoring and data you aren’t. An example of a predictive insight might be an alert informing you that your application will violate multiple performance thresholds because the API of a third-party provider will go offline for maintenance, an event announced in a Slack chat weeks ago that no one on the team acted on.

What are the differences between “training,” “validation” and “test” data sets?

“Training,” “validation” and “test” data sets are used by machine learning programs as the basis for their learning. The data sets themselves can be characterized in the following ways:

Training data set: This is the data the machine learning program is given access to during its initial “learning” phase. The output of the “learning” can be the machine successfully finding previously unknown relationships between data points within the data set.

Validation data set: The machine learning program makes use of this data set after the completion of the “learning” phase above, in order to identify which relationships will be the most useful at predicting future events or outcomes.

Test data set: This is the final data set the machine learning program utilizes to simulate real-world predictions. The predictions themselves can then be evaluated for accuracy before being rolled out into production.
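As a minimal sketch of how these three sets are typically carved out of a single pool of data (the 70/15/15 proportions and the random shuffle below are assumptions; real pipelines often split by time or by entity to avoid leakage):

```python
# Minimal sketch of splitting a data set into training, validation and
# test subsets (roughly 70/15/15). Purely illustrative.
import random

def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=42):
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]                # used during the initial "learning" phase
    val = shuffled[n_train:n_train + n_val]   # used to pick the most useful relationships
    test = shuffled[n_train + n_val:]         # held out to simulate real-world predictions
    return train, val, test

train, val, test = split_dataset(list(range(1000)))
print(len(train), len(val), len(test))  # 700 150 150
```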

Why does understanding the function of these data sets matter to DevOps and SREs?

For DevOps teams it is valuable to understand the function of the underlying data sets that machines use to arrive at their recommendations. Why? Because for the foreseeable future, many SREs will be asked to respond to or implement recommendations that are not automated. By understanding the applicability, the rigor of the analysis and the utility of the data sets, an SRE can have greater (or lesser) confidence in the machine’s recommendation.

What is “natural language processing?”

In the context of machine learning, “natural language processing” (NLP) refers to the effort to remove as much friction as possible from the interactions between machines and people. More specifically, NLP attempts to enable machines to make sense of human language so that the interface between computers and people can be improved.
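As a toy illustration of the kind of work NLP does (real systems use trained language models, not keyword matching), the sketch below pulls a couple of structured facts out of a free-text maintenance notice. The message, service names and rules are all hypothetical.

```python
# Toy sketch: extract structured facts from a free-text maintenance notice.
# Real NLP uses trained models; this keyword/regex approach is only illustrative.
import re

message = "Heads up: the payments API will be offline for maintenance on June 3 from 02:00-04:00 UTC."

tokens = re.findall(r"[\w:-]+", message.lower())
is_maintenance = "maintenance" in tokens and ("offline" in tokens or "downtime" in tokens)
services = [t for t in tokens if t in {"payments", "billing", "auth"}]  # hypothetical service names

print(is_maintenance)   # True
print(services)         # ['payments']
```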

What is “natural language generation?”

“Natural language generation” (NLG) refers to the algorithms programmers use so that machines can produce language (written or spoken) that a human cannot readily identify as machine-generated. Examples of NLG might include automated voice prompts that adapt to verbal cues, and chatbots that feel like you are talking to an actual customer service representative.

Why do NLP and NLG matter to DevOps and SREs?

NLP and NLG are essential if we ever expect a “virtual SRE” team member that is wholly AI driven to be able to make decisions or give recommendations that take into account relevant information appearing in text messages, chat logs, email threads and maybe even conference calls. The ability to process this type of data is a first step; the final desired output is notifications and recommendations delivered in a way that looks and sounds like they come from a human team member. The advantage is less time spent “translating” unfriendly machine-speak.

What are “neural networks?”

Artificial neural networks are architected to closely resemble the way brain neurons are connected to each other and process information, hence the name “neural network.” As mentioned in Part One of this series, deep learning makes heavy use of neural network design. A more specialized version of a neural network is a “recurrent neural network.” In this type of network, the outputs of the network are fed back into itself, allowing it to use the learning it has achieved so far to process subsequent data more efficiently.
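The “fed back into itself” idea can be sketched in a few lines. The weights below are random and untrained, so this is illustrative only, not a production recurrent network.

```python
# Minimal sketch of the recurrent idea: the network's previous state is fed
# back in alongside each new input, so earlier observations influence how
# later ones are processed. Weights are random and untrained.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_in = rng.normal(size=(hidden_size, input_size))
W_rec = rng.normal(size=(hidden_size, hidden_size))

def rnn_step(x, h_prev):
    # The new state depends on the current input AND the state carried forward.
    return np.tanh(W_in @ x + W_rec @ h_prev)

h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):   # a sequence of 5 observations
    h = rnn_step(x, h)
print(h)   # final state summarizes the whole sequence
```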

Why do neural networks matter to DevOps and SREs?

For the SRE, neural networks combined with deep learning make a powerful pairing that provides enormous leverage in processing monitoring data, identifying patterns or anomalies, and correlating incompatible data sets and types in real time.

What are the differences between “reinforcement,” “supervised” and “unsupervised learning?”

There are typically three ways in which machines “learn”. These include:

Reinforcement learning happens when machines are trained to use experimentation and reward to arrive at their outputs. As in psychology, the machine is rewarded when it delivers desirable outputs and “punished” when it does otherwise.

Supervised learning happens when programmers are actively engaged in the machine’s “learning” process. Programmers intervene at almost every step, including curating data sets and manually validating outputs to ensure that the desired result is achieved.

Unsupervised learning is the opposite of the above. In this scenario the machine is allowed to arrive at whatever conclusions it can based on the data fed into the program.
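To make the contrast concrete, here is a small sketch using scikit-learn (assumed to be installed; the data is synthetic). The supervised model is handed the correct labels during training, while the unsupervised one has to discover the grouping on its own.

```python
# Supervised vs. unsupervised learning on the same synthetic data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=2, random_state=0)

# Supervised: we provide the "right answers" (y) during training.
clf = LogisticRegression().fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: no labels; the algorithm groups the data on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == i).sum()) for i in (0, 1)])
```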

Why do these differences matter to DevOps and SREs?

Similar to the importance of understanding the role data sets play in how a machine learns, it is also important for DevOps teams to understand the different ways in which a machine actually “learns.” By understanding the method by which a machine comes to a conclusion or recommendation, an SRE can have greater (or lesser) confidence in the machine’s prediction.

AI & Machine Learning Concepts for DevOps and SREs – Part 1

This is part one of a two-part post where I’ll break down the most important concepts around AI, machine learning, anomaly detection and predictive analytics from the perspective of DevOps and Site Reliability Engineers (SREs). I’ll also explore how these ideas and technologies can be used to improve day-to-day operations: everything from managing alert noise and correlating events, metrics and log data, to anomaly detection and predictive insights.

What is “Artificial Intelligence?”

Artificial intelligence is the name given to programs that have been written to solve problems (often very difficult ones) which humans can already solve. The goal of many researchers and programmers in this field is to create programs that can arrive at a problem’s solution autonomously, often without supervision and using methods or logic that might differ from what a human might employ.

Why does AI matter to DevOps and SREs?

For DevOps teams, AI can perform many of the tasks that keep them from working on complex problems that require a high degree of creativity. These are precisely the types of problems that machines won’t be able to solve for a very long time. Normally, the biggest blocker that keeps teams from working on these “big problems” is a simple lack of time. One of the best ways for a team to reclaim time is to automate as many routine tasks as possible. Automation that is fast, consistent and can adapt to new data can dramatically shift how a DevOps team spends its time, moving it from reactive work toward predictive work. For example:

Reduce alert noise: AI can be used to reduce alert noise by correlating and aggregating alerts from multiple monitoring systems into high-level, actionable alerts delivered in a natural language format with immediate access to the underlying data that signaled the problems. Currently, most teams live with overflowing inboxes and messaging apps, plus manual processes or homegrown systems to help prioritize the issues that matter most.

Root cause analysis & correlation: As any SRE who has conducted root cause analysis can tell you, correlating monitoring data that includes events, metrics and logs can prove very challenging. Why? Because there is a lot of data that is incompatible and siloed. Teams more often than not have limited access to analysis tools that handle all three data types in a single view and perform the necessary correlations. At least in a way that makes root cause analysis easier, not harder. AI has the capacity to correlating these incompatible data sets in real-time and associating them with events that are creating incidents which most of the times produce alerts, to give both a high-level view of the problem and a simple drill-down into the underlying monitoring data, regardless of its format.

Predictive insights: The ability to notify DevOps teams of potential problems, and to suggest solutions, before those problems cause outages is likely one of the biggest benefits AI can deliver. The accuracy of these insights will, however, rely on ongoing access to monitoring data as well as expert input from the DevOps team. Oftentimes, the solution to a problem requires a degree of creativity that no amount of data correlation can produce. Therefore, it is important to ask any DevOps tooling vendor that claims AI capabilities whether its solution can use human expertise as part of its machine learning training data and whether it is proactive about it (for example, does the solution actively interact with the user? Is it actively crawling for data to find the unknown?).
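As promised above, here is a hypothetical sketch of how alerts from several monitoring tools might be grouped into a single incident by shared service and a time window. The alert shape, field names and tools listed are invented for illustration; production systems weigh topology, dependencies and much more.

```python
# Hypothetical sketch: group alerts from different monitoring tools into one
# incident when they concern the same service and fire close together in time.
from collections import defaultdict
from datetime import datetime, timedelta

alerts = [
    {"source": "nagios",     "service": "checkout", "time": datetime(2017, 5, 1, 10, 0), "msg": "CPU > 90%"},
    {"source": "new_relic",  "service": "checkout", "time": datetime(2017, 5, 1, 10, 2), "msg": "p99 latency high"},
    {"source": "prometheus", "service": "checkout", "time": datetime(2017, 5, 1, 10, 3), "msg": "error rate spike"},
    {"source": "nagios",     "service": "search",   "time": datetime(2017, 5, 1, 14, 0), "msg": "disk 85% full"},
]

def correlate(alerts, window=timedelta(minutes=10)):
    """Group alerts for the same service that fire within `window` of each other."""
    incidents = defaultdict(list)
    for a in sorted(alerts, key=lambda a: a["time"]):
        groups = incidents[a["service"]]
        if groups and a["time"] - groups[-1][-1]["time"] <= window:
            groups[-1].append(a)       # close in time: same incident
        else:
            groups.append([a])         # first alert, or too far apart: new incident
    return incidents

for service, groups in correlate(alerts).items():
    for g in groups:
        print(f"{service}: 1 incident from {len(g)} alerts ({[a['source'] for a in g]})")
```

Three checkout alerts collapse into one actionable incident, while the unrelated search alert stays separate.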

What is the difference between “weak” and “strong AI?”

As the name implies, weak AI, or “narrow AI,” is focused on solving very narrow problems or use cases. Examples of weak AI include robots on a manufacturing floor or “virtual assistants” like Amazon’s Alexa/Echo or Apple’s Siri, which use voice recognition to retrieve the results of searches or perform basic tasks like playing music, voicing a calendar reminder or telling you what the weather is in San Francisco. In a nutshell, if the AI cannot learn to perform a task it was not originally programmed to carry out, it is most definitely weak AI.

Strong AI, on the other hand, can be characterized as AI that has the ability to reason, solve problems, make judgments, strategize, learn new things, interface with humans in a natural way and exhibit other traits most commonly thought of as quintessentially “human.”

Why does this difference matter to DevOps and SREs?

It is not hard to imagine how weak AI can immediately help a DevOps team; in fact, we discussed a few of these benefits at the beginning of the post. Strong AI, however, takes a little imagination. Having a fully virtualized team member that can come up with a solution to a technical problem, ask you how your weekend was over Slack and then get upset when you recount the plot of its favorite sitcom before it has a chance to see it, seems far-fetched. At least for now!

What is “machine learning?”

Machine learning is the practical application of AI in the form of a set of algorithms or programs. The “learning” aspect relies on training data and time: the more relevant data you feed into the program, the longer it can evaluate it, and the more sophisticated the algorithms it employs, the more the machine can “learn.” An example of machine learning could be a program that is constantly fed stock market data, makes predictions based on algorithms, evaluates those predictions against real-world outcomes and then adjusts its data processing in an effort to produce ever more accurate predictions about the future performance of the stock market.
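The loop described above (predict, compare against the real outcome, adjust) can be sketched in a few lines. The numbers and the simple update rule below are purely illustrative, not an actual trading or forecasting model.

```python
# Toy "learn from feedback" loop: predict, observe the real outcome,
# measure the error, adjust. Numbers and the update rule are illustrative.
observations = [101.0, 103.5, 102.0, 105.0, 107.5, 110.0]  # hypothetical prices

estimate = observations[0]
learning_rate = 0.3
for actual in observations[1:]:
    prediction = estimate                 # predict using what we've learned so far
    error = actual - prediction           # compare against the real outcome
    estimate += learning_rate * error     # adjust so future predictions improve
    print(f"predicted {prediction:6.2f}, actual {actual:6.2f}, error {error:+.2f}")
```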

Why does machine learning matter to DevOps and SREs?

Machine learning helps DevOps teams by reducing alert noise, correlating data for the purposes of conducting root cause analysis and producing predictive insights. Another example is anomaly detection. Although the algorithms and mathematics to detect anomalies have been around for a very long time, only very recently have monitoring tools begun to incorporate them to help SREs spot outliers and react to them. Spotting these anomalies early is important, as they may signal (either genuinely or as a false positive) a problem with a subsystem.
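A minimal sketch of the idea, assuming a simple z-score rule over a single latency series (real monitoring tools account for seasonality, trends and noise with far more robust methods; the data and threshold here are hypothetical):

```python
# Minimal anomaly-detection sketch: flag points that deviate from the mean of
# the series by more than a few standard deviations. Data is hypothetical.
import statistics

latencies_ms = [120, 118, 125, 122, 119, 121, 450, 123, 120, 117]

def zscore_anomalies(values, threshold=2.5):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1.0   # avoid division by zero
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

print(zscore_anomalies(latencies_ms))   # [(6, 450)] -> the latency spike is flagged
```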

What is “machine intelligence?”

Machine intelligence is an umbrella term that brings together artificial intelligence and machine learning.

Why does machine intelligence matter to DevOps and SREs?

Machine intelligence offers very real and tangible benefits to DevOps teams by combining statistical algorithms, classification, regression, Bayesian statistical modeling and other machine learning techniques with the power of a true AI model such as expert systems. It is the combination of a solid AI engine, one that allows reinforcement from the SRE, with machine learning algorithms and multiple mathematical approaches that is so powerful and relevant to SREs. Simply applying machine learning algorithms to monitoring data and calling that AI is not accurate.

What is “deep learning?”

Deep learning is a very specific branch of AI that relies on neural network design. The “neurons,” or more specifically the nodes in these networks, are layered in a way that provides far greater processing power and learning speed than monolithic AI systems. “Neural” is used to describe this architecture because it closely resembles how the brain processes information.
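To illustrate the layered structure (and nothing more), here is a tiny forward pass through a network with two hidden layers. The sizes and random weights are arbitrary; a real system would learn the weights from training data.

```python
# Illustrative forward pass through a small layered ("deep") network: each
# layer of nodes transforms the output of the previous one. Weights are random.
import numpy as np

rng = np.random.default_rng(1)
layer_sizes = [8, 16, 16, 2]        # input -> two hidden layers -> output
weights = [rng.normal(size=(m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    for W in weights[:-1]:
        x = np.maximum(0, W @ x)    # ReLU activation in the hidden layers
    return weights[-1] @ x          # final layer: raw output scores

print(forward(rng.normal(size=8)))
```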

Why does deep learning matter to DevOps and SREs?

For DevOps teams, deep learning can play a powerful role in processing data, identifying patterns or anomalies and correlating data in real time. Because the neural networks deep learning depends on are often only as powerful as the training data sets they have access to, the ability to process events, metrics, logs, chat logs, runbooks and so on ends up making the correlations and predictions that much more accurate and powerful.

Achieving Increased Uptime with Machine Intelligence

I am pleased to announce today the general availability of SignifAI’s machine intelligence platform! This product release is the realization of the hard-learned lessons from my colleagues’ and my last 10 years working in TechOps.

What is SignifAI?

SignifAI is a cloud-based machine intelligence platform that integrates, mostly over APIs, with more than 60 monitoring tools (commercial and open source) like New Relic, AppDynamics, Splunk, Prometheus, Nagios and many more. It gives TechOps teams fast answers and predictive insights into the most critical issues that threaten system uptime. We accomplish this by combining machine intelligence with a TechOps team’s expertise, not generic learning algorithms.

Why SignifAI?

In our experience as TechOps engineers, despite having a variety of monitoring tools, we still operated in a highly reactive mode. Because we were constantly fighting fires, we never found the time to implement the processes and systems that would have allowed us to identify issues before they caused outages. Frustrated by the available solutions, we decided last year to take a new approach to uptime using machine intelligence. Specifically, the problems we wanted to solve were:

  • No more siloed data: Monitoring systems with incompatible data formats made it challenging to make meaningful correlations across the application and infrastructure stack. Creating a meaningful connection between triggered events and multiple time-series streams is not easy.

  • Being proactive in the collection phase: In the past, we have missed very important events from existing solutions. We have learned (the hard way) that waiting for events is not enough. Being proactive in the collection phase has huge advantages, but it is hard and requires expertise.

  • Alert noise: Having lots of monitoring tools led to alert overload, making it hard to prioritize the most important issues.

  • Inefficient root cause analysis: Correlating large volumes of log, event and metric data in real time was challenging even for the most experienced engineers.

  • Cumbersome knowledge capture: We often struggled with how to document and share “lessons learned” so that applicable solutions from the past could be applied to currently open issues in an automated way. We always wanted something that could detect potential issues based on our previous experience.

How SignifAI works

As the saying goes, “a picture is worth a thousand words,” and I think the best way to understand exactly how SignifAI works is to see it in action. Check out this short two-minute demo that walks you through all the key features of the platform, including:

  • Collecting data with sensors

  • Responding to alerts

  • Performing root cause analysis

  • Capturing your expertise and knowledge

  • Working with Predictive Insights and Answers™

In upcoming blogs, I’m going to review in depth some of the technologies we are using such as:

  • Our data pipeline: Focusing on active collection over APIs.

  • Algorithms: Some of the outlier-detection statistical algorithms we have decided to implement, especially for time-series data.

  • Event collection and processing: Our correlation engine, our decision and evaluation engine, and why our approach is so different and so focused on the TechOps use case.

  • Open source: Our contribution to open source and why we have decided to adapt the Snap open telemetry framework and extend it.

Getting started is quick, and enabling integrations with the monitoring tools you already use is simple. The typical time to the first predictive insight is less than 20 minutes.

SignifAI believes that when TechOps teams deliver more uptime, they can find the time to work on more complex problems that require creative solutions — precisely the things machines can’t do!

In closing, I would like to emphasize our commitment to the TechOps community. There are great open source monitoring and alerting solutions out there, as well as other awesome vendor tools. We are not on a mission to replace or consolidate those. We also believe that we should not capture your data and knowledge without allowing you the ability to export it at any given time. That’s why we have invested in full exporting functionality. Our belief and goal is to make our day-to-day as a community better by empowering people.

Machine intelligence is a very broad term, and we are well aware that it is often abused and overused. That is exactly why we are bringing to market a first release that is based on many years of experience and a combination of computational techniques, but which puts the expert human at the center.