AI & Machine Learning Concepts for DevOps and SREs – Part 2

This is part two of a two-part series where I break down the most important concepts around AI, machine learning, anomaly detection and predictive analytics from the perspective of DevOps and Site Reliability Engineers (SREs). I’ll also explore how these ideas and technologies can be used to improve day-to-day operations: everything from managing alert noise and correlating events, metrics and log data, to anomaly detection and predictive insights. This time we’ll dive into more detailed concepts, building on the ideas in Part One.

What are “predictive analytics?”

The goal of “predictive analytics” is to analyze historical and recent data in order to identify patterns that can inform predictions about behaviors, outcomes, events or performance in the future.
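As a deliberately tiny sketch of this idea, a least-squares trend line fitted to historical readings of a single metric can be extrapolated to predict its future value. The disk-usage numbers below are hypothetical, and real predictive systems use far richer models than a straight line:

```python
# Minimal sketch: fit a least-squares trend line to historical disk-usage
# readings and extrapolate forward. The metric values are hypothetical.
def fit_trend(values):
    """Return (slope, intercept) of the least-squares line y = m*x + b."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    var = sum((x - mean_x) ** 2 for x in xs)
    m = cov / var
    return m, mean_y - m * mean_x

def predict(values, steps_ahead):
    """Extrapolate the fitted line `steps_ahead` points past the data."""
    m, b = fit_trend(values)
    return m * (len(values) - 1 + steps_ahead) + b

# Daily disk usage (%) over the last week; predict three days out.
usage = [61, 63, 66, 68, 71, 73, 76]
print(round(predict(usage, 3), 1))  # 83.3
```

Even this crude extrapolation supports a useful prediction: at the current growth rate, the disk crosses a threshold in a few days, before it actually fills.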

Why do predictive analytics matter to DevOps and SREs?

Predictive analytics, or what we at SignifAI refer to as “Predictive Insights,” is especially valuable to DevOps teams. These predictions are powerful because they correlate otherwise incompatible data formats, including both data you are actively monitoring and data you aren’t. An example of a predictive insight might be an alert warning you that your application will violate multiple performance thresholds because the API of a third-party provider is scheduled to go offline for maintenance, something announced in a Slack message weeks ago that no one on the team reacted to.

What are the differences between “training,” “validation” and “test” data sets?

“Training,” “validation” and “test” data sets are used by machine learning algorithms and programs as the basis for their learning. The data sets themselves can be characterized in the following ways:

Training data set: This is the data the machine learning program is given access to during its initial “learning” phase. The output of the “learning” can be the machine successfully finding previously unknown relationships between data points within the data set.

Validation data set: The machine learning program makes use of this data set after the completion of the “learning” phase above, in order to identify which relationships will be the most useful at predicting future events or outcomes.

Test data set: This is the final data set the machine learning program utilizes to simulate real-world predictions. The predictions themselves can then be evaluated for accuracy before being rolled out into production.
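The three-way split described above can be sketched in a few lines. The 70/15/15 ratios and the fixed random seed below are illustrative choices, not requirements:

```python
import random

def split_dataset(records, train=0.7, validation=0.15, seed=42):
    """Shuffle records and split into training, validation and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # seed makes the split repeatable
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],                      # learn relationships here
            shuffled[n_train:n_train + n_val],       # pick the useful ones here
            shuffled[n_train + n_val:])              # simulate the real world here

# 100 hypothetical monitoring records split 70/15/15.
records = list(range(100))
train_set, val_set, test_set = split_dataset(records)
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

The key property is that the three sets never overlap, so the test set gives an honest estimate of how the model will behave on data it has never seen.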

Why does understanding the function of these data sets matter to DevOps and SREs?

For DevOps teams it is valuable to understand the function of the underlying data sets that machines use to arrive at their recommendations. Why? Because for the foreseeable future, many SREs will be asked to respond to or implement recommendations that are not automated. By understanding the applicability, rigor and utility of the data sets behind the analysis, an SRE can have greater (or lesser) confidence in the machine’s request.

What is “natural language processing?”

In the context of machine learning, “natural language processing” (NLP) is the effort to remove as much friction as possible from the interactions between machines and people. More specifically, NLP attempts to enable machines to make sense of human language so that the interface between computers and people can be improved.
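Real NLP relies on trained language models, but even a crude keyword scan illustrates the goal of pulling structure out of free-form human text. The message and the pattern below are hypothetical:

```python
import re

# Toy sketch only: spot a maintenance notice in a chat message and pull
# out the weekday it mentions. Genuine NLP handles the endless ways a
# human might phrase this; a regex handles exactly one of them.
MAINTENANCE_PATTERN = re.compile(
    r"\b(maintenance|downtime|offline)\b.*?\bon (\w+day)\b", re.IGNORECASE)

def extract_maintenance_notice(message):
    """Return the weekday mentioned in a maintenance notice, or None."""
    match = MAINTENANCE_PATTERN.search(message)
    return match.group(2) if match else None

msg = "Heads up: the payments API will be offline for upgrades on Saturday"
print(extract_maintenance_notice(msg))  # Saturday
```

The gap between this sketch and a system that understands “we’re bouncing the payment boxes this weekend” is exactly the gap NLP is trying to close.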

What is “natural language generation?”

“Natural language generation,” or NLG, refers to the algorithms programmers use so that machines can produce language (written or spoken) that a human cannot readily identify as machine-generated. Examples of NLG might include automated voice prompts that adapt to verbal cues, and chatbots that feel like you are talking to an actual customer service representative.
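The simplest form of NLG is template-based generation: turning a structured record into a human-sounding sentence. The incident fields below are hypothetical, and production NLG goes well beyond filling in blanks:

```python
# Sketch of template-based generation: render a structured incident
# record as a readable summary. Field names are hypothetical.
def summarize_incident(incident):
    minutes = incident["duration_s"] // 60
    return (f"{incident['service']} saw elevated {incident['symptom']} "
            f"for about {minutes} minutes, most likely caused by "
            f"{incident['suspected_cause']}.")

incident = {
    "service": "checkout-api",
    "symptom": "latency",
    "duration_s": 540,
    "suspected_cause": "a slow downstream database",
}
print(summarize_incident(incident))
```

Compare the output with the raw dictionary: the content is identical, but one of them can be dropped straight into a status channel without translation.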

Why does NLP and NLG matter to DevOps and SREs?

NLP and NLG are essential if we ever expect a “virtual SRE” team member that is wholly AI-driven to be able to make decisions or give recommendations that take into account relevant information appearing in text messages, chat logs, email threads and maybe even conference calls. The ability to process this type of data is obviously the first step, with the final desired output being notifications and recommendations delivered in a way that looks and sounds like they come from a human team member. The advantage is less time spent “translating” unfriendly machine-speak.

What are “neural networks?”

Artificial neural networks are architected to closely resemble the way brain neurons connect to each other and process information, hence the name “neural network.” As mentioned in Part One of this series, deep learning makes heavy use of neural network design. A more specialized version is the “recurrent neural network,” in which the outputs of the network are fed back into itself, allowing it to use the learning it has achieved so far to more efficiently process subsequent data.
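The recurrent feedback loop described above can be sketched with a single hand-wired unit. Real networks learn their weights during training; the weights and inputs here are arbitrary values chosen purely for illustration:

```python
import math

# Minimal sketch of one recurrent unit: each step combines the new input
# with the previous hidden state, so earlier data influences later
# outputs. The weights are fixed by hand purely for illustration.
def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, bias=0.0):
    """One recurrent step: new hidden state from input and prior state."""
    return math.tanh(w_x * x + w_h * h_prev + bias)

def run_sequence(inputs):
    h = 0.0       # hidden state starts empty
    states = []
    for x in inputs:
        h = rnn_step(x, h)   # output is fed back in as the next state
        states.append(h)
    return states

# A single spike at the start keeps echoing through later steps.
states = run_sequence([1.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

Notice that the second and third outputs are nonzero even though their inputs are zero: the network is “remembering” the first input, which is exactly the property that makes recurrent networks useful on sequential data such as metric streams.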

Why do neural networks matter to DevOps and SREs?

For the SRE, neural networks combined with deep learning make for a powerful combination that can give exponential leverage in processing monitoring data, identifying patterns or anomalies, and correlating incompatible data sets and types in real time.
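As a minimal illustration of anomaly spotting in monitoring data, a simple z-score check flags readings that sit far from the mean of a recent window. Production systems use far more sophisticated models than this, and the latency numbers are made up:

```python
import statistics

# Sketch only: flag metric readings more than `threshold` standard
# deviations from the mean of a recent window. The latency values
# are hypothetical.
def anomalies(window, threshold=3.0):
    mean = statistics.mean(window)
    stdev = statistics.pstdev(window)   # population standard deviation
    if stdev == 0:
        return []                       # a flat window has no outliers
    return [x for x in window if abs(x - mean) / stdev > threshold]

latencies_ms = [102, 98, 101, 99, 103, 100, 97, 350]  # one obvious spike
print(anomalies(latencies_ms, threshold=2.0))  # [350]
```

A z-score check breaks down on seasonal or multi-modal data; that is precisely where the learned pattern recognition of neural networks earns its keep.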

What are the differences between “reinforcement,” “supervised” and “unsupervised learning?”

There are typically three ways in which machines “learn.” These include:

Reinforcement learning happens when machines are trained to use experimentation and reward to arrive at their outputs. As in psychology, the machine is rewarded when it delivers desirable outputs and “punished” when it does otherwise.

Supervised learning happens when programmers are actively engaged in the machine’s “learning” process. Programmers intervene at almost every step including the curation of data sets and manually validating outputs to ensure that the desired output is achieved.

Unsupervised learning is the opposite of the above. In this scenario the machine is allowed to arrive at whatever conclusions it can based on the data fed into the program.
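The contrast between supervised and unsupervised learning can be shown in a toy sketch (reinforcement learning additionally needs an environment and a reward signal, so it is left out here). All the numbers and labels below are hypothetical CPU readings:

```python
# Supervised: labels ("ok"/"overloaded") are provided, and the "model"
# is just the midpoint between the two class means.
def train_supervised(samples):
    ok = [x for x, label in samples if label == "ok"]
    bad = [x for x, label in samples if label == "overloaded"]
    return (sum(ok) / len(ok) + sum(bad) / len(bad)) / 2  # decision threshold

# Unsupervised: no labels; a few rounds of 1-D k-means let the data
# separate into two groups on its own.
def cluster_two(values, iters=10):
    lo, hi = min(values), max(values)          # initial cluster centres
    for _ in range(iters):
        a = [v for v in values if abs(v - lo) <= abs(v - hi)]
        b = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = sum(a) / len(a), sum(b) / len(b)
    return lo, hi                              # the two cluster centres

labeled = [(20, "ok"), (25, "ok"), (90, "overloaded"), (95, "overloaded")]
threshold = train_supervised(labeled)          # 57.5
centres = cluster_two([20, 25, 90, 95])        # (22.5, 92.5)
print(threshold, centres)
```

Both approaches discover the same structure in this trivial case, but only the supervised version knows what to call each group; the unsupervised version finds the clusters and leaves the naming to a human.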

Why do these differences matter to DevOps and SREs?

Similar to the importance of understanding the role data sets play in how a machine learns, it is also important for DevOps teams to understand the different ways in which a machine actually “learns.” By understanding the methods in which a machine comes to a conclusion or recommendation, an SRE can have greater confidence (or less) in the machine’s prediction.