A Deeper Look into Machine Learning Algorithms & Natural Language Understanding for Site Reliability Engineers

Recently we introduced SignifAI Decisions – the most flexible, intuitive correlation engine for SRE and DevOps teams. Beyond activating automatically generated logic and building basic Decisions to reduce alert noise, there is a whole host of features that help create stronger correlations and give you deeper insight into your production system. In this post I'll provide a broad set of examples and principles for machine intelligence, and how I believe anyone should think about solving the SRE use case with advanced intelligence capabilities. My examples here use the SignifAI platform, but you could take the same principles, explore other solutions, or even implement them in a standalone proprietary system. My goal is to provide more information and context on how I believe the market needs to start tackling these hard problems.

Consolidate events with anomaly detection

One of the first aspects to consider when evaluating multiple events is time and volume. The narrower the time window used to evaluate those events, and the better you can control how repetition is handled, the more control you gain over the resulting correlations. In the advanced mode of SignifAI's Decision builder, you can specify a timeframe and a minimum number of incidents for a Decision. The timeframe is the maximum amount of time between incoming incidents for them to be correlated by the logic you specify in the builder. Specifying a shorter timeframe for broader, more generic logic and a longer timeframe for very narrow logic helps ensure the accuracy and relevance of your correlations.

You can also specify the minimum number of incidents that need to match the Decision logic before being correlated. This is useful in cases where a large number of incoming incidents, or a spike in the average incident volume, would indicate a correlation – for example, in the event of a datacenter outage, when several outage or unresponsive incidents could be created for totally separate applications that are all dependent on the same datacenter.
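To make the mechanics concrete, here is a minimal sketch in Python – with hypothetical names, not the SignifAI implementation – of how a time window and a minimum incident count can gate a correlation:

    from datetime import timedelta

    def correlate(incidents, max_gap=timedelta(minutes=10), min_incidents=3):
        """incidents: (timestamp, payload) tuples, already filtered by the
        Decision's matching logic and sorted by timestamp."""
        groups, current = [], []
        for ts, payload in incidents:
            # If the gap to the previous incident exceeds the timeframe,
            # close the current group and start a new one.
            if current and ts - current[-1][0] > max_gap:
                if len(current) >= min_incidents:
                    groups.append(current)  # enough matches: emit a correlation
                current = []
            current.append((ts, payload))
        if len(current) >= min_incidents:
            groups.append(current)
        return groups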


Change Issue priority based on a correlation event

Individual incidents usually have a priority specified by their source – for example, outage events may be classified as “critical,” while minor blips in performance could be “medium.” However, when several incidents are correlated, the meaning of the event can change. In cases like these, you may want the priority of the correlated Issue to be automatically updated to reflect this insight. For example, a spike in low-priority incidents could indicate an underlying, higher-priority problem. In the Decision builder, you can specify the priority of Issues that are correlated based on your logic – by default, SignifAI uses the highest priority among the incoming incidents.
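As a trivial sketch of that default (the priority names and ordering here are hypothetical, not SignifAI's API):

    # Hypothetical priority ordering; the correlated Issue takes the highest
    # priority among its incoming incidents.
    PRIORITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

    def issue_priority(incident_priorities):
        return max(incident_priorities, key=PRIORITY_RANK.get)

    print(issue_priority(["low", "medium", "critical"]))  # critical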

Specify the right measure of similarity for your use case

A similarity algorithm that works perfectly for short strings, like healthcheck names or event sources, can be counterproductive for comparisons between longer blocks of text (e.g., descriptions).

Here are some of the similarity algorithms we have seen work well on real-world production data:

The Levenshtein distance, also known as edit distance, between two strings is the minimum number of single-character edits to get from one string to the other. Allowed edit operations are deletion, insertion, and substitution. Some examples:

  • number/bumble: 3 (number → bumber → bumblr → bumble)

  • trying/lying: 2 (trying → rying → lying)

  • strong/through: 4 (strong → trong → throng → throug → through)

Some common applications of Levenshtein distance include spelling checkers, computational biology, and speech recognition. The default similarity threshold for new SignifAI Decisions is an edit distance of 3 – that can be changed in the Advanced mode of the Decision builder.
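For reference, here is a minimal Python implementation of the classic dynamic-programming algorithm (a sketch, not SignifAI's internal code) that reproduces the distances above:

    def levenshtein(a, b):
        """Minimum number of single-character insertions, deletions, and
        substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    assert levenshtein("number", "bumble") == 3
    assert levenshtein("trying", "lying") == 2
    assert levenshtein("strong", "through") == 4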

The Jaro-Winkler distance - this metric uses a scale of 0-1 to indicate the similarity between two strings, where 0 is no similarity (0 matching characters between strings) and 1 is an exact match. Jaro-Winkler similarity takes into account:

  • Matching: two characters that are the same and in similar positions in the strings. 

  • Transpositions: matching characters that are in different sequence order in the strings.

  • Prefix scale: the Jaro-Winkler distance is adjusted favorably if strings match from the beginning (a prefix is up to 4 characters). 

This is a useful metric for cases where identical prefixes are a strong indication of correlation.
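A straightforward Python sketch of the metric – the standard textbook formulation, with the usual prefix-scaling factor of 0.1 – looks like this:

    def jaro(s1, s2):
        """Jaro similarity: 0.0 (no match) to 1.0 (exact match)."""
        if s1 == s2:
            return 1.0
        window = max(len(s1), len(s2)) // 2 - 1
        m1, m2 = [False] * len(s1), [False] * len(s2)
        matches = 0
        for i, c in enumerate(s1):  # match characters within the window
            lo, hi = max(0, i - window), min(len(s2), i + window + 1)
            for j in range(lo, hi):
                if not m2[j] and s2[j] == c:
                    m1[i] = m2[j] = True
                    matches += 1
                    break
        if matches == 0:
            return 0.0
        k = half_transpositions = 0
        for i in range(len(s1)):  # matched characters out of sequence order
            if m1[i]:
                while not m2[k]:
                    k += 1
                if s1[i] != s2[k]:
                    half_transpositions += 1
                k += 1
        t = half_transpositions / 2
        return (matches / len(s1) + matches / len(s2)
                + (matches - t) / matches) / 3

    def jaro_winkler(s1, s2, p=0.1):
        """Boost the Jaro score for strings sharing a prefix (up to 4 chars)."""
        j = jaro(s1, s2)
        prefix = 0
        for c1, c2 in zip(s1[:4], s2[:4]):
            if c1 != c2:
                break
            prefix += 1
        return j + prefix * p * (1 - j)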

The Longest Common Subsequence (LCS) distance - a variation of Levenshtein distance with a more limited set of allowed edit operations. The LCS distance between two strings is the number of single-character insertions or deletions needed to change one string into the other. LCS is most useful in situations where the characters that belong to both strings matter most – in other words, if there are a lot of “garbage” or noisy characters in your strings, LCS is a useful metric, since it concentrates on the shared characters to determine similarity.
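A minimal sketch: the distance falls out of the classic longest-common-subsequence computation.

    def lcs_length(a, b):
        """Length of the longest common subsequence of a and b."""
        prev = [0] * (len(b) + 1)
        for ca in a:
            curr = [0]
            for j, cb in enumerate(b, 1):
                curr.append(prev[j - 1] + 1 if ca == cb
                            else max(prev[j], curr[j - 1]))
            prev = curr
        return prev[-1]

    def lcs_distance(a, b):
        """Edit distance when only insertions and deletions are allowed."""
        return len(a) + len(b) - 2 * lcs_length(a, b)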

The Hamming distance - a simpler cousin of “edit distance” metrics like Levenshtein distance, the Hamming distance between two strings is the number of substitutions required to turn one string into the other. It is computed by counting the number of positions with different characters between the two strings. For example, in the strings below, the Hamming distance is 2 – the characters differ in two positions (w vs. r, and r vs. t):

                           flowers/florets

Hamming distance requires the compared strings to be of equal length. This is a useful similarity metric for situations where the difference between two strings may be due to typos, or where you want to compare two attributes with known lengths (e.g., an incident ID, or an instance hostname or Kubernetes pod name that follows a specific convention).
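The implementation is about as simple as similarity metrics get – a sketch:

    def hamming(a, b):
        """Number of positions at which two equal-length strings differ."""
        if len(a) != len(b):
            raise ValueError("Hamming distance requires equal-length strings")
        return sum(c1 != c2 for c1, c2 in zip(a, b))

    assert hamming("flowers", "florets") == 2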

Cosine distance - another similarity metric we use in our Decision engine to correlate related issues. It's most commonly used to compare large blocks of text (for example, incident descriptions) and provides an easy visualization of similarity. To obtain the cosine distance between two text blocks, a vector is calculated for each block representing the count of each unique word in the block. The cosine of the angle between the two resulting vectors is the similarity between them. For example, take these two strings:

            It is not length of life, but depth of life.

            Depth of life does not depend on length.

Here are the word counts for these sentences (first sentence, second sentence):

it: 1, 0
is: 1, 0
not: 1, 1
length: 1, 1
of: 2, 1
life: 2, 1
but: 1, 0
depth: 1, 1
does: 0, 1
depend: 0, 1
on: 0, 1

…and here are those counts represented as vectors:

[1, 1, 1, 1, 2, 2, 1, 1, 0, 0, 0]

[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]

The cosine of the angle between these vectors is about 0.66 (an angle of roughly 49 degrees). A cosine similarity of 1 – an angle of 0 degrees – means the word distributions are identical.

As you can see, this metric is less useful in situations where small character differences between words should be ignored, e.g., typos – “nagios” and “nagois” would be treated as completely different words.

Also, cosine distance ignores word order in the text blocks – for example, these two sentences have a cosine similarity of 1 even though they read completely differently:

this is the first of the sentences, it is unscrambled

the sentences is unscrambled, the first of this it is
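Here is a minimal sketch of the word-count approach described above (with simple regex tokenization; a production tokenizer would be more careful):

    import math
    import re
    from collections import Counter

    def tokenize(text):
        return Counter(re.findall(r"[a-z]+", text.lower()))

    def cosine_similarity(text1, text2):
        """Cosine of the angle between the word-count vectors of two texts."""
        v1, v2 = tokenize(text1), tokenize(text2)
        dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
        norm = (math.sqrt(sum(c * c for c in v1.values()))
                * math.sqrt(sum(c * c for c in v2.values())))
        return dot / norm if norm else 0.0

    print(cosine_similarity("It is not length of life, but depth of life.",
                            "Depth of life does not depend on length."))  # ~0.66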

One of my favorites is the Fuzzy Score. Between two strings, a high fuzzy score indicates high similarity. The fuzzy score algorithm works by allocating “points” for character matches between strings:

  • One point for each matching character

  • Two bonus points for each consecutive match

Note that there are multiple implementation options for the fuzzy score; the above is the most basic one. We have implemented several more sophisticated fuzzy algorithms (inspired by SeatGeek) that work much better on incident text.

Example: SignifAI / sgnfai

Single-character matches: s, g, n, f, a, i – 1 point each (6 points)

Consecutive-match bonuses: gn, fa, ai – 2 points each (6 points)

= 12 points

The fuzzy score is most useful for relatively short strings that contain non-trivial patterns or characters in a specific order.
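A sketch of the basic variant (the same scoring rules as Apache Commons Text's FuzzyScore) that reproduces the example above:

    def fuzzy_score(term, query):
        """One point per query character found, in order, in the term;
        two bonus points whenever a match is adjacent to the previous one."""
        term, query = term.lower(), query.lower()
        score, start, last_match = 0, 0, -2
        for qc in query:
            for i in range(start, len(term)):
                if term[i] == qc:
                    score += 1
                    if last_match + 1 == i:  # consecutive-match bonus
                        score += 2
                    last_match, start = i, i + 1
                    break
        return score

    assert fuzzy_score("SignifAI", "sgnfai") == 12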

The SignifAI Decision engine gives you the flexibility to choose the right measure of similarity for your logic (in addition to the automatic capabilities), with many options for different use cases – plus active suggestions to help you choose, based on simulations running on top of your actual past data. Simulating changes against real data is critical: it's the only way to show how a change in a particular algorithm or ML model actually makes a difference to your data. I believe it may be less important to give the end user full control of the model (with deep learning models it is sometimes not even possible), but providing explainability, reasoning, and simulation capabilities to the expert user is essential to gaining trust – especially in the Site Reliability Engineering use case. I often hear comments like this:

"Well, this is all great and I have no clue what a Fuzzy score means, but I would want to play with the data and change the different options to verify it makes sense before I could completely trust the system...And if I know the Automatic suggestions are using the same models and algorithms, explaining it to me in simple terms would allow me to just trust it as is..."


Compare entire incidents with multiple algorithm options, including machine learning

With SignifAI Decisions, you can create broad logic at the entire incident scale that works seamlessly with your attribute-specific rules. Choose between two correlation metrics:

Jaccard similarity – you can compare the similarity of entire incidents based on keyword matches. The Jaccard index, sometimes referred to as the Jaccard similarity coefficient, is one of the simplest measures of similarity to understand – the index, expressed as a percentage (0 being completely dissimilar; 100 being very similar), is calculated with the following formula:

(# of characters in both sets) / (# of characters in either set) * 100

In other words, the Jaccard index is the number of shared characters divided by the total number of characters (shared and un-shared). In the SignifAI Decision builder, you can use Jaccard similarity to compare entire incidents and set a similarity threshold (10 – 99%). 

Jaccard distance is very easy to interpret and especially useful with large data sets – for example, when comparing the similarity between two entire incidents (as opposed to one attribute). It is less effective for small data sets or situations with missing data. Read more about the implementation of Jaccard distance here.
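As a sketch, over character sets to match the formula above (the same shape works for keyword/token sets):

    def jaccard_similarity(a, b):
        """Jaccard index over character sets, expressed as a percentage."""
        s1, s2 = set(a.lower()), set(b.lower())
        union = s1 | s2
        return len(s1 & s2) / len(union) * 100 if union else 0.0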

Categorical clustering – this option uses a machine learning algorithm to continuously determine clusters of similar events among all incoming data. In order for incidents to be correlated based on this operator, they must satisfy two conditions: 1) They must be in the same cluster, and 2) The distance between each incident and the center of the cluster must be less than the threshold distance.

Let’s break those down a little further. The algorithm continuously categorizes incidents into clusters based on an intelligent combination of attribute values and time series data. The center of each cluster is the incident which represents the mode (highest frequency of common attributes) of all the values in the cluster, and the distance from the center indicates how similar each incident is to that mode. In this visualization, each color represents a different cluster:

[Cluster visualization: each color represents a different cluster, with incidents 1-4 plotted by their distance from the cluster centers]

The threshold you set in the rule builder is normalized to represent the distance between an individual incident and the center of its cluster. A high threshold represents a short distance (high correlation) between two incidents in a cluster. 

In the following example, let's assume incidents 1-4 match all of your attribute-specific logic (e.g., source equals nagios). Using categorical clustering, if you set a threshold of 95%, no incidents would be correlated. A threshold of 75% catches incidents 1 and 2, and a threshold of 0% catches incidents 1, 2, and 3. Incident 4 would never be correlated, since it's not in the same cluster.

Categorical clustering is a great approach for incidents that may look fairly different at the attribute-specific level, but are related at the entire incident/time series data level. For example, incidents 2 and 3 may not be correlated by a high-threshold Jaccard distance, but could be correlated with categorical clustering.
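As an illustrative sketch (the names here are hypothetical, not SignifAI's API), the two correlation conditions reduce to:

    def correlated(a, b, cluster_of, distance_from_center, threshold_pct):
        """a, b: incident IDs; cluster_of maps incident -> cluster label;
        distance_from_center maps incident -> normalized distance (0.0-1.0).
        A higher threshold tolerates only shorter distances to the center."""
        if cluster_of[a] != cluster_of[b]:
            return False  # condition 1: must be in the same cluster
        max_distance = 1.0 - threshold_pct / 100.0
        return (distance_from_center[a] <= max_distance and
                distance_from_center[b] <= max_distance)  # condition 2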

You can learn more about categorical clustering algorithms here.

Stay tuned for another post about deep clustering, a comparison algorithm that's currently in private beta and implements a neural-net approach for automatic classification, trained explicitly on incident data.


Automatically compare custom or dynamic attributes, no tagging required

With subtree logic in the SignifAI Decision builder, you don't need to know the exact name of each event attribute – you can build flexible logic based on just a prefix string, like aws. No need to spend hours tagging individual attributes, or updating individual Decisions if your event schema changes – it just works. This quality is essential when designing an intelligent system. We want the system to be usable from the start, and when it comes to digital infrastructure, systems and architectures keep changing and dynamically adapting to business needs. It would be unreasonable to demand that users keep tagging, training, or informing an intelligent solution about those constant changes. A truly intelligent system puts the emphasis on internal data modeling, learning from the data itself and minimizing user involvement as much as possible. That's not to say we dismiss user feedback – it can inform a better supervised model.
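Here's an illustrative sketch of the idea (the event shape and names are hypothetical): match any attribute whose key starts with a given prefix, with no exact names or tags needed.

    def subtree_match(event, prefix, predicate):
        """True if any attribute under the key prefix satisfies the predicate."""
        return any(key.startswith(prefix) and predicate(value)
                   for key, value in event.items())

    event = {"aws.region": "us-east-1", "aws.ec2.instance": "i-0abc123",
             "source": "cloudwatch"}
    print(subtree_match(event, "aws", lambda v: "us-east" in str(v)))  # True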


Leverage automatic NLP classification to determine alert symptoms

One of the most unexplored fields in incident management is natural language understanding/processing. If you think about it, the most useful information in incident data is actually text based. Because of that, it is a perfect candidate for NLP techniques for classification and reasoning. SignifAI's NLP classifier runs machine learning algorithms over all of your incoming data to determine the best-matching predefined classes for each event, and then exposes those classes in the Decision engine to provide context for stronger correlations.

Here is one example: the classes and subclasses that SignifAI automatically identifies are “symptoms” of your production system, with the highest-level classes defined by the “4 golden signals” described in the Google SRE book, plus one more (availability):

  • Errors

  • Load

  • Latency

  • Saturation

  • Availability

We keep training the models on an SRE/DevOps incident dataset, so you don't need to do any tagging or configuration work to use these classes. They're available automatically in the Decision builder and in Suggested Decisions.
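To make the idea concrete, here is a minimal sketch of classifying incident text into these symptom classes – a simple bag-of-words model with made-up training rows, nothing like the production models or dataset:

    # Toy illustration only: the real models and training data are far richer.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "HTTP 500 rate exceeded threshold on api-server",      # errors
        "CPU load average above 20 on worker nodes",           # load
        "p99 request latency above 2s for checkout service",   # latency
        "disk usage at 95% on database volume",                # saturation
        "health check failing, service unreachable",           # availability
    ]
    train_labels = ["errors", "load", "latency", "saturation", "availability"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression())
    model.fit(train_texts, train_labels)
    print(model.predict(["p95 response time degraded on payment gateway"]))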

Conclusions

There is so much more to write about, and I'm going to continue in the coming weeks. In the meantime, here are the most critical points as takeaways:

  • Machine intelligence for SREs must first take the expert user into account, providing the complete flexibility and broad capabilities that technical users expect.

  • A combination of approaches is required to achieve good results. Betting on a single approach is probably not going to cut it for the SRE use case.

  • Explainability of the models and automatic capabilities is essential to gaining the user's trust and engagement. Being able to show and explain why a specific action was taken – a combination of controlling the model and involving the expert user – is key.

  • Categorical classification – combined with multiple similarity algorithms, NLP approaches, and supervised and unsupervised ML models running in parallel and over time – is a blended approach that provides immediate value to the user and allows improvement over time based on historical and real-time data.

One last point: I have always admired John Allspaw for this blog post. I think it is very accurate that software will assist a team of expert humans who will always be smarter than generic code. However, I also believe the market today is hardly taking the right approaches to the specific SRE use case. It is a sophisticated use case and a significant problem to try to solve, but not many AI/ML data scientists and researchers have a deep understanding of the user and the use case, and so we find ourselves with generic intelligent models that are hardly relevant to it.

We are here to change that.