A Deeper Look into Machine Learning Algorithms & Natural Language Understanding for Site Reliability Engineers

Recently we introduced SignifAI Decisions – the most flexible, intuitive correlation engine for SRE and DevOps teams. Beyond activating automatically generated logic and building basic Decisions to reduce alert noise, there is a whole host of features that help create stronger correlations and give you deeper insight into your production system. In this post I'll provide a broad set of examples and principles for machine intelligence, and how I believe anyone should think about solving the SRE use case with advanced intelligence capabilities. My examples here use the SignifAI platform, but you could take the same principles, explore other solutions, and even implement them as a standalone proprietary solution. My goal is to provide more information and context on how I believe the market needs to start tackling these hard problems.

Consolidate events with anomaly detection

One of the first aspects to consider when evaluating multiple events is time and volume. The narrower the time window for evaluating those events, and the finer your control over repetition, the more control you gain over the resulting correlations. In the advanced mode of SignifAI's Decision builder, you can specify a timeframe and a minimum number of incidents for a Decision. The timeframe is the maximum amount of time between incoming incidents for them to be correlated based on the logic you specify in the builder. Specifying a shorter timeframe for broader, more generic logic and a longer timeframe for very narrow logic can help ensure the accuracy and relevance of your correlations.

You can also specify the minimum number of incidents that need to match the Decision logic before being correlated. This is useful in cases where a large number of incoming incidents, or a spike in the average incident volume, would indicate a correlation – for example, in the event of a datacenter outage, when several outage or unresponsive incidents could be created for totally separate applications that are all dependent on the same datacenter.
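
To make these two controls concrete, here is a minimal sketch (in Python, with a hypothetical incident structure – this illustrates the idea, not SignifAI's actual implementation) of how a time window and a minimum incident count might gate a correlation:

```python
from datetime import datetime, timedelta

# Hypothetical incidents that already matched the Decision's logic, sorted by arrival time.
incidents = [
    {"id": "a1", "received_at": datetime(2019, 5, 1, 10, 0, 0)},
    {"id": "a2", "received_at": datetime(2019, 5, 1, 10, 3, 0)},
    {"id": "a3", "received_at": datetime(2019, 5, 1, 10, 7, 0)},
    {"id": "a4", "received_at": datetime(2019, 5, 1, 11, 30, 0)},
]

def group_into_issues(incidents, timeframe=timedelta(minutes=15), min_incidents=2):
    """Group incidents whose inter-arrival gap stays within `timeframe`,
    then keep only the groups that reach `min_incidents`."""
    groups, current = [], []
    for incident in incidents:
        if current and incident["received_at"] - current[-1]["received_at"] > timeframe:
            groups.append(current)
            current = []
        current.append(incident)
    if current:
        groups.append(current)
    # Only groups large enough to satisfy the minimum are correlated.
    return [g for g in groups if len(g) >= min_incidents]

for issue in group_into_issues(incidents):
    print([i["id"] for i in issue])   # -> ['a1', 'a2', 'a3']
```

Incidents a1–a3 arrive within a few minutes of each other and form a candidate Issue; a4 arrives much later and is left alone because its group never reaches the minimum count.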


Change Issue priority based on a correlation event

Individual incidents usually have a priority specified by their source – for example, outage events may be classified as “critical,” while minor blips in performance could be “medium.” However, when several incidents are correlated, the meaning of the event could change. In cases like these, you may want the priority of the correlated Issue to be automatically updated in order to reflect this insight. For example, a spike in low-priority incidents could mean an underlying, higher-priority problem. In the Decision builder, you can specify the priority of Issues that are correlated based on your logic – by default, SignifAI uses the highest priority among the incoming incidents.

Specify the right measure of similarity for your use case

A similarity algorithm that works perfectly for short strings, like healthcheck names or event sources, could be counterproductive for comparisons between longer blocks of text (e.g. descriptions).

Here are some of the most useful similarity algorithms we have seen work well on real-world production data:

The Levenshtein distance, also known as edit distance, between two strings is the minimum number of single-character edits to get from one string to the other. Allowed edit operations are deletion, insertion, and substitution. Some examples:

  • number/bumble: 3 (number → bumber → bumblr → bumble)

  • trying/lying: 2 (trying → rying → lying)

  • strong/through: 4 (strong → trong → throng → throug → through)

Some common applications of Levenshtein distance include spelling checkers, computational biology, and speech recognition. The default similarity threshold for new SignifAI Decisions is an edit distance of 3 – that can be changed in the Advanced mode of the Decision builder.
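
For readers who want to experiment, here is a minimal dynamic-programming sketch of the edit distance (not the optimized version a production correlation engine would use):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("number", "bumble"))    # 3
print(levenshtein("trying", "lying"))     # 2
print(levenshtein("strong", "through"))   # 4
```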

The Jaro-Winkler distance - despite the name, this metric is a similarity score on a scale of 0 to 1, where 0 means no similarity (no matching characters between the strings) and 1 is an exact match. Jaro-Winkler similarity takes into account:

  • Matching: two characters that are the same and in similar positions in the strings. 

  • Transpositions: matching characters that are in different sequence order in the strings.

  • Prefix scale: the Jaro-Winkler distance is adjusted favorably if strings match from the beginning (a prefix is up to 4 characters). 

This is a useful metric for cases where identical prefixes are a strong indication of correlation.
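
Here is an unoptimized sketch of the metric as described above (the strings in the example are made up for illustration):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity, based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if identical and no further apart than half the longer string, minus one.
    window = max(len(s1), len(s2)) // 2 - 1
    s1_flags, s2_flags = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len(s2))):
            if not s2_flags[j] and s2[j] == c:
                s1_flags[i] = s2_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matching characters that appear in a different order.
    s1_matched = [c for i, c in enumerate(s1) if s1_flags[i]]
    s2_matched = [c for j, c in enumerate(s2) if s2_flags[j]]
    transpositions = sum(a != b for a, b in zip(s1_matched, s2_matched)) / 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score when the strings share a common prefix (up to 4 characters)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))                  # 0.961
print(round(jaro_winkler("prod-api-gateway", "prod-api-gatewy"), 3))
```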

The Longest Common Subsequence (LCS) distance - a variation of Levenshtein distance with a more limited set of allowed edit operations. The LCS distance between two strings is the number of single-character insertions or deletions needed to change one string into the other. LCS is most useful in situations where the characters that belong to both strings are the most important – in other words, if there are a lot of “garbage” or noisy characters in your strings, LCS is a useful metric, since it concentrates on the shared characters to determine similarity.
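
A compact sketch of this metric, using the identity that the insert/delete-only distance equals len(a) + len(b) minus twice the length of the longest common subsequence (the alert strings below are made up):

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of `a` and `b`."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def lcs_distance(a: str, b: str) -> int:
    """Edit distance when only insertions and deletions are allowed."""
    return len(a) + len(b) - 2 * lcs_length(a, b)

# Shared characters dominate the score, so surrounding "noise" matters less.
print(lcs_distance("disk-full /dev/sda1", "disk-full /dev/sdb2"))  # 4
```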

The Hamming distance - a simpler version of “edit distance” metrics like Levenshtein distance, the Hamming distance between two strings is the number of substitutions required to turn one string into the other. It is computed by counting the positions where the two strings have different characters. For example, in the strings below, the Hamming distance is 2 – the fourth and sixth characters differ:

                           flowers/florets

Hamming distance requires the compared strings to be of equal length. This is a useful similarity metric for situations where the difference between two strings may be due to typos, or where you want to compare two attributes with known lengths (e.g. an incident ID, an instance hostname, or a Kubernetes pod name that follows a specific convention).
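
A minimal sketch (the pod names below are hypothetical):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))

print(hamming("flowers", "florets"))                      # 2
print(hamming("payments-pod-7f9c", "payments-pod-7f8d"))  # 2
```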

Cosine distance is another similarity metric we use in our Decision engine to correlate related issues. It’s most commonly used to compare large blocks of text (for example, incident descriptions) and provides an easy visualization of similarity. To obtain the cosine distance between two text blocks, a vector is calculated for each block representing the count of each unique word in the block. The cosine of the angle between the two resulting vectors is then the similarity between them. For example, take these two strings:

            It is not length of life, but depth of life.

            Depth of life does not depend on length.

Here are the word counts for these sentences:

word        sentence 1   sentence 2
it              1            0
is              1            0
not             1            1
length          1            1
of              2            1
life            2            1
but             1            0
depth           1            1
does            0            1
depend          0            1
on              0            1

…and here are those counts represented as vectors (sentence 1 first):

[1, 1, 1, 1, 2, 2, 1, 1, 0, 0, 0]

[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]

The angle between these vectors is about 0.85 radians (~49 degrees), which corresponds to a cosine similarity of roughly 0.66. A cosine similarity of 1 (an angle of 0) is the highest possible similarity.

As you can see, this metric is less useful in situations where small character differences between words should be insignificant, e.g. typos – “nagios” and “nagois” would be treated as completely different words.

Also, cosine distance ignores word order in the text blocks – for example, these two sentences have a cosine similarity of 1 even though they read completely differently:

this is the first of the sentences, it is unscrambled

the sentences is unscrambled, the first of this it is
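
Here is a small sketch of the word-count approach, applied to the “depth of life” example above; it uses naive whitespace tokenization, whereas a real system would tokenize and normalize text more carefully:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().replace(",", "").replace(".", "").split())
    counts_b = Counter(text_b.lower().replace(",", "").replace(".", "").split())
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

a = "It is not length of life, but depth of life."
b = "Depth of life does not depend on length."
similarity = cosine_similarity(a, b)
print(round(similarity, 2))                          # ~0.66
print(round(math.degrees(math.acos(similarity))))    # ~49 degrees
```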

One of my favorites is the Fuzzy Score. Between two strings, a high fuzzy score indicates high similarity. The fuzzy score algorithm works by allocating “points” for character matches between strings:

  • One point for each matching character

  • Two bonus points each time a match immediately follows the previous matching character

Note that there are multiple implementation options for the fuzzy score; the above is the most basic one, and we have implemented several more sophisticated fuzzy algorithms (inspired by SeatGeek) that work much better on incident text.

Example: SignifAI / sgnfai

Single-character matches: s, g, n, f, a, i → 6 points

Adjacent-match bonuses: gn, fa, ai → 6 points

= 12 points

The fuzzy score is most useful for relatively short strings, and for cases where the matching characters follow non-trivial patterns or appear in a specific order.
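
A sketch of the basic scoring scheme described above (one point per in-order match, two bonus points whenever a match directly follows the previous one); the more sophisticated variants differ mainly in how they tokenize and weight matches:

```python
def fuzzy_score(term: str, query: str) -> int:
    """One point for every query character found, in order, in the term;
    two bonus points each time a match directly follows the previous match."""
    term, query = term.lower(), query.lower()
    score = 0
    search_from = 0
    previous_match = None
    for ch in query:
        # Scan the term left to right, never revisiting earlier characters.
        for i in range(search_from, len(term)):
            if term[i] == ch:
                score += 1
                if previous_match is not None and i == previous_match + 1:
                    score += 2        # adjacent to the previous match
                previous_match = i
                search_from = i + 1
                break
    return score

print(fuzzy_score("SignifAI", "sgnfai"))   # 12 (6 matches + 3 adjacency bonuses)
```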

The SignifAI Decision engine gives you the flexibility to choose the right measure of similarity for your logic (in addition to the automatic capabilities), with many options for different use cases. It also provides active suggestions to help you choose, based on simulation capabilities that run on top of your actual past data. Simulating changes based on real data is critical: it's the only way to show how a change in a particular algorithm or ML model actually makes a difference to your data. I deeply believe it may be less important to provide full model control to the end user (and in the case of deep learning models it is sometimes not even possible), but providing explainability, reasoning, and simulation capabilities to the expert user is very important for gaining trust, especially in the Site Reliability Engineering use case. I often get comments like this:

"Well, this is all great and I have no clue what a Fuzzy score means, but I would want to play with the data and change the different options to verify it makes sense before I could completely trust the system...And if I know the Automatic suggestions are using the same models and algorithms, explaining it to me in simple terms would allow me to just trust it as is..."


Compare entire incidents with multiple algorithm options, including machine learning

With SignifAI Decisions, you can create broad logic at the entire incident scale that works seamlessly with your attribute-specific rules. Choose between two correlation metrics:

Jaccard similarity – you can compare the similarity of entire incidents based on keyword matches. The Jaccard index, sometimes referred to as the Jaccard similarity coefficient, is one of the simplest measures of similarity to understand – the index, denoted as a percentage (0 being completely dissimilar; 100 being very similar), is calculated with the following formula:

(# of characters in both sets) / (# of characters in either set) * 100

In other words, the Jaccard index is the number of shared characters divided by the total number of characters (shared and un-shared). In the SignifAI Decision builder, you can use Jaccard similarity to compare entire incidents and set a similarity threshold (10 – 99%). 

The Jaccard index is very easy to interpret and especially useful in cases with large data sets – for example, comparing the similarity between two entire incidents (as opposed to one attribute). It is less effective for small data sets or situations with missing data. Read more about the implementation of Jaccard distance here.
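
As a quick illustration, here is the character-set version of the index described above (the incident descriptions are made up):

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Share of characters that appear in both strings, as a percentage."""
    set_a, set_b = set(text_a.lower()), set(text_b.lower())
    if not set_a and not set_b:
        return 100.0
    return len(set_a & set_b) / len(set_a | set_b) * 100

incident_a = "mysql replication lag on db-3"
incident_b = "mysql replica db-3 lagging"
print(round(jaccard_similarity(incident_a, incident_b)))   # 90
```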

Categorical clustering – this option uses a machine learning algorithm to continuously determine clusters of similar events among all incoming data. In order for incidents to be correlated based on this operator, they must satisfy two conditions: 1) They must be in the same cluster, and 2) The distance between each incident and the center of the cluster must be less than the threshold distance.

Let’s break those down a little further. The algorithm continuously categorizes incidents into clusters based on an intelligent combination of attribute values and time series data. The center of each cluster is the incident which represents the mode (highest frequency of common attributes) of all the values in the cluster, and the distance from the center indicates how similar each incident is to that mode. In this visualization, each color represents a different cluster:

[Figure: cluster visualization – each color represents a different cluster]

The threshold you set in the rule builder is normalized to represent the distance between an individual incident and the center of its cluster. A high threshold represents a short distance (high correlation) between two incidents in a cluster. 

In the following example, let’s assume incidents 1-4 match all of your attribute-specific logic (e.g. source equals nagios). Using categorical clustering, if you set a threshold of 95%, no incidents would be correlated. A threshold of 75% catches incidents 1 and 2, and a threshold of 0% catches incidents 1, 2, and 3. Incident 4 would never be correlated, since it’s not in the same cluster.

Categorical clustering is a great approach for incidents that may look fairly different at the attribute-specific level, but are related at the entire incident/time series data level. For example, incidents 2 and 3 may not be correlated by a high-threshold Jaccard distance, but could be correlated with categorical clustering.
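
To make “mode of the cluster” and “distance from the center” concrete, here is a toy sketch (not SignifAI's actual clustering model) using a simple matching distance over categorical attributes: the center is the per-attribute mode, and an incident's similarity is the fraction of its attributes that agree with that mode.

```python
from collections import Counter

# Hypothetical incidents that were already assigned to the same cluster.
cluster = [
    {"source": "nagios", "region": "us-east-1", "service": "checkout", "symptom": "latency"},
    {"source": "nagios", "region": "us-east-1", "service": "checkout", "symptom": "errors"},
    {"source": "nagios", "region": "us-east-1", "service": "payments", "symptom": "saturation"},
]

def cluster_center(incidents):
    """The 'mode' incident: the most frequent value of each attribute."""
    return {
        key: Counter(incident[key] for incident in incidents).most_common(1)[0][0]
        for key in incidents[0]
    }

def similarity_to_center(incident, center):
    """Share of attributes that agree with the cluster's mode (1.0 = identical)."""
    return sum(incident[k] == center[k] for k in center) / len(center)

center = cluster_center(cluster)
for n, incident in enumerate(cluster, start=1):
    print(f"incident {n}: {similarity_to_center(incident, center):.0%}")
# incident 1: 100%, incident 2: 75%, incident 3: 50%
```

Lowering the threshold admits incidents that sit further from the mode, which is the intuition behind the 95% / 75% / 0% example above; incidents in a different cluster are never compared against this center at all.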

You can learn more about categorical clustering algorithms here.

Stay tuned for another post about deep clustering, another comparison algorithm that’s currently in private beta. It implements a neural network approach to automatic classification, trained explicitly on incident data.


Automatically compare custom or dynamic attributes, no tagging required

With subtree logic in the SignifAI Decision builder, you don’t need to know the exact name of each event attribute – you can build flexible logic based on just a prefix string, like aws. There is no need to spend hours tagging individual attributes, or to update individual Decisions if your event schema changes – it just works. This kind of capability is essential when designing an intelligent system. We really want the system to be usable from the start, and when it comes to digital infrastructure, systems and architectures keep changing and dynamically adapting to the needs of the business. It would be impossible to demand that users keep tagging, training or informing an intelligent solution about those constant changes. A truly intelligent system puts much of its emphasis on internal data modeling, learning from the data itself and avoiding user involvement as much as possible. That's not to say we want to minimize user feedback that can help form a better supervised model.
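
Conceptually, prefix-based matching looks something like the sketch below (the attribute names and the rule are hypothetical, chosen only to illustrate the idea):

```python
# Hypothetical event payload with dynamically named attributes.
event_attributes = {
    "aws.region": "us-east-1",
    "aws.ec2.instance_id": "i-0abc123",
    "aws.ec2.state": "stopped",
    "kubernetes.pod": "checkout-7f9c",
}

def subtree_values(attributes, prefix):
    """Return every attribute under a prefix, regardless of its exact name."""
    return {k: v for k, v in attributes.items() if k == prefix or k.startswith(prefix + ".")}

# A rule like `any attribute under "aws" equals "stopped"` needs no per-attribute tagging.
aws_subtree = subtree_values(event_attributes, "aws")
print(any(v == "stopped" for v in aws_subtree.values()))   # True
```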


Leverage automatic NLP classification to determine alert symptoms

One of the most unexplored fields when it comes to incident management is natural language understanding and processing. If you think about it, the most useful information in incident data is actually text based. Because of that, it is a perfect candidate for NLP techniques for classification and reasoning. SignifAI’s NLP classifier runs machine learning algorithms over all of your incoming data to determine the best-matching predefined classes for each event, and then exposes those classes in the Decision engine to provide context for stronger correlations.

Here is one example: the classes and subclasses that SignifAI automatically identifies are “symptoms” of your production system, with the highest-level classes defined by the “4 golden signals” described in the Google SRE book, plus one more (availability):

  • Errors

  • Load

  • Latency

  • Saturation

  • Availability

We keep training the models on an SRE/DevOps incident dataset, so you don’t need to do any tagging or configuration work to use these classes. They’re available automatically in the Decision builder and in Suggested Decisions.
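
As a rough illustration of the general approach (not SignifAI's actual model or training data), a tiny text classifier over these five symptom classes might be sketched with scikit-learn like this:

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-made training set; a real model would be trained on a large
# labeled corpus of SRE/DevOps incidents.
texts = [
    "HTTP 500 rate above 5% on checkout service",
    "read timeout while calling payments API, p99 at 4s",
    "disk usage at 92% on db-3, approaching capacity",
    "request rate doubled after release, queue depth growing",
    "health check failing, service unreachable from us-east-1",
]
labels = ["errors", "latency", "saturation", "load", "availability"]

classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                           LogisticRegression(max_iter=1000))
classifier.fit(texts, labels)

# Predict the symptom class for a new, unseen incident description.
print(classifier.predict(["p95 latency climbing on the search endpoint"])[0])
```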

Conclusions

There is so much more to write about, and I'm going to continue in the coming weeks. In the meantime, here are the most critical points as a takeaway:

  • Machine intelligence for SREs must first take the expert user into account by providing the complete flexibility and broad capabilities that technical users expect.

  • A combination of approaches is required to achieve good results. Betting on a single approach is probably not going to cut it for the SRE use case.

  • Explainability of the models and of the automatic capabilities is essential to gain the user's trust and engagement. Being able to show and explain why a specific action was taken, which requires both control over the model and involvement of the expert user, is key.

  • Categorical classification, combined with multiple similarity algorithms, NLP approaches, and supervised and unsupervised ML models running in parallel and over time, is a combined approach that provides immediate value to the user and allows improvement over time based on historical and real-time data.

One last point: I have always admired John Allspaw for this blog post. I think it is very accurate that software will assist a team of expert humans who will always be smarter than generic code. However, I also believe that the market today is hardly taking the right approach when it comes to the specific SRE use case. It is a very sophisticated use case and a significant problem to try to solve, but not many AI/ML data scientists and researchers have the depth of understanding of the user and the use case, so we find ourselves with generic intelligent models that are hardly relevant to the specific use case.

We are here to change that.

DevOps Tools - Something is Still Wrong...

Although companies have invested heavily in automating specific portions of their DevOps pipeline in the last 10 years – everything from build and test to log management, performance monitoring and incident management – the common refrain we hear is that, despite being able to ship more code, faster and with fewer defects, their system and application availability has not moved dramatically in a positive direction. Why is that?

More often than not, this is because the engineers that are more “Ops” than “Dev” in an organization (those actually responsible for running code in production) are constantly “fighting fires.”

Why are teams “fighting fires” vs working on tasks that could dramatically improve uptime? They are hampered by the inefficiencies introduced by the very DevOps tools that are supposed to help them. A lack of integration, visibility and correlation between the tools forces engineers to waste time on tasks that could be automated, but they lack the tools to make it so.

More specifically, here are a few examples from a typical incident management workflow where there are plenty of opportunities to automate:

  • Dealing with “alert noise”.

  • Searching for and correlating the relevant logs, events and metrics in order to create impactful solutions based on informed root cause analysis.

  • Searching for known solutions you know are applicable to the current issue but just can’t locate them efficiently.

  • Analyzing data, patterns and behaviors to understand what sort of preventive fixes could be implemented today that would dramatically improve uptime tomorrow.

Alert noise

First, “what is alert noise?” Alert noise is that thing you experience when you see your Slack channel or phone clogged with informational, test, duplicate, unprioritized and irrelevant alerts.

To easily understand how “alert noise” affects system and application availability answer this simple question:

What percentage of the alerts do you receive on a daily basis that you end up immediately dismissing or spend some time debating whether or not to ignore?

Now, do the math. If an SRE on staff spends even 20 mins a day acknowledging, prioritizing, researching and ultimately dismissing alerts, that’s at least 87 hours a year wasted. Multiply that by the number of SREs on staff and ask yourself, “What could an engineer be working on during those 87 hours if it wasn’t spent ignoring or dismissing alerts?” The answer is likely, “Working on the things we know will dramatically improve our uptime, but we simply don’t have the time to work on because we are constantly fighting fires.”

What can be done to reduce alert noise?

One option, which a few incident management tools recommend, is to manually analyze your alert history and then create and maintain an exhaustive rules engine. This is definitely not a “set it and forget it” solution; instead, it is one that requires constant tweaks to account for changes to your environment in order to consistently reduce alert noise.

Another option is to leverage some sort of an AIOps platform that makes use of machine learning and AI to analyze your alerts historically and in real-time to automatically correlate, roll-up and prioritize your alerts and events. Because the algorithms employed constantly adapt to changes in your incident workflow, there’s no need to constantly be tuning a rules engine.

At the end of the day, your team is already attempting to reduce alert noise in one way or another, so why not let a machine do it more efficiently and more accurately so you can focus on more important issues?

The problem with root cause analysis

The next area ripe for automation in the incident management workflow is the time spent searching for and correlating all the relevant logs, events and metrics that pertain to an alert. This process is vital for conducting informed root cause analysis. Why? Because ideally you want to put impactful, permanent solutions into production rather than temporary fixes. Doing so requires getting full context and as much visibility as possible into all the relevant data that pertains to the alert. Engineers are generally not satisfied with “bounce the server and hope the problem goes away” solutions and would rather invest the time to do proper root cause analysis.

However, when you are busy fighting fires, the luxury of performing detailed root cause analysis isn’t always possible. As tempting as it is to just restart a service or redeploy a container with additional resources, what if you could easily determine that the true cause of an issue was a misconfiguration in Ansible or poor test coverage in Jenkins? These types of solutions reach back further into the DevOps toolchain to address problems that will sooner or later manifest themselves again.

AIOps platforms make it easy to get all the relevant data (whether logs, events or metrics) inside the alert itself, so you don’t have to context switch between tools, download files, or cut and paste data into spreadsheets or a doc for analysis.

So far we’ve managed to save time by reducing alert noise and automating the process of getting all the relevant data into the alert itself – what’s next? How about saving the time spent searching for solutions you know you’ve implemented in the past that could be used to solve the alert you are actively triaging?

Avoiding “reinventing the wheel”

One of the things all engineers are stereotypically bad at is documentation. SREs are no different when it comes to documenting the solutions they’ve deployed. Even when solutions are documented, they are often in a format that doesn’t lend itself to being easily indexed, searched and surfaced for reuse at a later date.

Ask yourself:

“How often have I wasted time hunting down solutions I know exist but are ‘hidden’ in a Slack conversation, email thread, wiki, runbook or, worse, someone else’s head?”

Here is an example of how SignifAI makes it easy to document solutions using a wizard-driven knowledge base creation system.

[Screenshot: wizard-driven knowledge base creation in SignifAI]

SignifAI automatically correlates solutions from the past or industry best practices with alerts happening in real-time and surfaces them with a probability score.

[Screenshot: recommended solutions surfaced with a probability score]

Machine learning is only going to be as good as the data set it has to work with (think logs, events and metrics vs. just metrics) and how much supervised learning can be applied to the recommendations the algorithms produce. SignifAI makes it easy for users to vote any recommended solution up or down and provide feedback on it. Ultimately, this means that the more you interact with the recommendation engine, the greater the accuracy of recommended solutions in the future. More accuracy means less time spent looking for solutions you know exist.

Become proactive vs reactive in your approach to uptime

So, at this point, if you’ve succeeded in reducing alert noise, the time spent gathering data, and the time spent hunting down fixes… what are you going to do with all this extra time you’ve gained?

“If we had the time to rearchitect our application, 80% of our problems would go away.”

“If we had the time to break up our monolithic service into microservices, we could minimize the impact of problems when they occur.”

When engineers are given additional time in their day, they will inevitably want to work on creative and interesting problems that have a far greater chance of actually improving uptime than the status quo of “fighting fires” ever will.

I believe that AI and machine learning are the perfect technologies to help automate the tasks your team takes no pleasure in doing and that, at the end of the day, are not a good use of their time.