Turning The Page: News & Reflections

With the start of 2019 comes an exciting announcement: SignifAI is joining New Relic! Our teams are pumped to work together on our shared vision of bringing machine intelligence to DevOps and SRE teams. You can learn more about the acquisition, what’s coming next, and New Relic’s perspective in this blog post.

As we’ve been taking the first steps of this new journey and working with the New Relic team on the direction for our team and technology, I’ve been reflecting on how SignifAI has evolved throughout the past three years. Here are some thoughts about the past, the future, and what we’ve learned...

SignifAI - The Beginning

The idea for SignifAI originated from experiences my ops team at a prior company had as we scaled our infrastructure to keep up with a quickly growing product. We found ourselves searching for a tool that could cut through the noise of a complex monitoring stack and show us only the things that were most important, in a way that felt as natural and insightful as though we’d done all the manual investigation ourselves. At that time (almost 6 years ago), machine learning was a buzzword used only in academia, the concept of AIOps was nonexistent, and the industry’s shiny new slogans were BigData and Hadoop.

I remember spending days searching for solutions, as I had a fairly large budget, and couldn’t find any platform or tool that did what I had envisioned. I was frustrated. How could that be? I started deeper technical research into areas I believed could help solve our problem, and along the way did a massive amount of reading, research, and hands-on experimentation. I learned a lot about data pipelines and ingestion, real-time analysis, MapReduce jobs, distributed systems, clustering, and many different algorithms. I focused mainly on solutions from completely different domains, to learn from them and see whether they could apply to operations.

After tons of research and experimentation with real, high-volume production data, lots of failure, and lots of learning, we started to build the first version of SignifAI. In the beginning, it was a core expert system integrated into an automated pipeline with some logic, some automation, and some minimal ML on events and time-series data. By early 2014, we had a core system (completely command-line based, with a bunch of config files) finding correlations in our monitoring data and showing us only the most important issues. After experiencing that technical success, I had an internal conviction: we had achieved something great here, and we could take the same ideas and create a SaaS-based platform to help other teams.

I wasn’t kidding myself - I knew there was still a lot to experiment with and tons of work to build a real product. But that spark of belief - the passion for solving a large and difficult problem for other teams in a relatively generic way, and the proof points we had achieved - was what motivated me to rebuild it as an independent product. SignifAI was born with a strong purpose: to empower other SRE and DevOps teams with powerful technology and an open-platform approach. We were (and still are) on a mission to change the way digital businesses analyze and understand their production environment’s uptime, reliability, and availability. Delivered as a SaaS-based machine intelligence platform and connected to your existing tools and workflows, we help Site Reliability Engineers maximize their day-to-day service level objectives.

As SignifAI grew, we developed the features that shaped the core value of the platform: integrations for over 60 sensors, built on our Active Collection approach rather than a webhook-only, passive one; the Control Center, a single pane of glass for monitoring across your full stack; Teams, empowering our enterprise users with more fine-grained control; and Decisions, which let teams understand and customize the logic that drives correlated Issues in the platform. We launched special offerings for teams using Prometheus and OpenShift. We released Chewie, a streamlined version of the platform that plugs directly into teams’ existing incident management services.

Through this journey, we learned more than we could have imagined about the quickly-changing world of Site Reliability Engineering and the elements of a successful product.

What We’ve Learned

In the past 3 years, we’ve talked to hundreds of DevOps and SRE teams about their experiences: the stress of being on-call, struggles to maintain an infrastructure stack that constantly becomes more complex, victories in proactive problem-solving, and hopes for the future of IT operations. There have been many lessons along the way - here are three of them:

Attack shared frustration

Every team we worked with is different in the tools they use, their authority and responsibility structure, and the way they measure success, but they all share the same vision: a streamlined, highly automated system that minimizes stressful and costly issues and enables engineers to do proactive, sustainable, creative work.

The greatest challenge we had in developing the SignifAI platform was designing features that were useful and accessible to many kinds of teams - everything from traditional NOC/ops groups to Google-esque, highly automated SRE environments. Our best successes came from prioritizing solutions to problems shared by every team, regardless of size or level of sophistication: easy-to-integrate sensors that didn’t take hours to set up, simple links between Issues and communication/collaboration tools, a combined view for alerts from multiple tools, and so on. SignifAI’s most-loved features worked towards the shared SRE vision. That might seem obvious, but it wasn’t. There were tons of tradeoffs we needed to make in order to generalize as much as possible into a working solution that met the needs of the majority of our users and the industry. Remember, this is a fragmented market with a variety of solutions - third-party, open-source, and custom-built internal tools, each with different flavors and technical requirements.

Prioritize accessibility and understanding

All of the sophisticated tech in the world isn’t useful to our users unless they can see, understand, and tailor the system’s logic to fit their system. We learned that it was imperative to make sure every decision SignifAI made, from correlating and categorizing important Issues to suppressing or delaying noisy ones, was easy for users to understand and give feedback on. When we developed the Decision engine, we kept this core idea in mind: machine learning is only as useful as it is accessible. We see other solutions making the mistake of considering the algorithms and models smarter than the person on call. Pretty early on, we set one of our core values to view our platform as an augmented team member and as an extension to humans - not a replacement. We have also learned that the use case we are solving for is simply not one that can tolerate time-consuming model training, and we could not expect the first responders during an on-call shift to focus on training and adjustments.

Don’t be afraid to adapt and change

Like any startup, SignifAI went through a series of changes as we worked to find a suite of features that served our customers’ needs. One of the major lessons we learned from working with customers is that for teams who already have established workflows around incident management, triaging, and on-call collaboration, introducing a new tool that sits “in the middle” of the stack can be incredibly difficult and requires a lot of trust and training. Our core platform, which uses sensors to connect to monitoring tools and pushes correlated Issues to a triaging tool, enabled us to gather the most (and most useful) data from customers’ systems without needing tons of configuration. However, the workflow change introduced by adding the platform didn’t work for every team. That’s why we introduced Chewie, a solution that plugs directly into teams’ existing incident management tools.

Adapting and introducing new ideas can feel terrifying, like giving up on a dream - but the most powerful lesson I’ve learned throughout SignifAI’s journey has been that adaptation is the greatest opportunity to create something better than you could have imagined. Instead of treating changes as “letting go,” treat them as chances for creative freedom and a more open mind. As long as you’re still working in service of your big-picture vision, changes are healthy and could represent a breakthrough for your product.

Looking Forward

Joining the New Relic team is incredibly exciting - I can’t wait to see what comes from our shared ambition of creating a platform that increases automation, truly understands problems, infers reasoning, and suggests solutions using Applied Intelligence. There is a lot to write about the decision we made, the reasons we chose to join New Relic, and what has changed over time, but one thing is for sure - our belief and vision remain the same. When I picture the future of IT ops, I hope for more solutions like SignifAI and New Relic that help reduce the anxiety of being on-call and empower SREs to do their best work every day. We are committed to continuing to push ourselves and our shared product to get there as fast as possible.

A Deeper Look into Machine Learning Algorithms & Natural Language Understanding for Site Reliability Engineers

We recently introduced SignifAI Decisions – the most flexible, intuitive correlation engine for SRE and DevOps teams. Beyond activating automatically generated logic and building basic Decisions to reduce alert noise, there is a whole host of features that help create stronger correlations and give you deeper insight into your production system. In this post I'll provide a broad set of examples and principles for machine intelligence, and for how I believe anyone should think about solving the SRE use case with advanced intelligence capabilities. My examples are given using the SignifAI platform, but you could take the same principles, explore other solutions, and even implement them as a standalone proprietary solution. My goal is to provide more information and context on the way I believe the market needs to start tackling these hard problems.

Consolidate events with anomaly detection

One of the first aspects to consider when evaluating multiple events is time and volume. The narrower the time window for evaluating those events, and the more control you have over how repeats are treated, the more precise your correlations can be. In the advanced mode of SignifAI’s Decision builder, you can specify a timeframe and a minimum number of incidents for a Decision. The timeframe is the maximum amount of time between incoming incidents for them to be correlated based on the logic you specify in the builder. Specifying a shorter timeframe for broader, more generic logic and a longer timeframe for very narrow logic helps ensure the accuracy and relevance of your correlations.

You can also specify the minimum number of incidents that need to match the Decision logic before being correlated. This is useful in cases where a large number of incoming incidents, or a spike in the average incident volume, would indicate a correlation – for example, in the event of a datacenter outage, when several outage or unresponsive incidents could be created for totally separate applications that are all dependent on the same datacenter.
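
To make the mechanics concrete, here is a minimal sketch (not SignifAI's actual implementation) of how a time window plus a minimum-incident threshold might gate a correlation. The Incident fields and the matches predicate are hypothetical stand-ins for a Decision's attribute logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List

@dataclass
class Incident:
    id: str
    source: str
    timestamp: datetime

def correlate(
    incidents: List[Incident],
    matches: Callable[[Incident], bool],  # hypothetical stand-in for the Decision's attribute logic
    time_window: timedelta,               # max gap between consecutive matching incidents
    min_incidents: int,                   # minimum number of matches before an Issue is created
) -> List[List[Incident]]:
    """Group matching incidents into correlated Issues (a simplified sketch)."""
    issues: List[List[Incident]] = []
    current: List[Incident] = []
    for inc in sorted((i for i in incidents if matches(i)), key=lambda i: i.timestamp):
        # Close the current group if this incident arrived too long after the previous one.
        if current and inc.timestamp - current[-1].timestamp > time_window:
            if len(current) >= min_incidents:
                issues.append(current)
            current = []
        current.append(inc)
    if len(current) >= min_incidents:
        issues.append(current)
    return issues

# e.g. correlate(all_incidents, lambda i: i.source == "nagios",
#                time_window=timedelta(minutes=10), min_incidents=3)
```

With a ten-minute window and a minimum of three incidents, a burst of related incidents becomes one Issue, while isolated one-off alerts are left alone.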


Change Issue priority based on a correlation event

Individual incidents usually have a priority specified by their source – for example, outage events may be classified as “critical,” while minor blips in performance could be “medium.” However, when several incidents are correlated, the meaning of the event can change, and you may want the priority of the correlated Issue to be updated automatically to reflect that insight. For example, a spike in low-priority incidents could indicate an underlying, higher-priority problem. In the Decision builder, you can specify the priority of Issues that are correlated based on your logic – by default, SignifAI uses the highest priority among the incoming incidents.
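
As a small sketch of that default behavior (with hypothetical priority labels, since the platform's actual labels may differ):

```python
# Hypothetical priority labels and ranking; the platform's actual labels may differ.
PRIORITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def issue_priority(incident_priorities, override=None):
    """Priority for a correlated Issue: an explicit override set in the Decision,
    otherwise the highest priority among the correlated incidents (the default)."""
    if override is not None:
        return override
    return max(incident_priorities, key=lambda p: PRIORITY_RANK[p])

print(issue_priority(["low", "low", "medium"]))                # 'medium'
print(issue_priority(["low", "low", "low"], override="high"))  # 'high'
```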

Specify the right measure of similarity for your use case

A similarity algorithm that works perfectly for short strings, like healthcheck names or event sources, could be counterproductive for comparisons between longer blocks of text (e.g., descriptions).

Here are some of the most useful similarity algorithms we have seen work well on real-world production data:

The Levenshtein distance, also known as edit distance, between two strings is the minimum number of single-character edits to get from one string to the other. Allowed edit operations are deletion, insertion, and substitution. Some examples:

  • number/bumble: 3 (number → bumber → bumblr → bumble)

  • trying/lying: 2 (trying → rying → lying)

  • strong/through: 4 (strong → trong → throng → throug → through)

Some common applications of Levenshtein distance include spelling checkers, computational biology, and speech recognition. The default similarity threshold for new SignifAI Decisions is an edit distance of 3 – that can be changed in the Advanced mode of the Decision builder.
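
Here is a standard dynamic-programming implementation of Levenshtein distance, included as an illustrative sketch rather than SignifAI's internal code; it reproduces the three examples above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]

assert levenshtein("number", "bumble") == 3
assert levenshtein("trying", "lying") == 2
assert levenshtein("strong", "through") == 4
```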

The Jaro-Winkler distance - this metric uses a scale of 0 to 1 to indicate the similarity between two strings, where 0 is no similarity (no matching characters between the strings) and 1 is an exact match. Jaro-Winkler similarity takes into account:

  • Matching: two characters that are the same and in similar positions in the strings. 

  • Transpositions: matching characters that are in different sequence order in the strings.

  • Prefix scale: the Jaro-Winkler distance is adjusted favorably if strings match from the beginning (a prefix is up to 4 characters). 

This is a useful metric for cases where identical prefixes are a strong indication of correlation.
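
For illustration, here is one common way to compute Jaro and Jaro-Winkler similarity from scratch – a sketch, not necessarily the exact variant used in the platform.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 0.0 (no matching characters) to 1.0 (exact match)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    s1_matched = [False] * len(s1)
    s2_matched = [False] * len(s2)

    # Count characters that match within the search window.
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not s2_matched[j] and s2[j] == c:
                s1_matched[i] = s2_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count transpositions: matched characters that appear in a different order.
    transpositions, j = 0, 0
    for i, c in enumerate(s1):
        if s1_matched[i]:
            while not s2_matched[j]:
                j += 1
            if c != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2

    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score when the strings share a common prefix (up to 4 characters)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)

# The prefix boost in action (made-up alert names): the pair sharing the "db-p" prefix scores higher.
print(jaro_winkler("db-primary-cpu", "db-primary-mem"))   # ~0.91
print(jaro_winkler("cpu-db-primary", "mem-db-primary"))   # ~0.80
```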

The Longest Common Subsequence (LCS) distance - a variation of Levenshtein distance with a more limited set of allowed edit operations. The LCS distance between two strings is the number of single-character insertions or deletions needed to change one string into the other. LCS is most useful in situations where the characters that belong to both strings are the most important – in other words, if there are a lot of “garbage” or noisy characters in your strings, LCS is a useful metric, since it concentrates on the shared characters to determine similarity.
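
A sketch of LCS distance via the classic longest-common-subsequence recurrence; the example strings in the final comment are made up for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_distance(a: str, b: str) -> int:
    """Edit distance when only insertions and deletions are allowed:
    everything outside the longest common subsequence must be removed or added."""
    return len(a) + len(b) - 2 * lcs_length(a, b)

# "Garbage" characters around a shared core still leave the core intact:
# lcs_length("ERR!!cpu.load##", "cpu.load (warn)") counts the shared "cpu.load".
```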

The Hamming distance - a simpler version of “edit distance” metrics like Levenshtein distance, the Hamming distance between two strings is the number of substitutions required to turn one string into the other. It is computed by counting the number of positions at which the two strings have different characters. For example, in the strings below, the Hamming distance is 2 – the fourth characters (“w” vs. “r”) and the sixth characters (“r” vs. “t”) differ:

                           flowers/florets

Hamming distance requires the compared strings to be of equal length. This is a useful similarity metric for situations where the difference between two strings may be due to typos, or where you want to compare two attributes with known lengths (e.g., an incident ID, or an instance host or Kubernetes pod name that follows a specific convention).
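
Hamming distance is the simplest of these to implement; a minimal sketch:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

assert hamming("flowers", "florets") == 2   # 'w' vs 'r' and 'r' vs 't'
```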

Cosine distance - this is another similarity metric we use in our Decision engine to correlate related Issues. It’s most commonly used to compare large blocks of text (for example, incident descriptions) and provides an easy way to visualize similarity. To obtain the cosine distance between two text blocks, a vector is calculated for each block representing the count of each unique word in the block. The cosine of the angle between the two resulting vectors is the similarity between them. For example, take these two strings:

            It is not length of life, but depth of life.

            Depth of life does not depend on length.

Here are the word counts for these sentences:

word      sentence 1   sentence 2
it            1            0
is            1            0
not           1            1
length        1            1
of            2            1
life          2            1
but           1            0
depth         1            1
does          0            1
depend        0            1
on            0            1

…and here are those counts represented as vectors:

[1, 1, 1, 1, 2, 2, 1, 1, 0, 0, 0]
[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]

The cosine of the angle between these vectors is about 0.66 (an angle of roughly 49 degrees, or 0.85 radians). A cosine similarity of 1 indicates the highest similarity.

As you can see, this metric is less useful in situations where small character differences between words should be insignificant, e.g., typos – “nagios” and “nagois” would be treated as completely different words.

Also, cosine distance ignores word order in the text blocks – for example, these two sentences have a cosine similarity of 1 even though they read completely differently:

this is the first of the sentences, it is unscrambled

the sentences is unscrambled, the first of this it is
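
Here is a minimal bag-of-words cosine similarity sketch (naive regex tokenization, no stemming or stop-word handling, and not SignifAI's implementation); it reproduces the ~0.66 cosine for the “length of life / depth of life” example earlier in this section.

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two text blocks.
    1.0 means the same words in the same proportions; 0.0 means no words in common."""
    counts_a = Counter(re.findall(r"[a-z']+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z']+", text_b.lower()))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

s1 = "It is not length of life, but depth of life."
s2 = "Depth of life does not depend on length."
print(round(cosine_similarity(s1, s2), 2))   # 0.66
```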

One of my favorites is the Fuzzy Score. Between two strings, a high fuzzy score indicates high similarity. The fuzzy score algorithm works by allocating “points” for character matches between strings:

  • One point for each matching character

  • Two bonus points for each match that immediately follows the previous matching character (a consecutive match)

Note that there are multiple implementation options for the fuzzy score; the above is the most basic one. We have implemented several more sophisticated fuzzy algorithms (inspired by SeatGeek) that work much better on incident text.

Example: SignifAI / sgnfai

  • Individual character matches: s, g, n, f, a, i – 1 point each = 6 points

  • Consecutive matches: gn, fa, ai – 2 bonus points each = 6 points

  • Total: 12 points

The fuzzy score is most useful for relatively short strings, and when the matching characters follow non-trivial patterns or appear in a specific order.
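
A sketch of the basic scoring described above – one of several possible fuzzy-score variants, not necessarily the one the platform uses:

```python
def fuzzy_score(term: str, query: str) -> int:
    """Basic fuzzy score: 1 point for each query character found (in order) in the term,
    plus 2 bonus points whenever a match is adjacent to the previous match."""
    term, query = term.lower(), query.lower()
    score = 0
    previous_index = None
    search_from = 0
    for ch in query:
        index = term.find(ch, search_from)
        if index == -1:
            continue                 # character not found: no points
        score += 1
        if previous_index is not None and previous_index + 1 == index:
            score += 2               # consecutive-match bonus
        previous_index, search_from = index, index + 1
    return score

assert fuzzy_score("SignifAI", "sgnfai") == 12  # s, g, n, f, a, i (+ bonuses for gn, fa, ai)
```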

The SignifAI Decision engine gives you the flexibility to choose the right measure of similarity for your logic (in addition to the automatic capabilities), with many options for different use cases, and it provides active suggestions to help you choose, based on simulations running on top of your actual past data. Simulating changes against real data is critical: it's the only way to show how a change in a particular algorithm or ML model actually makes a difference to your data. I deeply believe that full model control may be less important to the end user (and in the case of deep learning models it is sometimes not even possible), but providing explainability, reasoning, and simulation capabilities to the expert user is very important for gaining trust, especially in the Site Reliability Engineering use case. I often get comments like this:

"Well, this is all great and I have no clue what a Fuzzy score means, but I would want to play with the data and change the different options to verify it makes sense before I could completely trust the system...And if I know the Automatic suggestions are using the same models and algorithms, explaining it to me in simple terms would allow me to just trust it as is..."


Compare entire incidents with multiple algorithm options, including machine learning

With SignifAI Decisions, you can create broad logic at the entire incident scale that works seamlessly with your attribute-specific rules. Choose between two correlation metrics:

Jaccard similarity – you can compare the similarity of entire incidents based on keyword matches. The Jaccard index, sometimes referred to as the Jaccard similarity coefficient, is one of the simplest measures of similarity to understand – the index, expressed as a percentage (0 being completely dissimilar; 100 being very similar), is calculated with the following formula:

(# of elements in both sets) / (# of elements in either set) * 100

In other words, the Jaccard index is the number of shared elements divided by the total number of elements (shared and un-shared). In the SignifAI Decision builder, you can use Jaccard similarity to compare entire incidents and set a similarity threshold (10 – 99%).

Jaccard similarity is very easy to interpret and is especially useful with large data sets – for example, when comparing the similarity of two entire incidents (as opposed to a single attribute). It is less effective for small data sets or situations with missing data. Read more about the implementation of Jaccard distance here.
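
As an illustration, here is a token-based Jaccard similarity expressed as a percentage; the tokenization is a naive whitespace split and the incident strings are made up for the example.

```python
def jaccard_similarity(incident_a: str, incident_b: str) -> float:
    """Jaccard index over the sets of tokens in two incident payloads,
    expressed as a percentage (0 = no shared tokens, 100 = identical sets)."""
    tokens_a = set(incident_a.lower().split())
    tokens_b = set(incident_b.lower().split())
    if not tokens_a and not tokens_b:
        return 100.0
    shared = tokens_a & tokens_b
    total = tokens_a | tokens_b
    return 100.0 * len(shared) / len(total)

a = "CRITICAL nagios cpu load high on host web-01"
b = "WARNING nagios cpu load high on host web-02"
print(round(jaccard_similarity(a, b), 1))   # 6 of 10 unique tokens shared -> 60.0
```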

Categorical clustering – this option uses a machine learning algorithm to continuously determine clusters of similar events among all incoming data. In order for incidents to be correlated based on this operator, they must satisfy two conditions: 1) They must be in the same cluster, and 2) The distance between each incident and the center of the cluster must be less than the threshold distance.

Let’s break those down a little further. The algorithm continuously categorizes incidents into clusters based on an intelligent combination of attribute values and time series data. The center of each cluster is the incident which represents the mode (highest frequency of common attributes) of all the values in the cluster, and the distance from the center indicates how similar each incident is to that mode. In this visualization, each color represents a different cluster:

[Cluster visualization: incidents 1-4 plotted by similarity, with each color representing a different cluster; incident 4 falls in a separate cluster.]

The threshold you set in the rule builder is normalized to represent the distance between an individual incident and the center of its cluster. A high threshold represents a short distance (high correlation) between two incidents in a cluster. 

In the following example, let’s assume incidents 1-4 match all of your attribute-specific logic (e.g., source equals nagios). Using categorical clustering, if you set a threshold of 95%, no incidents would be correlated. A threshold of 75% catches incidents 1 and 2, and a threshold of 0% catches incidents 1, 2, and 3. Incident 4 would never be correlated, since it’s not in the same cluster.

Categorical clustering is a great approach for incidents that may look fairly different at the attribute-specific level, but are related at the entire incident/time series data level. For example, incidents 2 and 3 may not be correlated by a high-threshold Jaccard distance, but could be correlated with categorical clustering.

You can learn more about categorical clustering algorithms here.
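
To make the “mode center” and threshold idea concrete, here is a deliberately simplified, k-modes-style sketch over categorical attributes only. The real engine also incorporates time-series data and continuously re-clusters; the attribute names below are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Attrs = Dict[str, str]   # an incident's categorical attributes, e.g. {"source": "nagios"}

def cluster_center(cluster: List[Attrs]) -> Attrs:
    """The 'mode' incident: for every attribute, the most frequent value in the cluster."""
    keys = {k for incident in cluster for k in incident}
    return {k: Counter(incident.get(k) for incident in cluster).most_common(1)[0][0]
            for k in keys}

def similarity_to_center(incident: Attrs, center: Attrs) -> float:
    """Fraction of the center's attributes this incident agrees with (0.0 to 1.0)."""
    if not center:
        return 0.0
    return sum(incident.get(k) == v for k, v in center.items()) / len(center)

def correlate(incident: Attrs, cluster: List[Attrs], threshold: float) -> bool:
    """Correlate only if the incident is close enough to the center of this cluster."""
    return similarity_to_center(incident, cluster_center(cluster)) >= threshold

cluster = [
    {"source": "nagios", "check": "cpu_load", "region": "us-east-1"},
    {"source": "nagios", "check": "cpu_load", "region": "us-east-1"},
    {"source": "nagios", "check": "disk_io",  "region": "us-east-1"},
]
new_incident = {"source": "nagios", "check": "cpu_load", "region": "us-west-2"}
print(correlate(new_incident, cluster, threshold=0.6))   # True: 2 of 3 attributes match the mode
```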

Stay tuned for another post about deep clustering, another comparison algorithm that’s currently in private beta and implements a neural-net approach to automatic classification, trained specifically on incident data.


Automatically compare custom or dynamic attributes, no tagging required

With subtree logic in the SignifAI Decision builder, you don’t need to know the exact name of each event attribute – you can build flexible logic based on just a prefix string, like aws. There’s no need to spend hours tagging individual attributes, or to update individual Decisions if your event schema changes – it just works. This aspect of designing an intelligent system is essential: we want the system to be usable from the start, and when it comes to digital infrastructure, systems and architectures keep changing and dynamically adapting to business needs. It would be impossible to demand that users keep tagging, training, or otherwise informing an intelligent solution about those constant changes. A truly intelligent system puts the emphasis on its internal data modeling, learning from the data itself and requiring as little user involvement as possible. That’s not to say we dismiss user feedback, which can help form a better supervised model.
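
Conceptually, subtree matching can be as simple as a prefix test over flattened attribute names; here is an illustrative sketch with made-up attribute keys, not the platform's actual schema.

```python
def matches_subtree(incident_attributes: dict, prefix: str) -> dict:
    """Return every attribute whose (dotted) name starts with the given prefix,
    so Decision logic can reference e.g. 'aws' without knowing the full key names."""
    return {
        name: value
        for name, value in incident_attributes.items()
        if name == prefix or name.startswith(prefix + ".")
    }

incident = {
    "aws.region": "us-east-1",
    "aws.ec2.instance_id": "i-0abc123",
    "kubernetes.pod": "checkout-7f9d",
}
print(matches_subtree(incident, "aws"))
# {'aws.region': 'us-east-1', 'aws.ec2.instance_id': 'i-0abc123'}
```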


Leverage automatic NLP classification to determine alert symptoms

One of the least explored areas in incident management is natural language understanding and processing. If you think about it, the most useful information in incident data is actually text-based, which makes it a perfect candidate for NLP techniques for classification and reasoning. SignifAI’s NLP classifier runs machine learning algorithms over all of your incoming data to determine the best-matching predefined classes for each event, and then exposes those classes in the Decision engine to provide context for stronger correlations.

Here is one example: the classes and subclasses that SignifAI automatically identifies are “symptoms” of your production system, with the highest-level classes defined by the “4 golden signals” described in the Google SRE book, plus one more (availability):

  • Errors

  • Load

  • Latency

  • Saturation

  • Availability

We keep training the models on an SRE/DevOps incident dataset, so that you don’t need to do any tagging or configuration work to use these classes. They’re available in the decision builder or in Suggested Decisions automatically.
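
To illustrate the idea (not SignifAI's actual models or training data), here is a toy text classifier that maps alert text to the five symptom classes, assuming scikit-learn is installed; the training examples are invented and far too small for real use.

```python
# A toy sketch of symptom classification, assuming scikit-learn is installed.
# The training examples and labels below are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

training_texts = [
    "HTTP 500 rate above 5% on checkout service",      # errors
    "exception count spiking in payment worker",        # errors
    "p99 response time exceeded 2s for api gateway",    # latency
    "request latency above SLO threshold",              # latency
    "request rate doubled after deploy",                 # load
    "traffic surge on ingress controller",               # load
    "disk usage at 92% on db-01",                        # saturation
    "memory utilization approaching limit on pod",       # saturation
    "health check failing, service unreachable",         # availability
    "endpoint down, connection refused",                  # availability
]
training_labels = [
    "errors", "errors", "latency", "latency", "load",
    "load", "saturation", "saturation", "availability", "availability",
]

classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigrams and bigrams of the alert text
    LogisticRegression(max_iter=1000),
)
classifier.fit(training_texts, training_labels)

print(classifier.predict(["p95 latency above 3 seconds on search service"]))
# -> most likely ['latency'] given this toy training data
```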

Conclusions

There is so much more to write about, and I'm going to continue in the coming weeks. In the meantime, here are the most critical points as takeaways:

  • Machine Intelligence for SREs must first take the expert user into account by providing the complete flexibility and broad capabilities that technical users expect.

  • A combination of approaches is required to achieve good results. Betting on a single approach is probably not going to cut it for the SRE use case.

  • Explainability of the models and automated capabilities is essential to gaining the user's trust and engagement. Being able to show and explain why a specific action was taken – a combination of controlling the model and involving the expert user – is key.

  • Categorical classification, combined with multiple similarity algorithms, NLP approaches, and supervised and unsupervised ML models running in parallel and over time, is a combined approach that provides immediate value to the user and allows improvements over time based on historical and real-time data.

One last point: I have always admired John Allspaw for this blog post. I think it is very accurate that software will assist a team of expert humans, who will always be smarter than generic code. However, I also believe that the market today is hardly taking the right approaches when it comes to the specific SRE use case. It is a very sophisticated use case and a significant problem to try to solve, but not many AI/ML data scientists and researchers have a deep understanding of the user and the use case, and so we find ourselves with generic intelligent models that are hardly relevant to the specific use case.

We are here to change that.

With Clarity Comes Focus: How to Reduce Your SRE Team’s Cognitive Load

With the success of the Google SRE book, more and more teams are transitioning from a traditional sysadmin/”ops team” model for managing production incidents towards the SRE ideal - an empowered, engineering-focused team that spends at least 50% of their time on creative work that helps teams move towards more automated and sustainable infrastructure. But cutting down to half of the current time your team spends fire-fighting and doing other traditional “ops” tasks isn’t easy - it requires bold long-term vision, strategic planning, and the right tools for your team. 

With this challenge in mind, a natural inclination for SRE leaders is to go searching for data - how rough is the situation right now? How do their teams measure up on common SRE KPIs, like mean time to understand (MTTU) and mean time to respond (MTTR) to incidents? How frequently are their team members woken up at night with an ops problem?

Luckily, this information is probably recorded in teams’ existing incident management platforms  - but busy SRE teams don’t have a ton of time to comb through it. This is why many AIOps and incident management tools have introduced complex analytics features, happily churning out histograms and pie charts for curious team members to peruse.

Although data visualization is important, in production systems with lots of noisy alerts, it may not be as helpful as we’d hope: Underlying issues can hide in “good” data.

Let’s use mean time to acknowledge as an example. A decrease in MTTA, which is easily determined from the per-incident statistics from your platform, could look like improvement for an SRE team. But maybe the reason for this drop is that some of your systems have started producing some flapping alerts, leading the team to just ack as soon as they’re paged - the SRE equivalent of subconsciously hitting the snooze button or mass-deleting emails. In order to find this truth, you’ll have to dig deeper into the data or talk to a team member.

It’s hard to maintain a single source of truth

Most modern production systems include a bevy of tools that all have their own definitions of “incidents” or “alerts,” making it tricky to understand analytics that include data points across multiple sources. For example, let’s look at a simple metric - the number of incidents a team sees per week. Does this include:

  • Every alert that triggers in a monitoring system?

  • Every incident a team member is paged for?

  • Repeats of the same alert/incident?

  • Alerts that resurface after a snooze period?

  • Multiple incidents that occur at the same time for one underlying issue?

  • Only incidents that actually require human action?

Chances are, analytics for the “number of incidents” are going to mean different things in each tool, and any one of those numbers may or may not be the most meaningful choice for a team.

Your team has the most valuable perspective

One of the foundational principles of Kaizen, the Japanese principle of continuous improvement, is that team members at all levels should be involved in the evolution of process. This is because even with the fanciest analytics platform in the world, the people doing day-to-day work still have the greatest insight into the hangups and struggles of the process. Using incident analytics to establish a baseline is a great idea, but the more nuanced steps toward SRE greatness don’t require fancier graphs - they just require talking to the team. 

A clearer picture

Built-in tools to help you visualize your data can be useful, but in a noisy production environment, an overload of graphs and charts will actually make life harder. An underlying reason for the potential issues with an analytics-first approach to SRE process improvement is that most production systems still have lots of noisy alerts.

Instead of looking for new ways to visualize the noise, what if there was a way to create greater clarity at the alert level first?

Through layers of machine learning-driven filters and correlation logic, SignifAI helps consolidate and streamline incidents across all of your platforms so your team only gets paged for what matters.

With this clarity and context, you’ll be able to answer questions about your team’s progress and production environment health much more quickly, and spend less time pushing through the noise.