Achieving Increased Uptime with Machine Intelligence

I am pleased to announce today the general availability of SignifAI’s machine intelligence platform! This release is the realization of the hard-learned lessons my colleagues and I have gathered over the last 10 years working in TechOps.

What is SignifAI?

SignifAI is a cloud-based machine intelligence platform that integrates, mostly over APIs, with 60+ monitoring tools (commercial and open source), including New Relic, AppDynamics, Splunk, Prometheus, Nagios and many more. It gives TechOps fast answers and predictive insights into the most critical issues that threaten system uptime. We accomplish this by combining machine intelligence with a TechOps team’s expertise, not generic learning algorithms.

Why SignifAI?

In our experience as TechOps engineers, despite having a variety of monitoring tools, we still operated in a highly reactive mode. Because we were constantly fighting fires, we never found the time to implement the processes and systems that would have allowed us to identify issues before they caused outages. Frustrated by the available solutions, we decided last year to take a new approach to uptime using machine intelligence. Specifically, the problems we wanted to solve were:

  • Siloed data: Monitoring systems with incompatible data formats made it challenging to make meaningful correlations across the application and infrastructure stack. Relating triggered events to multiple time-series streams in a meaningful way is not easy.

  • Being proactive in the collection phase: In the past, we have missed very important events from existing solutions. We have learned (the hard way) that waiting for events is not enough. Being proactive in the collection phase is a huge advantage, but it is hard and requires expertise.

  • Alert noise: Having lots of monitoring tools led to alert overload, making it hard to prioritize the most important issues.

  • Inefficient root cause analysis: Correlating large volumes of log, event and metrics data in real time was challenging even for the most experienced engineers.

  • Cumbersome knowledge capture: We often struggled with how to document and share “lessons learned” so that solutions from the past could be applied to currently open issues in an automated way. We always wanted something that could detect potential issues based on our previous experience.
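
To make the alert-noise problem above concrete, here is a minimal sketch of one common way to reduce it: collapsing alerts that share a host and check and arrive close together in time. This is an illustration only, not a description of how SignifAI works. The `Alert` fields, the `group_alerts` function and the 300-second window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert shape; real monitoring tools each use their own format.
    timestamp: float  # seconds since epoch
    host: str
    check: str
    tool: str


def group_alerts(alerts, window_seconds=300):
    """Group alerts with the same (host, check) that arrive within
    window_seconds of the previous alert in that group."""
    groups = []
    latest_group = {}  # (host, check) -> index of its most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.check)
        idx = latest_group.get(key)
        if idx is not None and alert.timestamp - groups[idx][-1].timestamp <= window_seconds:
            # Close enough to the previous alert: treat as the same incident.
            groups[idx].append(alert)
        else:
            # First alert for this key, or the gap is too long: start a new group.
            latest_group[key] = len(groups)
            groups.append([alert])
    return groups


alerts = [
    Alert(0, "web-1", "cpu", "nagios"),
    Alert(60, "web-1", "cpu", "newrelic"),
    Alert(90, "db-1", "disk", "nagios"),
    Alert(1000, "web-1", "cpu", "nagios"),
]
groups = group_alerts(alerts)
```

In this example, the two `web-1` CPU alerts reported by different tools within a minute of each other end up in one group, while the later CPU alert and the `db-1` disk alert each form their own.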

How SignifAI works

“A picture is worth a thousand words,” as they say, and I think the best way to understand exactly how SignifAI works is to see it in action. Check out this short two-minute demo that walks you through all the key features of the platform, including:

  • Collecting data with sensors

  • Responding to alerts

  • Performing root cause analysis

  • Capturing your expertise and knowledge

  • Working with predictive Insights and Answers™

In upcoming blog posts, I’m going to review in depth some of the technologies we are using, such as:

  • Our data pipeline: Focusing on active collection over APIs.

  • Algorithms: Some of the statistical outlier-detection algorithms we have decided to implement, especially for time-series data.

  • Event collection and processing: Our correlation engine, our decision and evaluation engine, and why our approach is so different and focused on the TechOps use case.

  • Open source: Our contributions to open source and why we have decided to adopt the Snap Open Telemetry Framework and extend it.
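
As a taste of what statistical outlier detection on time-series data can look like, here is a minimal sketch using a trailing-window median and median absolute deviation (MAD). It is a generic textbook technique chosen for illustration, not necessarily one of the algorithms SignifAI implements. The function name `mad_outliers`, the window size and the 3.5 threshold are assumptions made for this example.

```python
from statistics import median


def mad_outliers(values, window=10, threshold=3.5):
    """Return indices of points that deviate strongly from the
    preceding `window` points, using a modified z-score based on MAD."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = median(history)
        mad = median(abs(v - center) for v in history)
        if mad == 0:
            # Perfectly flat history: flag any departure from it.
            if values[i] != center:
                flagged.append(i)
            continue
        # 0.6745 scales MAD so the score is comparable to a standard z-score.
        score = 0.6745 * abs(values[i] - center) / mad
        if score > threshold:
            flagged.append(i)
    return flagged


# Hypothetical response-time samples in milliseconds with one spike.
latencies = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 100, 450, 102]
spikes = mad_outliers(latencies)
```

The median and MAD are used here instead of the mean and standard deviation because a single spike barely moves them, so the point right after the spike is not flagged by mistake.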

Getting started is quick, and enabling integrations with the monitoring tools you already use is simple. The typical time to the first predictive insight is less than 20 minutes.

SignifAI believes that when TechOps teams deliver more uptime, they can find the time to work on more complex problems that require creative solutions — precisely the things machines can’t do!

In closing, I would like to emphasise our commitment to the TechOps community. There are great open source monitoring and alerting solutions out there, as well as other awesome vendor tools. We are not on a mission to replace or consolidate those. We also believe that we should not capture your data and knowledge without giving you the ability to export it at any time. That’s why we have invested in full export functionality. Our belief and goal is to make our day-to-day as a community better by empowering people.

Machine intelligence is a very broad term, and we are well aware that it is often abused and overused. That is exactly why our first release is based on many years of experience and a combination of computational techniques, but puts the expert human at the center.