Achieving Increased Uptime with Machine Intelligence

I am pleased to announce today the general availability of SignifAI’s machine intelligence platform! This product release is the realization of the hard-learned lessons my colleagues and I have accumulated over our last 10 years working in TechOps.

What is SignifAI?

SignifAI is a cloud-based machine intelligence platform that integrates, mostly over APIs, with more than 60 monitoring tools (commercial and open source) such as New Relic, AppDynamics, Splunk, Prometheus, Nagios and many more. It gives TechOps fast answers and predictive insights into the most critical issues that threaten system uptime. We accomplish this by combining machine intelligence with a TechOps team’s expertise, not generic learning algorithms.

Why SignifAI?

In our experience as TechOps engineers, despite having a variety of monitoring tools, we still operated in a highly reactive mode. Because we were constantly fighting fires, we never found the time to implement the processes and systems that would have allowed us to identify issues before they caused outages. Frustrated by the available solutions, we decided last year to take a new approach to uptime using machine intelligence. Specifically, the problems we wanted to solve were:

  • No more siloed data: Monitoring systems with incompatible data formats made it challenging to make meaningful correlations across the application and infrastructure stack. Correlating triggered events with multiple time-series streams in a meaningful way is not easy.

  • Being proactive in the collection phase: In the past, we missed very important events even with existing solutions in place. We learned (the hard way) that waiting for events is not enough. Being proactive in the collection phase has huge advantages, but it is hard and requires expertise.

  • Alert noise: Having lots of monitoring tools leads to alert overload, making it hard to prioritize the most important issues.

  • Inefficient root cause analysis: Correlating large volumes of log, event and metrics data in real-time was challenging even for the most experienced engineers.

  • Cumbersome knowledge capture: We often struggled with how to document and share “lessons learned” so that solutions from the past could be applied to currently open issues in an automated way. We always wanted something that could detect potential issues based on our previous experience.
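The simplest form of the cross-tool correlation described above is grouping alerts about the same resource that fire close together in time. Here is a minimal, illustrative Python sketch (this is not SignifAI's engine; the field names, hosts and window size are made up for the example):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical raw alerts from different monitoring tools, normalized
# to a common shape (the field names here are illustrative only).
alerts = [
    {"source": "nagios",     "host": "web-1", "check": "cpu",  "ts": datetime(2017, 1, 1, 12, 0, 5)},
    {"source": "prometheus", "host": "web-1", "check": "cpu",  "ts": datetime(2017, 1, 1, 12, 0, 40)},
    {"source": "new_relic",  "host": "web-2", "check": "disk", "ts": datetime(2017, 1, 1, 12, 3, 0)},
]

def correlate(alerts, window=timedelta(minutes=2)):
    """Group alerts about the same host/check that fire within `window`."""
    groups = defaultdict(list)
    for a in sorted(alerts, key=lambda a: a["ts"]):
        key = (a["host"], a["check"])
        bucket = groups[key]
        if bucket and a["ts"] - bucket[-1][-1]["ts"] <= window:
            bucket[-1].append(a)      # same incident: extend the group
        else:
            bucket.append([a])        # new incident for this host/check
    return groups

incidents = correlate(alerts)
# The Nagios and Prometheus CPU alerts on web-1 collapse into one incident.
```

Real correlation has to handle clock skew, flapping and topology, but even this toy version shows how normalizing events into one schema makes deduplication across tools tractable.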

How SignifAI works

“A picture is worth a thousand words,” as they say, and I think the best way to understand exactly how SignifAI works is to see it in action. Check out this short two-minute demo that walks you through all the key features of the platform, including:

  • Collecting data with sensors

  • Responding to alerts

  • Performing root cause analysis

  • Capturing your expertise and knowledge

  • Working with predictive Insights and Answers™

In upcoming blogs, I’m going to review in depth some of the technologies we are using such as:

  • Our data pipeline: Focusing on active collection over APIs.

  • Algorithms: Some of the statistical outlier-detection algorithms we have decided to implement, especially for time-series data.

  • Event collection and processing: Our correlation engine, our decision and evaluation engine, and why our approach is so different and focused on the TechOps use case.

  • Open source: Our contribution to open source and why we have decided to adopt the Snap open telemetry framework and extend it.
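The post doesn't reveal which outlier algorithms those upcoming blogs will cover, but to ground the idea, here is one common robust technique for time-series spikes: the modified z-score based on the median absolute deviation (MAD). This is only an illustrative sketch, not SignifAI's implementation:

```python
import statistics

def mad_outliers(series, threshold=3.5):
    """Flag points whose modified z-score exceeds `threshold`.
    The score uses the median absolute deviation (MAD), a robust
    alternative to the standard deviation; 3.5 is a common rule of
    thumb for the cutoff."""
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    if mad == 0:
        return []  # degenerate series: no spread to measure against
    return [i for i, x in enumerate(series)
            if 0.6745 * abs(x - med) / mad > threshold]

latencies_ms = [12, 13, 11, 12, 14, 12, 95, 13, 12]
print(mad_outliers(latencies_ms))  # the 95 ms spike stands out
```

Unlike a plain z-score, the MAD-based score is not dragged around by the outlier itself, which matters on short, noisy operational series.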

Getting started is quick and enabling integrations to the monitoring tools you already use is simple. The typical time to the first predictive insight is less than 20 minutes.

SignifAI believes that when TechOps teams deliver more uptime, they can find the time to work on more complex problems that require creative solutions — precisely the things machines can’t do!

In closing, I would like to emphasize our commitment to the TechOps community. There are great open source monitoring and alerting solutions out there, as well as other awesome vendor tools. We are not on a mission to replace and consolidate those. We also believe that we should not capture your data and knowledge without allowing you the ability to export it at any given time. That’s why we have invested in full export functionality. Our belief and goal is to make our day-to-day work as a community better by empowering people.

Machine intelligence is a very broad term, and we are well aware that it is often abused and overused. That is exactly why we are bringing to market a first release that is based on many years of experience and a combination of computational techniques, but which puts the expert human at the center.

10 Open Source Projects I Recommend as You Are Building Your Monitoring Stack

The open source tools and libraries reviewed below cover the different layers of the stack: from infrastructure and host-based metrics all the way to containers, logs, performance measurement, and specific instrumentations. These are our top ten recommendations, which says a lot given the plethora of open source tools and libraries available to DevOps engineers.

Data Visualisation – Grafana


Grafana offers a library of cool dashboards allowing you to build an at-a-glance view of your applications and infrastructure. It is most commonly used for visualizing time series data for infrastructure and application analytics. It is built for multi-person collaboration and allows you to easily share dashboards.

Grafana dashboards are highly configurable, offering features such as graph styling, drag-and-drop panels, template variable definition control, and full support for Elasticsearch query-based search. It also integrates with a number of other tools and data sources such as InfluxDB, Elasticsearch, Prometheus and more to give you a rich monitoring toolset. It’s easy to start using Grafana, as the basic installation requires only a client-side browser.

Serious Searching – Elasticsearch



Still the standard in our estimation, Elasticsearch is a RESTful, API-driven solution that offers intelligent and powerful search features. By utilizing Lucene as its core search engine, it provides you with a tool to tease out the data you really need from your analyses, quickly and accurately. Elasticsearch works by taking real-world entities and storing them as structured JSON documents. This makes your data available immediately for fast searching. It can take a little time and a steep learning curve to optimize, but it’s worth it.
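To make the "structured JSON documents" point concrete, here is a small sketch of a document and a Query DSL body of the kind you would POST to an index's `_search` endpoint over the REST API. The field names and values are made up for the example; only `json` from the Python standard library is used:

```python
import json

# A "real world entity" stored as a structured JSON document, the way
# Elasticsearch expects it (field names here are illustrative).
doc = {
    "service": "checkout",
    "level": "error",
    "message": "payment gateway timeout",
    "@timestamp": "2017-01-01T12:00:00Z",
}

# A Query DSL body: full-text match on the message field, combined
# with an exact-value filter on the level field.
query = {
    "query": {
        "bool": {
            "must":   [{"match": {"message": "timeout"}}],
            "filter": [{"term": {"level": "error"}}],
        }
    }
}

body = json.dumps(query)
print(body)
```

The `bool`/`must`/`filter` split is worth learning early: `filter` clauses are cacheable and don't affect relevance scoring, which is usually what you want for operational fields like `level`.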

Great Metric Views – Graphite




Having great performance for your massive-scale solutions is a top priority, and to get this high level of performance you need insight. This is where Graphite can really help. It is a tool with a long history that allows you to manage metrics and visualizations. Although it doesn’t collect the metrics itself, pretty much every observability open source library or tool out there supports Graphite. Once you have those metrics, Graphite offers a powerful way to visualize them. It is an ideal tool for customizing to individual environments, but it can have some challenges in scaling.
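Part of why "everything supports Graphite" is how simple its plaintext protocol is: each metric is one line of `<metric.path> <value> <unix-timestamp>` sent to Carbon (port 2003 by default). A minimal sketch, with the host/port and metric path as assumptions:

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol:
    '<metric.path> <value> <unix-timestamp>\\n'."""
    ts = int(timestamp if timestamp is not None else time.time())
    return "%s %s %d\n" % (path, value, ts)

def send_metric(path, value, host="localhost", port=2003):
    """Ship a single metric to a Carbon listener (2003 is Carbon's
    default plaintext port; host and port here are assumptions)."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))

line = graphite_line("web.app1.requests.count", 42, 1483272000)
print(line)
```

Because the wire format is one readable line per sample, you can emit metrics from a shell script, a cron job, or any language with a TCP socket, which is exactly why so many tools integrate with it.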

System Monitoring and Alerting Toolkit – Prometheus

Having a configurable and flexible toolkit for your system monitoring and alerting is essential for any DevOps professional. Prometheus is a kit built from multiple components, suitable for any numeric time series; it can store over a million time series in a single instance. It is fast to configure and get up and running, offering a GUI-based dashboard, an alert manager, and a command-line querying tool. It also supports multi-dimensional data collection. Prometheus is definitely the new Graphite replacement: with many more plug-ins and integration points, it should be at the top of your list if you are thinking about your next observability solution.
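The "multi-dimensional" part refers to labels: every sample carries key/value labels, and targets expose their metrics as a plain-text page (conventionally `/metrics`) that the Prometheus server scrapes. The sketch below renders that text exposition format by hand to show its shape; the metric name, labels and values are illustrative (in practice you would use an official client library rather than hand-rolling this):

```python
def exposition(name, mtype, help_text, samples):
    """Render metrics in Prometheus's text exposition format: a
    '# HELP' line, a '# TYPE' line, then one 'name{labels} value'
    line per labeled sample."""
    lines = ["# HELP %s %s" % (name, help_text),
             "# TYPE %s %s" % (name, mtype)]
    for labels, value in samples:
        label_str = ",".join('%s="%s"' % (k, v)
                             for k, v in sorted(labels.items()))
        lines.append("%s{%s} %s" % (name, label_str, value))
    return "\n".join(lines) + "\n"

page = exposition(
    "http_requests_total", "counter", "Total HTTP requests.",
    [({"method": "get", "code": "200"}, 1027),
     ({"method": "post", "code": "500"}, 3)],
)
print(page)
```

Those labels are what PromQL queries slice on (for example, summing request rates by `code`), which is the practical difference from Graphite's purely hierarchical metric paths.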

One Framework to Rule Them All – Sensu

The Sensu Monitoring Framework is a monitoring service and metric analysis system for servers, applications, and more. Written in Ruby, it also works with web applications in any programming language. It is a tool meant for consolidation, i.e. it offers a single platform for monitoring all of your company’s resources and even third-party APIs. It’s a pretty simple tool to set up and can be provisioned using Puppet and other popular configuration management tools.

Sensu was built to accommodate highly scalable cloud infrastructures, so it’s really easy to scale with your infrastructure. There are quite a few tutorials and good blog posts to help with Sensu configuration. Our friends from AppsFlyer also gave a nice talk on the topic.

Best of the Rest – Pyformance, Snap, Dropwizard, cAdvisor, OpenTracing

Pyformance

Pyformance offers a bunch of performance metric libraries in Python. It works with reporters such as hosted Graphite and Carbon. Pyformance is a simple toolset that offers a way to capture performance measurements and statistics. The documentation has useful sample code for use with the reporters.
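To show the registry/counter/timer pattern that Pyformance implements without depending on the package itself, here is a stdlib-only toy version. This is a stand-in for illustration: Pyformance's real API (a `MetricsRegistry` with counters, meters, timers and histograms, plus pluggable reporters) is richer than this sketch.

```python
import time
from collections import defaultdict

class MiniRegistry:
    """A toy registry mimicking the shape of a metrics library:
    named counters, plus a context manager that records timings."""
    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def count(self, name, n=1):
        self.counters[name] += n

    def time(self, name):
        registry = self
        class _Timer:
            def __enter__(self):
                self.start = time.perf_counter()
            def __exit__(self, *exc):
                registry.timings[name].append(time.perf_counter() - self.start)
        return _Timer()

reg = MiniRegistry()
with reg.time("db.query"):
    sum(range(10000))          # stand-in for real instrumented work
reg.count("requests")
print(reg.counters["requests"], len(reg.timings["db.query"]))
```

In a real setup, a reporter thread would periodically flush the registry's contents to Graphite/Carbon; here the registry just accumulates in memory.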

Snap

Snap is an API-based toolkit that can be integrated with servers. It is a telemetry agent, not a full-blown analytics application. One of Snap’s benefits is excellent security configuration, including SSL API encryption and encryption of payloads between components. Snap is built around three types of plugins that are pretty intuitive to set up: collectors gather telemetry data, processors convert data for reuse and storage, and publishers ship that data onward. It integrates with a number of common systems such as Facter and OpenStack, and Grafana can also be fed using Snap.
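The three-stage plugin pipeline can be pictured with a toy collect → process → publish chain. To be clear, this is not Snap's actual plugin API (Snap plugins are separate processes wired together by tasks); the sketch only mirrors the data flow between the three plugin types:

```python
# A toy collect -> process -> publish pipeline mirroring Snap's three
# plugin types. All metric names and values are illustrative.
def collector():
    # A collector plugin gathers raw telemetry samples.
    return [{"metric": "cpu.user", "value": 0.42},
            {"metric": "cpu.idle", "value": 0.55}]

def processor(samples):
    # A processor plugin converts data for reuse/storage
    # (here: fractions to percentages).
    return [{**s, "value": round(s["value"] * 100, 1)} for s in samples]

published = []

def publisher(samples):
    # A publisher plugin ships data to a sink; a plain list stands in
    # for Graphite, a file, a message queue, and so on.
    published.extend(samples)

publisher(processor(collector()))
print(published)
```

The appeal of the design is that each stage is swappable: the same collector can feed different publishers just by rewiring the pipeline, with no change to the collection code.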

Dropwizard

Dropwizard is a well-supported Java framework (or set of libraries, depending on your view) with very good documentation. It has simple, out-of-the-box support for metric collection and logging. It works at phenomenal speed and has a lot of community support. The Dropwizard metrics library works with Graphite and a number of reporters such as Splunk and Datadog. A slight downside is that errors are handled as plain text rather than JSON.

cAdvisor


cAdvisor is a visualization tool with native support for Docker and, potentially, any other container. The raw data can be exposed using a RESTful API. It is able to produce graphs of server performance and resource consumption. It is best used with a single Docker host. It doesn’t yet have any robust alerting, but there is a roadmap item to advise on container performance. One common integration is to hook cAdvisor up to Prometheus; CenturyLink wrote an excellent tutorial about that.
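cAdvisor's REST API reports cumulative CPU usage in nanoseconds per container, so turning two samples into "cores in use" means taking a counter delta over a time delta. The sample data below is a heavily trimmed, illustrative stand-in for a real response (actual responses use RFC 3339 timestamp strings and carry many more fields):

```python
# Illustrative, trimmed stand-in for cAdvisor container stats:
# timestamps in seconds, cumulative CPU usage in nanoseconds.
stats = [
    {"timestamp": 0.0,  "cpu": {"usage": {"total": 1_000_000_000}}},
    {"timestamp": 10.0, "cpu": {"usage": {"total": 6_000_000_000}}},
]

def cpu_cores_used(older, newer):
    """Average cores in use over the interval: the cumulative-counter
    delta divided by the wall-clock delta (1e9 ns of CPU time equals
    one core-second)."""
    dt = newer["timestamp"] - older["timestamp"]
    dn = newer["cpu"]["usage"]["total"] - older["cpu"]["usage"]["total"]
    return dn / 1e9 / dt

print(cpu_cores_used(stats[0], stats[1]))  # 0.5 cores on average
```

This delta-over-delta computation is the same one Prometheus performs with `rate()` when it scrapes cAdvisor, which is why the two pair so naturally.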

OpenTracing

OpenTracing supports a number of common programming languages and is a vendor-neutral standard for application instrumentation and simplified tracing. OpenTracing allows you to create a ‘trace’ across the application layer by instrumenting your applications with the OpenTracing API.
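The core abstraction is the span: a named, timed operation that can have a parent, so nested spans form a trace. The stdlib sketch below is a stripped-down stand-in for that idea, not the real `opentracing` package (whose API adds span contexts, tags, logs and inject/extract carriers on top of this same shape):

```python
import time
from contextlib import contextmanager

finished_spans = []

@contextmanager
def start_span(operation_name, parent=None):
    """Toy span: records an operation name, an optional parent, and
    start/finish times, appending to finished_spans on exit."""
    span = {"operation": operation_name, "parent": parent,
            "start": time.time()}
    try:
        yield span
    finally:
        span["finish"] = time.time()
        finished_spans.append(span)

with start_span("http.request") as root:
    with start_span("db.query", parent=root["operation"]):
        pass  # instrumented work would go here

print([s["operation"] for s in finished_spans])  # inner span finishes first
```

Because spans carry parent references, a tracer backend can reassemble them into a tree and show exactly where time went inside a request, which is the payoff of instrumenting at the application layer.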

Application performance tracing is a much more complex topic, and I will elaborate on it in a dedicated post in the future.