Turning The Page: News & Reflections

With the start of 2019 comes an exciting announcement: SignifAI is joining New Relic! Our teams are pumped to work together on our shared vision of bringing machine intelligence to DevOps and SRE teams. You can learn more about the acquisition, what’s coming next, and New Relic’s perspective in this blog post.

As we’ve been taking the first steps of this new journey and working with the New Relic team on the direction for our team and technology, I’ve been reflecting on how SignifAI has evolved throughout the past three years. Here are some thoughts about the past, the future, and what we’ve learned...

SignifAI - The Beginning

The idea for SignifAI originated from experiences my ops team at a prior company had as we scaled our infrastructure to keep up with a quickly growing product. We found ourselves searching for a tool that could cut through the noise of a complex monitoring stack and show us only the things that were most important, in a way that felt as natural and insightful as though we’d done all the manual investigation ourselves. At that time (almost 6 years ago), machine learning was a buzzword used only in academia, the concept of AIOps was nonexistent, and the industry’s shiny new slogans were BigData and Hadoop.

I remember spending days searching for solutions, as I had a fairly large budget, and couldn’t find any platform or tool to do what I had envisioned. I was frustrated. How could it be? I remember starting deeper technical research on areas I believed could help solve our problem, and along the way I did a massive amount of reading, research, and hands-on experimentation. I learned a lot about data pipelines and ingestion, real-time analysis, MapReduce jobs, distributed systems, clustering, and many different algorithms. I actually focused mainly on solutions from completely different domains to try to learn from them and see if they could apply to operations.

After tons of research and experimentation with real, high-volume production data, lots of failure and lots of learning, we started to build the first version of SignifAI. In the beginning, it was a core expert system integrated into an automated pipeline with some logic, some automation and some minimal ML on events and time-series data. By early 2014, we had a core system (completely command-line based with a bunch of config files) finding correlations in our monitoring data and showing us only the most important issues. After experiencing the technical success, I had this internal conviction - we had achieved something great here, and we could take the same ideas to create a SaaS-based platform to help other teams.

I wasn’t kidding, and I knew there was still a lot to experiment with and tons of work to build a real product, but that spark of belief - the passion of solving a large and difficult problem for other teams in a relatively generic way, and the proof points we achieved - was what motivated me to decide to rebuild it as an independent product. SignifAI was born with a strong purpose to empower other SRE and DevOps teams with powerful technology and an open-platform approach. We were (and still are) on a mission to change the way digital businesses analyze and understand their technical production environment’s uptime, reliability and availability. Delivered as a SaaS-based machine intelligence platform and connected to your existing tools and workflows, SignifAI helps Site Reliability Engineers maximize their day-to-day service level objectives.

As SignifAI grew, we developed features that shaped the core value of the platform: integrations for over 60 sensors and our innovative approach for Active Collection vs. a webhook-only, passive approach; the Control Center, a single pane of glass for monitoring across your full stack; Teams, empowering our enterprise users with more fine-grained control; and Decisions, which gave teams access to better understand and customize the logic that drives correlated Issues in the platform. We launched special offerings for teams using Prometheus and OpenShift. We released Chewie, a streamlined version of the platform that plugs directly into teams’ existing incident management services.

Through this journey, we learned more than we could have imagined about the quickly-changing world of Site Reliability Engineering and the elements of a successful product.

What We’ve Learned

In the past 3 years, we’ve talked to hundreds of DevOps and SRE teams about their experiences: the stress of being on-call, struggles to maintain an infrastructure stack that constantly becomes more complex, victories in proactive problem-solving, and hopes for the future of IT operations. There have been many lessons along the way - here are three of them:

Attack shared frustration

Every team we worked with is different in the tools they use, their authority and responsibility structure, and the way they measure success, but they all share the same vision: a streamlined, highly automated system that minimizes stressful and costly issues and enables engineers to do proactive, sustainable, creative work.

The greatest challenge we had in developing the SignifAI platform was designing features that were useful and accessible to many kinds of teams - everything from traditional NOC/ops to Google-esque, highly automated SRE environments. Our best successes came in prioritizing solutions to problems that were shared by every team, regardless of their size or level of sophistication: easy-to-integrate sensors that didn’t take hours to set up, simple links between Issues and communication/collaboration tools, a combined view for alerts from multiple tools, etc. SignifAI’s most-loved features worked towards the shared SRE vision. It might seem pretty obvious, but it actually was not. There were tons of tradeoffs we needed to make in order to generalize as much as possible for a working solution that met the needs of the majority of our users and the industry. Remember, this is a pretty fragmented market with various solutions: third-party, open-source and custom-built internal tools, each with different flavors and technical requirements.

Prioritize accessibility and understanding

All of the sophisticated tech in the world isn’t useful to our users unless they can see, understand, and tailor the system’s logic to fit their system. We learned that it was imperative to make sure every decision SignifAI made, from correlating and categorizing important Issues to suppressing or delaying noisy ones, was easy for users to understand and give feedback on. When we developed the Decision engine, we kept this core idea in mind: machine learning is only as useful as it is accessible. We see other solutions making the mistake of considering the algorithms and models smarter than the person on call. Pretty early on, we set one of our core values to view our platform as an augmented team member and as an extension to humans - not a replacement. We have also learned that the use case we are solving for is simply not one that can tolerate time-consuming model training, and we could not expect the first responders during an on-call shift to focus on training and adjustments.

Don’t be afraid to adapt and change

Like any startup, SignifAI went through a series of changes as we worked to find a suite of features that served our customers’ needs. One of the major lessons we learned from working with customers is that for teams who already have established workflows around incident management, triaging, and on-call collaboration, introducing a new tool that sits “in the middle” of the stack can be incredibly difficult and requires a lot of trust and training. Our core platform, which uses sensors to connect to monitoring tools and pushes correlated Issues to a triaging tool, enabled us to gather the most (and most useful) data from customers’ systems without needing tons of configuration. However, the workflow change introduced by adding the platform didn’t work for every team. That’s why we introduced Chewie, a solution that plugs directly into teams’ existing incident management tools.

Adapting and introducing new ideas can feel terrifying, like giving up on a dream - but the most powerful lesson I’ve learned throughout SignifAI’s journey has been that adaptation is the greatest opportunity to create something better than you could have imagined. Instead of treating changes as “letting go,” treat them as chances for creative freedom and a more open mind. As long as you’re still working in service of your big-picture vision, changes are healthy and could represent a breakthrough for your product.

Looking Forward

Joining the New Relic team is incredibly exciting - I can’t wait to see what comes from our shared ambitions of creating a platform that increases automation, truly understands problems, infers reasoning and suggests solutions using Applied Intelligence. There is a lot to write about the decision we made, the reasons we chose to join New Relic, and what has changed over time, but one thing is for sure - our belief and vision remain the same. When I picture the future of IT ops, I hope for more solutions like SignifAI and New Relic that help reduce the anxiety of being on-call and empower SREs to do their best work every day. We are committed to continuing to push ourselves and our shared product to get there as fast as possible.

Achieving Increased Uptime with Machine Intelligence

I am pleased to announce today the general availability of SignifAI’s machine intelligence platform! This product release is the realization of all the hard-learned lessons from my colleagues’ and my last 10 years working in TechOps.

What is SignifAI?

SignifAI is a cloud-based machine intelligence platform that integrates, mostly over API, with more than 60 monitoring tools (commercial and open source) like New Relic, AppDynamics, Splunk, Prometheus, Nagios and many more. It gives TechOps fast answers and predictive insights to the most critical issues that threaten system uptime. We accomplish this by combining machine intelligence with a TechOps team’s expertise, not generic learning algorithms.

Why SignifAI?

In our experience as TechOps engineers, despite having a variety of monitoring tools, we still operated in a highly reactive mode. Because we were constantly fighting fires, we never found the time to implement the processes and systems that would allow us to identify issues before they caused outages. Frustrated by the available solutions, we decided last year to take a new approach to uptime using machine intelligence. Specifically, the problems we wanted to solve were:

  • No more siloed data: Monitoring systems with incompatible data formats made it challenging to make meaningful correlations across the application and infrastructure stack. Being able to create a meaningful understanding between triggered events and multiple time-series streams is not easy.

  • Being proactive in the collection phase: In the past, we have missed very important events from existing solutions. We have learned (the hard way) that waiting for events is not enough. Being proactive in the collection phase has a huge advantage, but it’s hard and requires expertise.

  • Alert noise: Having lots of monitoring tools leads to alert overload, making it hard to prioritize the most important issues.

  • Inefficient root cause analysis: Correlating large volumes of log, event and metrics data in real-time was challenging even for the most experienced engineers.

  • Cumbersome knowledge capture: We often struggled with how to document and share “lessons learned” so that applicable solutions from the past could be applied to currently open issues in an automated way. We always wanted something that could detect potential issues based on our previous experience.
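To make the alert-noise problem above concrete, here is a minimal Python sketch of one simple way to collapse related alerts from several tools into a single issue: group alerts for the same service that arrive within a short time window. This is an illustration only, not SignifAI's correlation logic; the Alert and Issue types, their fields, and the 300-second window are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    # Hypothetical alert shape; real tools each use their own format.
    source: str       # monitoring tool that emitted the alert
    service: str      # service the alert refers to
    timestamp: float  # seconds since epoch
    message: str


@dataclass
class Issue:
    service: str
    alerts: List[Alert] = field(default_factory=list)


def group_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[Issue]:
    """Group alerts per service when each arrives within window_seconds of the previous one."""
    issues: List[Issue] = []
    latest: Dict[str, Issue] = {}  # most recent issue per service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        issue = latest.get(alert.service)
        if issue is not None and alert.timestamp - issue.alerts[-1].timestamp <= window_seconds:
            # Close enough in time to the previous alert: same issue.
            issue.alerts.append(alert)
        else:
            # First alert for this service, or the gap was too long: new issue.
            issue = Issue(service=alert.service, alerts=[alert])
            issues.append(issue)
            latest[alert.service] = issue
    return issues


sample = [
    Alert("nagios", "checkout", 0.0, "HTTP check failed"),
    Alert("newrelic", "checkout", 120.0, "Error rate above threshold"),
    Alert("prometheus", "search", 130.0, "High latency"),
    Alert("nagios", "checkout", 1000.0, "HTTP check failed"),
]
issues = group_alerts(sample)
```

In this sample, four alerts from three tools collapse into three issues, with the first two checkout alerts sharing one. A real system would need much richer signals than service name and time proximity.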

How SignifAI works

“A picture is worth a thousand words,” as they say, and I think the best way to understand exactly how SignifAI works is to see it in action. Check out this short 2-minute demo that walks you through all the key features of the platform, including:

  • Collecting data with sensors

  • Responding to alerts

  • Performing root cause analysis

  • Capturing your expertise and knowledge

  • Working with predictive Insights and Answers™

In upcoming blogs, I’m going to review in depth some of the technologies we are using such as:

  • Our data pipeline: Focusing on active collection over APIs.

  • Algorithms: Some of the statistical outlier-detection algorithms we have decided to implement, especially for time-series data.

  • Event collection and processing: Our correlation engine, decision and evaluation engine, and why our approach is so different and focused on the TechOps use case.

  • Open source: Our contribution to open source and why we have decided to adopt the Snap Open Telemetry Framework and extend it.
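As a taste of the kind of statistical outlier detection mentioned in the list above, here is a minimal Python sketch that flags time-series points using a rolling median and the median absolute deviation (MAD). This post does not say which algorithms SignifAI implements, so treat this as one common, robust technique chosen for illustration; the function name, the window size, and the 3.5 threshold are assumptions.

```python
from statistics import median
from typing import List, Sequence


def mad_outliers(values: Sequence[float], window: int = 5, threshold: float = 3.5) -> List[int]:
    """Return indices whose modified z-score, relative to the preceding window, exceeds threshold."""
    outliers: List[int] = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        med = median(history)
        mad = median([abs(v - med) for v in history])
        if mad == 0:
            # Flat history: any deviation counts as an outlier (a simplifying assumption).
            if values[i] != med:
                outliers.append(i)
            continue
        # 0.6745 scales the MAD so the score is comparable to a standard z-score.
        score = 0.6745 * abs(values[i] - med) / mad
        if score > threshold:
            outliers.append(i)
    return outliers


latency_ms = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]
spikes = mad_outliers(latency_ms)
```

Using the median rather than the mean keeps a single spike from distorting the baseline for the points that follow it, which matters for noisy monitoring metrics.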

Getting started is quick and enabling integrations to the monitoring tools you already use is simple. The typical time to the first predictive insight is less than 20 minutes.

SignifAI believes that when TechOps teams deliver more uptime, they can find the time to work on more complex problems that require creative solutions — precisely the things machines can’t do!

In closing, I would like to emphasise our commitment to the TechOps community. There are great open source monitoring and alerting solutions out there as well as other awesome vendor tools. We are not on a mission to replace and consolidate those. We also believe that we should not capture your data and knowledge without allowing you the ability to export it at any given time. That’s why we have invested in full exporting functionality. Our belief and goal is to make our day-to-day as a community better by empowering people.

Machine Intelligence is a very broad term, and we are well aware that it is often abused and overused. That is exactly why we are bringing to market a first release that is based on many years of experience and a combination of computational techniques, but that puts the expert human at the center.