The open source tools and libraries reviewed below cover the different layers in the stack: from infrastructure and host-based metrics all the way to containers, logs, performance measurement, and specific instrumentation. These are our top ten recommendations, which says a lot given the plethora of open source tools and libraries available to DevOps engineers.

Data Visualisation – Grafana

Grafana offers a library of cool dashboards allowing you to build an at-a-glance view of your applications and infrastructure. It is most commonly used for visualizing time series data for infrastructure and application analytics. It is built for multi-person collaboration and allows you to easily share dashboards.

Grafana dashboards are highly configurable, offering features such as graph styling, drag-and-drop panels, template variables, and full support for Elasticsearch query-based search. It also integrates with a number of other tools and data sources such as InfluxDB, Elasticsearch, Prometheus and more to give you a rich monitoring toolset. It’s easy to start using Grafana: once a Grafana server is running, all you need on the client side is a browser.

Serious Searching – Elasticsearch

Still the standard in our estimation, Elasticsearch is a RESTful API-driven solution that offers intelligent and powerful search features. By utilizing Lucene as its core search engine, it provides you with a tool to tweeze out the data you really need from your analyses, quickly and accurately. Elasticsearch works by taking real-world entities and storing them as structured JSON documents. This makes your data available almost immediately for fast searching. It takes some time to optimize and the learning curve is steep, but it’s worth it.
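
To make the "entities as JSON documents" idea concrete, here is a minimal sketch of the two REST calls involved: storing a document and running a full-text match query. It only builds the requests and does not send them. The node address, the "logs" index, and the sample event are assumptions made up for illustration.

```python
import json
import urllib.request

# Assumption: a local, unsecured Elasticsearch node; adjust for your cluster.
ES_URL = "http://localhost:9200"


def index_request(index, doc_id, document):
    """Build a PUT request that stores one entity as a JSON document."""
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, field, text):
    """Build a POST request that runs a simple full-text match query."""
    query = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical log event, used only for illustration.
    event = {"host": "web-01", "level": "error", "message": "upstream timed out"}
    requests_to_show = (
        index_request("logs", "1", event),
        search_request("logs", "message", "timed out"),
    )
    for req in requests_to_show:
        print(req.get_method(), req.full_url)
        print(req.data.decode("utf-8"))
    # To actually send one of these: urllib.request.urlopen(req)
```

In practice you would more likely use an official client library, but the raw requests show how little stands between a JSON document and a searchable index.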

Great Metric Views – Graphite

Having great performance for your mass-scale solutions is a top priority. To get this high level of performance you need to have insight. This is where Graphite can really help. It is a tool with a long history that allows you to store metrics and visualize them. Although it doesn’t collect the metrics itself, pretty much every open source observability library or tool out there supports Graphite. Once you have those metrics, Graphite offers a powerful way to visualize them. It is easy to customize for individual environments, but scaling it can be a challenge.
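
Part of the reason so many tools can feed Graphite is that its Carbon daemon accepts a very simple plaintext line format: a metric path, a value, and a Unix timestamp. The sketch below formats such lines and, optionally, sends them over TCP. The host name and metric path are assumptions for illustration; port 2003 is Carbon's usual plaintext port, but check your own setup.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one plaintext-protocol line: path, value, timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="localhost", port=2003):
    """Send pre-formatted lines to a Carbon plaintext listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Hypothetical metric path; only printed here, not sent.
    print(format_metric("servers.web-01.cpu.load", 0.42, 1700000000), end="")
```

Calling send_metrics with a list of such lines is all it takes to push data, assuming a Carbon listener is reachable at the given host and port.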

System Monitoring and Alerting Toolkit – Prometheus

Having a configurable and flexible toolkit for your system monitoring and alerting is essential for any DevOps professional. Prometheus is a multi-component kit suitable for any numeric time series; it can store over a million time series in a single instance. It is fast to configure and get up and running, offering a GUI-based dashboard, an alert manager, and a command line querying tool. It also supports multi-dimensional data collection. In our view, Prometheus is definitely the new replacement for Graphite. With many more plug-ins and integration points, it should be at the top of your list if you are thinking about your next observability solution.
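
"Multi-dimensional" here means that a metric name is combined with key/value labels, and each distinct label combination is its own time series. As a rough sketch, the code below renders samples in the text exposition format that Prometheus scrapes. It is a hand-rolled illustration rather than the official client library, and the metric name and label values are made up.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline inside a label value."""
    text = str(value)
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample line, with labels sorted for stable output."""
    if labels:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def render_counter(name, help_text, samples):
    """Render a counter with its HELP and TYPE metadata lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical request counts, one time series per label combination.
    samples = [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ]
    print(render_counter("http_requests_total", "Total HTTP requests.", samples), end="")
```

The two sample lines share one metric name but differ in their labels, which is what lets you slice the same metric by method or status at query time.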

One Framework to Rule Them All – Sensu

The Sensu Monitoring Framework is a monitoring service and metric analysis system for servers, applications, and more. Although it is written in Ruby, it works with web applications written in any programming language. It is a tool meant for consolidation, i.e. it offers a single platform for monitoring all of your company resources and even third-party APIs. It’s a pretty simple tool to set up and can be provisioned using Puppet and other popular configuration management tools.

Sensu was built to accommodate highly scalable cloud infrastructures, so it’s really easy to scale with your infrastructure. There are quite a few tutorials and good blog posts to help with Sensu configuration. Our friends from AppsFlyer also gave a nice talk on the topic.

Best of the Rest – Pyformance, Snap, Dropwizard, cAdvisor, OpenTracing

Pyformance

Pyformance offers a bunch of performance metrics libraries in Python. It works with reporters such as Hosted Graphite and Carbon. Pyformance is a simple toolset that offers a way to capture performance measurements and statistics. The documentation has useful sample code to use with the reporters.

Snap

Snap is an API-based toolkit which can be integrated with servers. It is a telemetry agent and not a full-blown analytics application. One of Snap’s benefits is excellent security configuration, including SSL encryption of the API and encryption of the payload between components. Snap is built around three types of plugin that are pretty intuitive to set up: the first collects telemetry data, the second converts data for reuse and storage, and the third publishes this data. It integrates with a number of common systems such as Facter and OpenStack. Grafana can also be fed using Snap.
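
The collect, convert, publish flow can be pictured with a small conceptual sketch. This is not Snap's plugin API (Snap has its own plugin libraries); the function names, metric namespaces, and values are all hypothetical and only illustrate how the three stages hand data to each other.

```python
import json
import time


def collect():
    """Collector stage: gather raw telemetry (hard-coded, hypothetical values)."""
    now = time.time()
    return [
        {"namespace": "/demo/cpu/load", "value": 0.42, "timestamp": now},
        {"namespace": "/demo/mem/used_bytes", "value": 512 * 1024 * 1024, "timestamp": now},
    ]


def process(metrics, tags):
    """Processor stage: attach tags to each metric without mutating the input."""
    return [{**metric, "tags": dict(tags)} for metric in metrics]


def publish(metrics, sink):
    """Publisher stage: serialise each metric as one JSON line and pass it to a sink."""
    for metric in metrics:
        sink(json.dumps(metric, sort_keys=True))


if __name__ == "__main__":
    # Here the sink is print; a real publisher would write to a store or queue.
    publish(process(collect(), {"host": "web-01"}), print)
```

Keeping the stages separate is what makes it possible to swap one publisher for another without touching collection.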

Dropwizard

Dropwizard is a well-supported Java framework (or set of libraries, depending on your view) with very good documentation. It has simple, out-of-the-box support for metric collection and logging. It works at phenomenal speed and has a lot of community support. The Dropwizard metrics library works with Graphite and a number of reporters such as Splunk and DataDog. A slight downside is that errors are handled as plain text and not as JSON.

cAdvisor

cAdvisor is a visualization tool with native support for Docker and potentially any other container. The raw data can be exposed using a RESTful API. It is able to produce graphs of server performance and resource consumption. It is best used with a single Docker host. It doesn’t yet have any robust alerts, but there is a roadmap item to advise on container performance. One of the common integration use cases is to hook cAdvisor up to Prometheus. CenturyLink wrote an excellent tutorial about that.

OpenTracing

OpenTracing is a standard for application instrumentation that simplifies tracing, and it supports a number of common programming languages. OpenTracing allows you to create a ‘trace’ across the application layer by instrumenting your applications using the OpenTracing API.

Application performance tracing is a much more complex topic, and I will elaborate on it in a dedicated post in the future.