Why Data Cleansing is Important for AIOps

Enterprises' reliance on data, and the sheer quantity of it, is growing like wildfire. In EMC's The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things, analysts estimated that the digital universe is doubling in size every two years and predicted it will multiply 10-fold between 2013 and 2020 – from 4.4 trillion gigabytes to 44 trillion gigabytes.

Unfortunately, deriving meaningful insights from all this data and converting it into action is easier said than done. The tremendous amount of data is one thing — there’s a reason it’s called big data. The bigger issue is bad data quality.

The fact is, data isn't always usable as is, and preparing it so it can be used (a process known as data cleaning or data cleansing) is typically slow, difficult, and tedious. With more companies applying DevOps principles to big data projects, the delays inherent in the data cleaning process can have serious ramifications and negate the benefits that DevOps is supposed to provide.

That’s because DevOps relies on rapid deployment frequency. That can’t happen when the data science teams and developers who rely on quick access to usable data must spend more time making “bad” data usable than actually using it.

The Need for Clean Data

By some estimates, so-called “dirty” data costs the US economy up to $3.1 trillion a year. That’s not surprising given that poor data quality can lead to inaccurate data analytics results and drive misguided decision making — both of which are detrimental to data scientists and developers alike. It can also expose companies to compliance issues since many are subject to requirements to ensure that their data is as accurate and current as possible.

Process architecture and process management can help reduce the potential for bad data quality at the front end, but can’t eliminate it. The solution, then, lies in making bad data usable by detecting and removing or correcting errors and inconsistencies in a data set or database — data cleansing.

The Challenge of Data Cleansing

Unfortunately, data cleansing is a time-consuming endeavor. A survey conducted by CrowdFlower, a provider of a data enrichment platform for data scientists, reported that data scientists spent 60% of their time on cleaning and organizing data and 19% of their time collecting data sets. That adds up to almost 80% of their time devoted to preparing and managing data for analysis.

Part of the problem is that data cleansing is a complex, multi-stage process. Best practices entail employing a detailed data analysis as a first step in detecting which kinds of errors and inconsistencies must be removed. In addition to a manual inspection of the data or data samples, analysis programs are often needed to gain metadata about the data properties and detect data quality problems.
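As an illustration of that first analysis step, a quick profiling pass, sketched here with pandas against a hypothetical raw_events.csv export, can surface many quality problems before any cleansing rules are written:

import pandas as pd

# Hypothetical raw export; the file and column names are placeholders
df = pd.read_csv("raw_events.csv")

print(df.dtypes)                    # mis-parsed columns often show up as 'object'
print(df.describe(include="all"))   # ranges, cardinality, and obvious outliers
print(df.isnull().sum())            # missing values per column
print(df.duplicated().sum())        # exact duplicate rows
print(df["timestamp"].min(), df["timestamp"].max())  # suspicious time ranges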

Software that employs machine learning helps, but because data can come from any number of disparate sources, the data cleansing process also requires getting data into a consistent format for easier usability and to ensure it all has the same shape and schema. Depending on the number of data sources, their degree of heterogeneity, and how bad the quality of the data is, data transformation steps may be required as well. Then, the effectiveness of a transformation workflow and the transformation definitions must be tested and evaluated. Multiple iterations of the analysis, design, and verification steps may also be needed.
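What a single transformation step looks like depends entirely on the data at hand, but as a minimal sketch (the field names and rules here are hypothetical), each record is coerced into one agreed-upon shape before being loaded anywhere else:

from datetime import datetime, timezone

def to_utc(ts):
    # Accept either epoch seconds or ISO-8601 strings (an assumption about the sources)
    if isinstance(ts, (int, float)):
        return datetime.fromtimestamp(ts, tz=timezone.utc)
    return datetime.fromisoformat(str(ts)).astimezone(timezone.utc)

def clean_record(raw):
    """Coerce one raw record into the common shape used downstream."""
    return {
        "host": (raw.get("host") or "").strip().lower() or None,
        "metric": (raw.get("metric") or "").strip(),
        "timestamp": to_utc(raw["timestamp"]),
        "value": float(raw["value"]) if raw.get("value") not in (None, "") else None,
    }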

After errors are removed, the cleansed data must replace the bad data in the original sources. This ensures that legacy applications have the updated data as well, minimizing potential rework for future data extractions.

There are other time-consuming challenges as well. For example, the available information on the anomalies is often insufficient to determine how to correct them, leaving data deletion as the only option. However, deleting the data means losing information. Then there’s the fact that data cleansing is not a one-time thing. The process must be repeated every time data is accessed or values change.

Data Cleansing for DevOps and Big Data

Data cleansing has important implications for DevOps teams as they take on big data projects. DevOps breaks down silos and promotes collaboration and communication between data scientists, developers, and others tasked with analyzing big data and using the insights to drive smart business decision making. Speed is what drives the demand for DevOps, but dealing with the volume, velocity, and variety of big data can't help but slow down processes, whether they are specific to software development or a data science project.

Big data is complex by nature, with its ever-increasing accumulation of unstructured and semi-structured data from a myriad of sources, including sensors, mobile devices, network traffic, Web servers, custom applications, application servers, GPS systems, stock market feeds, and social media, as well as data from structured databases, logs, config files, messages, alerts, scripts, development feedback loops, and metrics. The list goes on. Disparate sources of data translate into equally disparate formats: an online web application might send data in SOAP/XML format over HTTP, a feed might arrive as a CSV file, and devices might communicate over the MQTT protocol. Data science teams can't make sense of the data until it's transformed into a unified form, and that creates a significant bottleneck that even DevOps, on its own, can't overcome. A rough sketch of that unification step appears below.
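To make the example above concrete, here is a minimal sketch of what that unification might look like in Python; the payload shapes and field names are assumptions, not a prescription:

import csv
import io
import json
import xml.etree.ElementTree as ET

def from_xml(payload):
    # e.g. "<event><source>web</source><value>3.2</value></event>" (hypothetical shape)
    root = ET.fromstring(payload)
    return {"source": root.findtext("source"), "value": float(root.findtext("value"))}

def from_csv(line):
    # e.g. "feed,4.1" (hypothetical column order)
    source, value = next(csv.reader(io.StringIO(line)))
    return {"source": source, "value": float(value)}

def from_mqtt(payload):
    # Assumes devices publish JSON over MQTT, which is common but not universal
    message = json.loads(payload)
    return {"source": message["device_id"], "value": float(message["reading"])}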

Consider the challenge facing DevOps teams using metrics, presented as numbers, and logs, presented as text, to understand and improve the performance of code, services, and infrastructure environment in production. Effectively integrating log data and performance metrics within a network can greatly reduce time in resolving critical issues, and simplify the distribution of data to customers and developers.

The problem is that logs and metrics come in varied forms, making correlation and analysis difficult for each and next to impossible between the two. Metric data is short and structured: beyond the measured value itself, it describes the measurement's type, location, time, and grouping. Logs, generated by infrastructure or applications, are used to provide operational teams with as much specific detail as possible to help them analyze a specific operational or security event. As a result, they tend to be longer than metrics and can come in a variety of shapes and forms. While some logs are standardized, their formats are often defined by the developers.

Data cleansing not only removes errors from both data types. It also transforms log data and metrics into a common format, providing teams with shared views and insights across the entire application environment. That helps speed up remediation of code issues and the frequency of production code updates, and helps teams understand the impact their code is making at any production stage and scale.
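As a rough illustration of that transformation (the event schema and log layout below are assumptions), both a metric sample and a log line can be mapped onto one shared event shape:

import re

# Assumed log layout: "<timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>\w+)\s+(?P<msg>.*)$")

def event_from_metric(name, value, timestamp, tags=None):
    return {"timestamp": timestamp, "type": "metric", "name": name,
            "value": value, "tags": tags or {}, "message": None}

def event_from_log(line):
    match = LOG_PATTERN.match(line)
    return {"timestamp": match.group("ts"), "type": "log",
            "name": match.group("level").lower(), "value": None,
            "tags": {}, "message": match.group("msg")}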

Automated Data Cleansing to the Rescue

Fortunately, as both big data and DevOps become "business as usual," the need to overcome the data cleansing hurdle is driving the development of technologies that can automate data cleansing processes to help accelerate business analytics. IDC predicts that through 2020, spending on data preparation tools will grow two and a half times faster than spending on traditional IT-controlled tools with similar functionality.

Less time spent on data cleansing will result in closer to real-time analysis of incoming data — regardless of its original format, which in turn will produce faster, actionable information. Ensuring data is clean and presented in a common format will also help eliminate rework, enabling DevOps teams to perform quick root cause analyses on issues so problems can be addressed quickly. That’s essential because quick resolution of problems assures that data pipelines can keep flowing continuously.

Not Your Father's Python: Amazingly Powerful Frameworks

When we were getting SignifAI off the ground, one of the biggest decisions we had to make right at the beginning was what our stack would be. We had to make a typical choice: use a relatively new language like Go or an old, solid one like Python. There are many good reasons for choosing Go, especially when looking for high performance, but because we already had significant experience with Python, we decided to continue with it. It's important to note, though, that our product and infrastructure must support hundreds of thousands of events per second, as we collect massive amounts of events and data points from multiple sources for our customers. Each event should be processed in real time, as fast as possible, so we care about latency as well. That's not trivial, especially since we need to minimize our compute costs and be as efficient as possible.

So we were happy to see that, with the recent widespread adoption of Python 3 and the introduction of tasks and coroutines as first-class citizens in the language, Python has recently stepped up its game. Over the past few years, what started with Tornado and Twisted and the introduction of the asyncio module in Python 3 has continued evolving into a new wave of libraries that disrupt and change old assumptions about Python performance for web applications. Teams that want to build blazing-fast applications can now do so in Python nearly seamlessly without having to switch into a “faster” language.

The challenge of Python’s GIL

Despite Python’s many advantages, the design of its interpreter comes with what some might consider a fatal flaw: the global interpreter lock, or GIL. As an interpreted language that enables multi-threaded programming, Python relies on the GIL to make sure that multiple threads do not access or act upon a single object simultaneously (which can lead to discrepancies, errors, corruption, and deadlocks). As a result, even in a multi-threaded program that technically allows concurrency, only one thread is executed at a time—almost as if it were, indeed, a single-threaded program. This occurs because the GIL latches onto a thread being executed, periodically releases it, checks to see if other threads are requesting entry, then reacquires the original thread. If another thread wants access, the GIL acquires this second thread instead, thus blocking access to the first, and so on.

Multiple threads are necessary for Python to take advantage of multi-core systems, share memory among threads, and execute parallel functions. In turn, the GIL is necessary to keep garbage collection working and avoid conflicts with objects that would destabilize the entire program. But the GIL can be a nuisance, especially when it comes to speed. Often, a multi-threaded program runs slower than the same program executed in a single thread because the threads have to wait for the GIL in order to complete their tasks. These delays in running threaded programs are what make the GIL such a headache for Python developers.
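A small experiment makes the effect easy to see. The sketch below runs the same CPU-bound loop first in one thread and then in two threads; on CPython, the threaded version is usually no faster and often slower:

import threading
import time

def busy(n):
    # Pure-Python, CPU-bound work: the GIL lets only one thread run it at a time
    while n:
        n -= 1

N = 20000000

start = time.perf_counter()
busy(N)
busy(N)
print("single thread:", round(time.perf_counter() - start, 2), "s")

start = time.perf_counter()
threads = [threading.Thread(target=busy, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two threads:", round(time.perf_counter() - start, 2), "s")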

First wave of solutions

So why not just get rid of the GIL? As it turns out, that’s not so simple. Early patches in the 1990s tested removing the GIL and replacing it with granular locks around sensitive operations. While this did increase the execution speed for multi-threaded programs, it took nearly twice as long to execute programs with single threads.

Since then, a number of workarounds have come on the market that are effective only to some extent. For instance, interpreters like Jython and IronPython don't use a GIL, thus granting full access to multiple cores. However, they have their own drawbacks, namely problems with supporting C extensions and, in some situations, running even slower than the traditional CPython interpreter. Another recommendation for bypassing the GIL is to use multiple processes instead of multiple threads. While this certainly allows true concurrency, multiple processes cannot share memory directly, short-lived processes result in excessive message passing, and copies must be made of all information, often multiple times, which slows things down. Solutions such as NumPy extend Python's capabilities for serious number-crunching by moving parts of the application into an external library; this alleviates some of the GIL's problems, but it, too, has limitations in that it relies on a single core to handle all of the execution.
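For completeness, the multi-process route looks roughly like this: each worker is a separate interpreter with its own GIL, so CPU-bound work truly runs in parallel, at the cost of copying data between processes.

from multiprocessing import Pool

def busy(n):
    while n:
        n -= 1
    return n

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # Arguments and results are pickled and copied between processes
        pool.map(busy, [20000000] * 4)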

When Twisted and Tornado came on the scene with Python 2 and early versions of Python 3, they presented intriguing possibilities. They don't get rid of the GIL, but they—similar to Python's asyncio—are asynchronous networking libraries that rely on non-blocking network I/O. In essence, these libraries use callbacks to run an event loop on a single thread, allowing I/O and CPU tasks to be interleaved. (Even though only one task runs at a time, the loop takes advantage of the inevitable downtime while I/O is in flight.) These libraries also allow quick context switches, furthering efficiency because tasks are picked up as soon as they're ready rather than lying idle. While Tornado and Twisted technically use threads and coroutines, they exist under the surface and are invisible to the programmer. Because asynchronous libraries sidestep the cost of thread context switching, they are able to run much faster than the older methods and thus have become quite popular. Nonetheless, they're still not perfect. Callback-based asynchronicity results in long chains of callbacks that make it hard to handle exceptions, debug problems, or gather tasks.
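For contrast, here is a minimal sketch of the coroutine style that asyncio and the newer libraries encourage: the code reads sequentially, yet the three simulated I/O tasks overlap and finish in roughly the time of the slowest one.

import asyncio

async def fetch(name, delay):
    # Simulated non-blocking I/O; while one task awaits, the loop runs the others
    await asyncio.sleep(delay)
    return "{} done after {}s".format(name, delay)

async def main():
    results = await asyncio.gather(
        fetch("metrics", 1),
        fetch("logs", 2),
        fetch("events", 3),
    )
    print(results)   # completes in about 3 seconds, not 6

loop = asyncio.get_event_loop()
loop.run_until_complete(main())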

With their rise in popularity, Twisted and Tornado set the scene for new libraries that have dramatically changed Python by finally enabling very fast processing.

The new face of Python

The speed at which Python can execute tasks has skyrocketed with the introduction of a new type of library. Here, we review three of these — UVLoop, Sanic, and Japronto — to examine just how dramatic the change in performance can really be.

UVLoop is the first of these ultra-fast asynchronous frameworks: a drop-in replacement for Python 3.5's built-in asyncio event loop. Both Japronto and Sanic, which are also reviewed in this post, are based on UVLoop.

This framework's core functionality is basically to make asyncio run faster. It claims to be twice as fast as any other asynchronous Python framework, including gevent. Because UVLoop is written in Cython and uses the libuv library, it's able to create a much faster loop than the default asyncio one.

A UVLoop protocol benchmark shows more than 100,000 requests per second for 1 KiB in a test, compared to about 45,000 requests per second for asyncio. In the same test, Tornado and Twisted only handled about 20,000.

Performance results of HTTP requests with httptools and uvloop against Node.js and Go

To make asyncio use the event loop provided by uvloop, you install the uvloop event loop policy:

import asyncio
import uvloop
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

Alternatively, you can create an instance of the loop manually, using:

import asyncio
import uvloop
loop = uvloop.new_event_loop()
asyncio.set_event_loop(loop)

Sanic is another very fast and stable framework that runs on Python 3.5 and higher. Sanic is similar to Flask but can also support asynchronous request handlers, making it compatible with Python 3.5's new async/await functions, which give it non-blocking capabilities and enhance its speed even further.

In a benchmark test using one process and 100 connections, Sanic handled 33,342 requests per second, with an average latency of 2.96 ms. This was higher than the next-highest server, Wheezy, which handled 20,244 requests per second with a latency of 4.94 ms. Tornado clocked in at a mere 2,138 requests per second and a latency of 46.66 ms.

Here’s what an async run looks like:

from sanic import Sanic
from sanic.response import json
from signal import signal, SIGINT
import asyncio
import uvloop

app = Sanic(__name__)

@app.route("/")
async def test(request):
    return json({"answer": "42"})

# Run Sanic on a uvloop event loop and stop it cleanly on Ctrl+C
asyncio.set_event_loop(uvloop.new_event_loop())
server = app.create_server(host="0.0.0.0", port=8001)
loop = asyncio.get_event_loop()
task = asyncio.ensure_future(server)
signal(SIGINT, lambda s, f: loop.stop())
try:
    loop.run_forever()
except:
    loop.stop()

Japronto is the latest asynchronous micro-framework designed to be fast, scalable, and lightweight; it was officially announced around January of this year. And fast it is: for a simple application running a single worker process with one thread, 100 connections, and 24 simultaneous pipelined requests per connection, Japronto claims to handle more than 1.2 million requests per second, compared to 54,502 for Go and 2,968 for Tornado.

Part of Japronto’s magic comes from its optimization of HTTP pipelining, which doesn’t require the client to wait for a response before sending the next request. Therefore, Japronto receives data, parses multiple requests, and executes them as quickly as possible in the order they’re received. Then it glues the responses and writes them back in a single system call.
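To illustrate what pipelining means on the wire (the host and port here are assumptions for a locally running server), a client simply writes the next request without waiting for the previous response:

import socket

request = b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"

with socket.create_connection(("localhost", 8080)) as sock:
    sock.sendall(request * 2)           # second request sent before any response is read
    print(sock.recv(65536).decode())    # responses arrive back to back, in order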

Japronto relies on the excellent picohttpparser C library for parsing the status line, headers, and chunked HTTP message bodies. Picohttpparser directly employs the text-processing instructions found in modern CPUs with SSE4.2 extensions to quickly match the boundaries of HTTP tokens. The I/O is handled by uvloop, which itself is a wrapper around libuv. At the lowest level, this is a bridge to the epoll system call, providing asynchronous notifications on read-write readiness.

We think Japronto is the most advanced framework so far, but it's pretty young and needs much more work and stability improvements from the community. That said, it's definitely something we are excited to be part of.

Here's an example of Japronto:

import asyncio
from japronto import Application

# This is a synchronous handler.
def synchronous(request):
    return request.Response(text='I am synchronous!')

# This is an asynchronous handler; it spends most of its time in the event loop.
# It wakes up every second to print, and finally returns after 3 seconds.
# This lets other handlers execute in the same process while, from the point
# of view of the client, this one took 3 seconds to complete.
async def asynchronous(request):
    for i in range(1, 4):
        await asyncio.sleep(1)
        print(i, 'seconds elapsed')
    return request.Response(text='3 seconds elapsed')

app = Application()
r = app.router
r.add_route('/sync', synchronous)
r.add_route('/async', asynchronous)
app.run()

The future of Python is here

Overall, it looks like fast, asynchronous Python might be here to stay. Now that asyncio ships as a standard part of Python and the async/await syntax has found favor among developers, the GIL doesn't seem like such a roadblock anymore, and speed doesn't need to be a sacrifice. But the dramatic shifts in performance come from these new libraries and frameworks that continue building on the asynchronous trend. Most likely, we'll continue to see Japronto, Sanic, UVLoop, and many others only getting better.

So before moving to a different language in search of faster processing, we encourage you to explore the latest capabilities of Python 3.5 and some of the modern frameworks described in this post. Oh, and by the way, if you are still using Python 2.x, you are using a legacy version; we highly encourage you to upgrade to the current, standard version.

The main question still pending, however, is how to effectively monitor these frameworks, and any other highly scaled Python application, in production.

10 Open Source Projects I Recommend as You Are Building Your Monitoring Stack

The open source tools and libraries reviewed below cover the different layers in the stack, from infrastructure and host-based metrics all the way to containers, logs, performance measurement, and specific instrumentation. These are our top ten recommendations, which says a lot given the plethora of open source tools and libraries available to DevOps engineers.

Data Visualisation – Grafana


Grafana offers a library of cool dashboards allowing you to build an at-a-glance view of your applications and infrastructure. It is most commonly used for visualizing time series data for infrastructure and application analytics. It is built for multi-person collaboration and allows you to easily share dashboards.

Grafana dashboards are highly configurable, offering features such as graph styling, drag-and-drop panels, template variable definition control, and full support for Elasticsearch query-based search. It also integrates with a number of other tools and data sources, such as InfluxDB, Elasticsearch, Prometheus, and more, to give you a rich monitoring toolset. It's easy to start using Grafana, as the basic setup requires only a browser on the client side.
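Dashboards can also be managed programmatically through Grafana's HTTP API. The sketch below assumes a local Grafana instance and an API key you have created, and pushes a minimal (empty) dashboard; in practice, most people build panels in the UI and export the JSON.

import requests

GRAFANA_URL = "http://localhost:3000"       # assumed local install
API_KEY = "REPLACE_WITH_YOUR_API_KEY"

payload = {
    "dashboard": {"id": None, "title": "Service overview", "panels": []},  # hypothetical dashboard
    "overwrite": True,
}

response = requests.post(
    GRAFANA_URL + "/api/dashboards/db",
    json=payload,
    headers={"Authorization": "Bearer " + API_KEY},
)
print(response.status_code, response.json())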

Serious Searching – Elasticsearch



Still the standard in our estimation, Elasticsearch is a RESTful, API-driven solution that offers intelligent and powerful search features. By utilizing Lucene as its core search engine, it gives you a tool to tease out the data you really need from your analyses, quickly and accurately. Elasticsearch works by taking real-world entities and storing them as structured JSON documents, which makes your data available immediately for fast searching. It can take a little time and a steep learning curve to optimize, but it's worth it.
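A minimal sketch of that workflow against a local node looks like this; the index name and document fields are made up, and exact URL paths and response shapes vary a bit between Elasticsearch versions:

import requests

ES = "http://localhost:9200"   # assumed local node

# Store a real-world entity as a structured JSON document
doc = {"service": "checkout", "status": 500, "message": "timeout calling payments"}
requests.put(ES + "/logs/_doc/1", json=doc, params={"refresh": "true"})

# Search it back using the query DSL
query = {"query": {"match": {"message": "timeout"}}}
result = requests.post(ES + "/logs/_search", json=query).json()
for hit in result["hits"]["hits"]:
    print(hit["_source"])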

Great Metric Views – Graphite




Having great performance for your mass-scale solutions is a top priority, and to reach that level of performance you need insight. This is where Graphite can really help. It is a tool with a long history that allows you to manage metrics and visualizations. Although it doesn't collect the metrics itself, pretty much every observability open source library or tool out there supports Graphite. Once you have those metrics, Graphite offers a powerful way to visualize them. It is an ideal tool to customize for individual environments, but it can have some challenges in scaling.
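Getting metrics into Graphite is refreshingly simple: its Carbon listener accepts a plaintext "path value timestamp" line over TCP (port 2003 by default). The host name below is a placeholder:

import socket
import time

sock = socket.create_connection(("graphite.example.com", 2003))   # placeholder host
line = "app.checkout.response_time 0.42 {}\n".format(int(time.time()))
sock.sendall(line.encode("ascii"))
sock.close()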

System Monitoring and Alerting Toolkit – Prometheus

Having a configurable and flexible toolkit for your system monitoring and alerting is essential for any DevOps professional. Prometheus is a multi-component toolkit suitable for any numeric time series; it can store over a million time series in a single instance. It is fast to configure and get up and running, offering a GUI-based dashboard, an alert manager, and a command-line querying tool, and it supports multi-dimensional data collection. Prometheus is definitely the new Graphite replacement: with many more plug-ins and integration points, it should be at the top of your list if you are thinking about your next observability solution.
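Instrumenting a Python service for Prometheus is straightforward with the official prometheus_client library; the metric names below are our own example, and Prometheus is then pointed at the /metrics endpoint the library exposes:

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://<host>:8000/metrics
    while True:
        with LATENCY.time():
            time.sleep(random.random() / 10)   # stand-in for real work
        REQUESTS.inc()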

One Framework to Rule Them All – Sensu

The Sensu Monitoring Framework is a monitoring service and metric analysis system for servers, applications, and more. Written in Ruby, it works with web applications written in any programming language. It is a tool meant for consolidation, i.e., it offers a single platform for monitoring all of your company's resources and even third-party APIs. It's a pretty simple tool to set up and can be provisioned using Puppet and other popular configuration management tools.

Sensu was built to accommodate highly scalable cloud infrastructures, so it’s really easy to scale with your infrastructure. There are quite a few tutorials and good blog posts to help with Sensu configuration. Our friends from AppsFlyer also gave a nice talk on the topic.
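If you just need to push a result into Sensu from a script, the 1.x client exposes a local input socket (TCP port 3030 by default) that accepts JSON check results; the check name and output below are hypothetical:

import json
import socket

result = {"name": "nightly_etl", "output": "ETL finished in 93s", "status": 0}  # 0 = OK

sock = socket.create_connection(("localhost", 3030))   # Sensu client socket
sock.sendall(json.dumps(result).encode("utf-8"))
sock.close()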

Best of the Rest – Pyformance, Snap, Dropwizard, cAdvisor, OpenTracing

Pyformance

Pyformance offers a bunch of performance metrics libraries in Python. It works with reporters such as hosted Graphite and Carbon. Pyformance is a simple toolset that offers a way to capture performance measurements and statistics. The documentation has useful sample code to use with the reports.
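A minimal sketch of its registry-based usage, assuming the MetricsRegistry interface shown in the project's documentation, looks roughly like this (the metric names are ours):

from pyformance import MetricsRegistry

registry = MetricsRegistry()
requests_counter = registry.counter("requests")      # hypothetical metric names
request_timer = registry.timer("request_latency")

def handle_request():
    requests_counter.inc()
    with request_timer.time():
        pass   # the real work would go here

handle_request()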

Snap

Snap is an API-based toolkit which can be integrated with servers. It is a telemetry agent, not a full-blown analytics application. One of Snap's benefits is excellent security configuration, including SSL API encryption and encryption of payloads between components. Snap is built around three types of plugins that are pretty intuitive to set up: collector plugins gather telemetry data, processor plugins convert data for reuse and storage, and publisher plugins ship that data onward. It integrates with a number of common systems such as Facter and OpenStack, and Grafana can also be fed using Snap.

Dropwizard

Dropwizard is a well-supported Java framework (or set of libraries depending on your view) with very good documentation. It has simple, out of the box support for metric collection and logging. It works at phenomenal speed and has a lot of community support. The Dropwizard metrics library works with Graphite and a number of reporters such as Splunk and DataDog. A slight downside is that errors are handled as plain text and not as JSON.

cAdvisor


cAdvisor is a visualization tool with native support for Docker and, potentially, any other container runtime. The raw data can be exposed via a RESTful API, and it is able to produce graphs of server performance and resource consumption. It is best used with a single Docker host. It doesn't yet have any robust alerting, but there is a roadmap item to advise on container performance. A common integration is to hook cAdvisor up to Prometheus; CenturyLink wrote an excellent tutorial about that.
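The REST API is easy to poke at from Python. The sketch below assumes cAdvisor is running on its usual port 8080 and uses the v1.3 endpoints; field names can differ slightly between versions, so it only prints what it finds:

import requests

CADVISOR = "http://localhost:8080"   # assumed port from the standard docker run example

machine = requests.get(CADVISOR + "/api/v1.3/machine").json()
print("cores:", machine.get("num_cores"))

containers = requests.get(CADVISOR + "/api/v1.3/docker/").json()
for name, info in containers.items():
    print(name, "->", len(info.get("stats", [])), "recent samples")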

OpenTracing

OpenTracing supports a number of common programming languages and is a standard for application instrumentation and simplified tracing. OpenTracing allows you to create a 'trace' across the application layer by instrumenting your applications using the OpenTracing API.
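In Python, instrumentation with the opentracing package looks roughly like the sketch below; by default the package ships a no-op tracer, and a concrete tracer (Jaeger, Zipkin, and so on) is plugged in at startup without changing the instrumented code. The operation and tag names are our own examples:

import opentracing

tracer = opentracing.tracer   # no-op tracer unless a real one is installed

with tracer.start_span("fetch-user") as span:
    span.set_tag("user.id", 42)                      # hypothetical tag
    with tracer.start_span("query-db", child_of=span) as child:
        child.log_kv({"event": "query", "statement": "SELECT ..."})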

Application performance tracing is a much more complex topic, and I will elaborate on it in a future post.