Testing Your Monitoring Alerts

As a baseline to work from, pages and alarms should only sound for events that are urgent, important, and actionable. By verifying the importance of each alert, you can eliminate alert fatigue and ensure that every page is quickly investigated, acted upon, or fine-tuned. The result is increased uptime and, ultimately, a happier on-call team.

It’s All About Timing and Optimization: 5 Tips to Alert Success

An effective monitoring, metrics, and alerting system is one of the fundamental tools of an efficient DevOps operation. When you are working with small, iterative, and often rapid releases to production, alerting on problems becomes a key requirement for maintaining the production environment. Alert systems are the heartbeat of the entire operation; without them, downtime will persist. Designing an alert system to be optimal, with minimal false positives, is the key to that effectiveness.

Here are my top 5 tips for ‘effective alerts by design’ (EAbD):

#1 The mindfulness of alerts: When an alert is pushed out to an email system or a third-party platform, it can end up being missed. Instead of just passing the alert off, keeping it within your workflow will ensure it stays visible and follows a valid lifecycle.

#2 Defcon 5 – Keeping it subcritical: Managing alerts gives you the control you need to focus on the important ones instead of chasing unicorns. Write sub-critical rules for your system. These will be specific to your production environment; an example might be that your database is approaching capacity. The rules can also be prioritized, with alerts sent based on that priority, which means you don’t have to react to sub-critical events at 2 am.

#3 Paging on the symptom, not the cause: Keeping alerts consistent, even when the underlying architecture changes, becomes possible when you page on symptoms rather than causes. It is easier to capture problems using user-facing symptoms or other dependable services.

#4 Keep it Simple Simon (KISS): Create scope-aware alerts. These let you combine variables so that, instead of two or more alerts for one object, you get a single alert. For example, if disk usage alerts are split into forecast and current-usage levels, combine the two into a single alert (see the sketch after these tips).

#5 Putting false alarms to good use: When you do get false alarms, put them to work by using them as a basis for tightening up the alert condition or removing it from the paging list.
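
To make tips #2 and #4 a little more concrete, here is a minimal Python sketch of what prioritized, scope-aware rules might look like. The metric names, thresholds, and notification stubs are hypothetical placeholders rather than a reference to any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    severity: str                      # "critical" pages immediately; "sub-critical" waits
    condition: Callable[[Dict], bool]  # evaluated against a snapshot of metrics

# Hypothetical rules. Note the scope-aware disk rule: current usage and the
# growth forecast are combined into one alert instead of firing twice.
RULES: List[Rule] = [
    Rule("disk-pressure", "sub-critical",
         lambda m: m["disk_used_pct"] > 85 or m["disk_full_forecast_hours"] < 24),
    Rule("checkout-errors", "critical",
         lambda m: m["checkout_error_rate"] > 0.05),
]

def page_on_call(name: str) -> None:
    print(f"PAGE  : {name}")           # stand-in for your paging integration

def queue_for_business_hours(name: str) -> None:
    print(f"QUEUE : {name}")           # reviewed at 9 am, not at 2 am

def evaluate(metrics: Dict) -> None:
    for rule in RULES:
        if rule.condition(metrics):
            (page_on_call if rule.severity == "critical" else queue_for_business_hours)(rule.name)

evaluate({"disk_used_pct": 88, "disk_full_forecast_hours": 40, "checkout_error_rate": 0.01})
```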

When designing your alert system, using an expressive rule language rather than simple object/value UI widgets is key; it provides more flexibility and reduces errors.

Extending the Structure of Alerts

The art (and science) of alerts extends to their structure too. Human-curated alert rules should be the baseline on which your structure depends. Designing the structure of an alert comes down to some basic prerequisites, including:

  • Natural grouping of environmental components

  • The correct aggregation

  • The correct information attached to the alert to give the most detail

  • Combining metrics to simplify alerts, whilst maintaining maximum detail

  • Use Boolean conditions such as negative events (look for things that are NOT happening that might lead to a problem)

  • Avoid using fixed alerts – give yourself flexibility and build in historical context to enable predictive analysis (see the sketch below)
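
To illustrate the last two points, here is a minimal Python sketch: one rule alerts on something that is NOT happening (a nightly backup that has stopped reporting), and another uses recent history as its baseline instead of a fixed threshold. The metric values and the 26-hour window are hypothetical.

```python
import time
from statistics import mean, stdev

# Hypothetical in-memory view of recent observations; in practice this would
# come from your metrics store.
last_backup_success = time.time() - 30 * 3600               # last successful backup, epoch seconds
recent_request_rates = [120, 131, 118, 125, 122, 129, 400]  # requests/sec samples

def backup_missing(max_age_hours: float = 26) -> bool:
    """Negative event: alert on something that is NOT happening."""
    return (time.time() - last_backup_success) > max_age_hours * 3600

def rate_anomalous(samples, sigmas: float = 3.0) -> bool:
    """Historical context instead of a fixed threshold."""
    baseline, spread = mean(samples[:-1]), stdev(samples[:-1])
    return abs(samples[-1] - baseline) > sigmas * max(spread, 1.0)

if backup_missing():
    print("ALERT: nightly backup has not completed in over 26 hours")
if rate_anomalous(recent_request_rates):
    print("ALERT: request rate deviates sharply from its recent baseline")
```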

Fine Tuning Alerts

Like all good ideas, your alert system needs to be tested. Simulation is a good place to start: create simulation rules based on previous events. The goal is to reduce the noise, because reducing noise makes it more likely that you produce relevant alerts. Simulation and noise reduction are not one-off events; you need to keep carrying out these exercises, fine-tuning your alerts until you have the most meaningful ones. And of course, reviews should be periodic as environments change. I also suggest making it a weekly habit (before the weekend starts) to review what the false-positive ratio was during the previous week. Spending an hour on tuning before the weekend can save you and your team a great headache during the on-call weekend shift.

Similarly, paging events should be reviewed, including those ignored by administrators – this data can help you refine rules to prevent false positives.
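
As a minimal sketch of that weekly review, the snippet below computes a per-rule false-positive ratio from an export of last week's paging events. The event format and the "actionable" flag are assumptions about how your paging history is recorded; the 10% threshold echoes tip #1 below.

```python
from collections import defaultdict

# Hypothetical export of last week's paging events: (rule_name, was_actionable)
pages = [
    ("db-disk-pressure", False),
    ("db-disk-pressure", False),
    ("checkout-errors", True),
    ("db-disk-pressure", True),
    ("ssl-expiry", False),
]

counts = defaultdict(lambda: [0, 0])          # rule -> [false positives, total]
for rule, actionable in pages:
    counts[rule][1] += 1
    if not actionable:
        counts[rule][0] += 1

for rule, (false_pos, total) in sorted(counts.items()):
    ratio = false_pos / total
    flag = "REVIEW" if ratio > 0.10 else "ok"  # 10% or less is acceptable (tip #1 below)
    print(f"{rule:20s} {false_pos}/{total} false positives ({ratio:.0%}) -> {flag}")
```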

Fine Tuning Top Tips

Here are some tips for fine tuning your alerts so you can make sure they’re spot on, and as effective as possible.

#1 A rule of rules: Alerts that are less than 50% accurate are broken; rules with a false-positive rate of 10% or less are good to go.

#2 A page too far: Get rid of extraneous paging events. If a page has fired and, when investigated, shows nothing wrong, adjust the rule.

#3 The rise of the machine: Machine learning is perfectly placed to optimize alerts. Use human-curated rules, enhanced with machine learning algorithms, to create rules and fine-tune alerts.

#4 Repeat business: Take regular events, such as backups, into account when fine-tuning rules. If you have known maintenance going on, suppress the alerts associated with it (see the sketch after these tips).

#5 Keeping control: Set metrics for the on-call team and cap pages at a set amount, reviewing regularly and differentiating between the events your systems generate and the pages those events trigger.
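
Here is a minimal sketch of tip #4: silence alerts from hosts that fall inside a known maintenance window instead of paging on them. The window definitions and tagging scheme are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical maintenance windows: (tag, start, end) in UTC.
MAINTENANCE = [
    ("backup-fleet", datetime(2016, 12, 30, 1, 0, tzinfo=timezone.utc),
                     datetime(2016, 12, 30, 3, 0, tzinfo=timezone.utc)),
]

def suppressed(alert_tags, fired_at: datetime) -> bool:
    """Return True if the alert should be silenced rather than paged."""
    return any(tag in alert_tags and start <= fired_at <= end
               for tag, start, end in MAINTENANCE)

fired = datetime(2016, 12, 30, 2, 15, tzinfo=timezone.utc)
print(suppressed({"backup-fleet", "disk-io"}, fired))   # True: inside the backup window
```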

Why It Pays to Structure Alerts Properly

The art of creating effective alert systems is down to using an intelligent approach. Keeping things simple, combining variables, and dampening down noise, coupled with prudent and mindful testing, will naturally result in improved alerts. Adding into that mix machine learning based on human curation will allow you to develop an optimized alert system that works for you, rather than against you.

In my next post I will look at improving alert response times. Response times are part of the critical path to a robust production system and are one of the key factors in an efficient support operation. Slow response times result in downtime, which ends up in lost time, money, and often customers too.

Preparation is Key: Plan for Leap Second and Prevent Downtime This New Year’s Eve

2016 is a special year because it is one second longer than last year. This is because our blue planet’s rotation is slowing down, just a little – the day now runs roughly 0.002 seconds longer than a clock day. When you add this small difference up over several years, this seemingly insignificant slowing becomes significant.

The International Earth Rotation and Reference Systems Service (IERS) is the body tasked with making sure that the time on our clocks matches up with the actual time taken for the earth to rotate. To make up for the difference between earth clock time and earth rotation time, IERS occasionally adds a ‘leap second’ to the year on June 30 or December 31, at 23:59:59 Coordinated Universal Time (UTC). Since this practice began in 1972, 26 leap seconds have been added to our clock time to synchronize with the earth’s rotation.

For most people, adding a second to our clocks doesn’t matter much. This single second doesn’t alter the countdown on New Year’s Eve, and it doesn’t make a difference in the general course of the day. But for the tech industry, time is everything — and every second counts.

Accurate time-keeping is important; for example, it is necessary to maintain accurate records of commands, searches, and clicks. Furthermore, tracing the timelines of operations, second by second, can make a big difference when handling data. How this is achieved varies between providers. Some insert an additional second into the clock, so that it reads 23:59:60 before rolling over to 00:00:00; others step the clock backward, so that 23:59:59 appears twice. Some ‘smear’ time, spreading the additional second across several hours of the day.

Regardless of the method, it’s easy to see why this can cause a problem if operating systems are not equipped to handle a minute with 61 seconds. The same is true if the systems have not been synchronized to apply the same method consistently. If leap second handling is not implemented correctly, many operations that occur in that extra second could go wrong.
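
As a small illustration of how easily a 61-second minute trips up software, Python's standard datetime type cannot represent second 60 at all. The sketch below shows one common, pragmatic workaround – folding the leap second into the last microsecond of :59 – which is an illustrative choice, not a universal standard.

```python
from datetime import datetime

def make_timestamp(y, mo, d, h, mi, s):
    """Build a datetime, tolerating a leap second by folding :60 into :59.999999."""
    if s == 60:                                   # datetime cannot represent second 60
        return datetime(y, mo, d, h, mi, 59, 999999)
    return datetime(y, mo, d, h, mi, s)

try:
    datetime(2016, 12, 31, 23, 59, 60)            # the constructor rejects second 60
except ValueError as exc:
    print("stdlib refuses the leap second:", exc)

print("clamped instead:", make_timestamp(2016, 12, 31, 23, 59, 60))
```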

WHY LEAP SECONDS MATTER

Making sure that systems are synced with UTC is vital for proper system functioning and for thwarting bugs. Ensuring leap second synchronization can be a challenge, and there have been many occasions where improperly synchronized systems have resulted in disruption and downtime.

The leap year, where we have a whole additional day, is comparable to the leap second, but on a larger scale. In a leap year the extra day, February 29, catches some systems off guard and causes issues. Programming for a leap year isn’t as simple as adding the facility to manage that extra day. Systems can still trip up on December 31, which becomes the 366th day of the year, because improperly coded systems understand a year as having only 365 days.

But whether it’s an additional day or an additional second, the problem is largely the same. Programs must be equipped to maintain their functions even when time changes, reconfiguring themselves not only for that particular day or second but for the surrounding context too.

Below are several examples of outages and malfunctions that have occurred over the last few years because of a system’s inability to properly sync with UTC time:

2012 Microsoft Azure outage. In 2012, a leap year, Microsoft Azure suffered a major outage lasting several hours. The root of the problem was a software bug in Azure’s handling of security certificates. Microsoft Azure employs ‘guest agents’ (GAs) that integrate its platform with applications that run in VMs. Each GA creates a transfer certificate within a server that is valid for one year from the date of its creation. Certificates created on February 29, 2012, had their valid-to dates set to February 29, 2013; however, as 2013 was not a leap year, there would be no February 29. This caused the certificate creation process to fail after several attempts, which in turn led the server’s ‘host agent’ to assume a hardware problem, automatically flag the server as faulty, and move it to a state called ‘Human Investigate’. In the meantime, service healing automatically reincarnated the downed VMs on other servers; however, the move recreated the failed certificate problem on the new servers too. This ultimately led to a cascade of servers going down for several hours. It’s easy to imagine a similar situation occurring from a leap second, whether at Microsoft or elsewhere.

2012 TomTom GPS bug. TomTom had a similar leap year bug, which caused malfunctions in some of its GPS navigation devices: a bug in the GPS firmware left TomTom devices unable to determine their locations. It’s interesting to note that the GPS system does not itself count leap seconds (it broadcasts an offset from UTC instead), which can lead to similar problems if not properly addressed.

2010 Sony PlayStation outage. PlayStation’s internal clocks mistakenly recognized 2010 as a leap year, resulting in the clocks being out of sync with real time. This resulted in error messages for users.

2008 Zune outage. Zune’s software became stuck in a faulty loop on December 31, 2008. The year was a leap year, and while the firmware had accounted for February 29, the code was unable to recognize the 366th day of the year, resulting in devices locking up en masse.

2008 Microsoft Exchange outage. A system crash affected users who tried to restart the System Attendant between 00:00 UTC on February 29, 2008, and 00:00 UTC on March 1, 2008. Because the software was not configured to account for the leap day, a reporting error caused the outage.
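
The Zune failure above is the classic shape of this bug: a day-count-to-year conversion that never considers the 366th day of a leap year. Below is a hedged Python sketch of the correct conversion, with the pitfall noted in the docstring; it is an illustration of the pattern, not the actual firmware code.

```python
def is_leap_year(year: int) -> bool:
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def year_from_days(days_since_origin: int, origin_year: int = 1980):
    """Convert a day count into (year, day_of_year).

    The pitfall: in a leap year, the 366th day is a valid day of the current
    year. Code that only checks for day counts greater than 366 before rolling
    the year over makes no progress when days == 366 -- exactly the kind of
    loop reported to hang on December 31 of a leap year.
    """
    year, days = origin_year, days_since_origin
    while days > (366 if is_leap_year(year) else 365):
        days -= 366 if is_leap_year(year) else 365
        year += 1
    return year, days

print(year_from_days(366))   # day 366 of 1980 -> (1980, 366), no hang
print(year_from_days(367))   # -> (1981, 1)
```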

POTENTIAL SOLUTIONS

Luckily, most services have understood the problem of the leap year and the leap second, and have developed ways to mitigate any issues that might arise. Some of the potential solutions are explained below:

Time smearing. This solution avoids the choice, described above, between inserting 23:59:60 and repeating 23:59:59. Instead, the extra second is “smeared” across a longer period of time by almost imperceptibly lengthening seconds throughout the smear window. This way, clocks and network systems operate under the assumption that there are still only 86,400 seconds in a day, and are oblivious to the fact that anything is different. Google is one of the companies that employ this method across all of its services and APIs. Google’s smear period will start at 14:00:00 UTC on December 31 and end at 10:00:00 UTC on January 1. Each second during that period will be 13.9 μs longer than a standard second. The smear continues even after the leap second is inserted, in order to offset the slight discrepancy that accumulates during that time. By 10:00:00, smeared time will have realigned itself with UTC.
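
For intuition, the sketch below models how far a linearly smeared clock drifts from a clock that simply ignores the leap second over a 20-hour window like the one described above. It is a simplified illustration, not Google's implementation.

```python
from datetime import datetime, timedelta, timezone

SMEAR_START = datetime(2016, 12, 31, 14, 0, tzinfo=timezone.utc)
SMEAR_END   = datetime(2017, 1, 1, 10, 0, tzinfo=timezone.utc)   # 20-hour window
WINDOW = (SMEAR_END - SMEAR_START).total_seconds()               # 72,000 s

def smear_offset(t: datetime) -> float:
    """Seconds by which a linearly smeared clock lags a leap-ignoring clock at t.

    1 s spread over 72,000 s is roughly 13.9 microseconds per second; the
    offset reaches exactly 1.0 s by the end of the window.
    """
    elapsed = (t - SMEAR_START).total_seconds()
    return min(max(elapsed / WINDOW, 0.0), 1.0)

for hours in (0, 5, 10, 20):
    t = SMEAR_START + timedelta(hours=hours)
    print(f"{t:%Y-%m-%d %H:%M} UTC  offset = {smear_offset(t)*1000:7.2f} ms")
```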

NTP servers. Using network time protocol (NTP) servers allows system clocks to sync with UTC. Google provides these services for anyone who, as they state, “needs to keep local clocks in sync with VM instances running on Google Compute Engine, to match the time used by Google APIs, or for those who just need a reliable time service.”

However, this comes with a number of caveats. For instance, Google recommends that all Compute Engine VMs use their NTP servers only, because other NTP servers may be unpredictable when it comes to handling the leap second. Moreover, Google warns against using a combination of its own NTP service and an external one, which could cause significant and unexpected problems with time recording. This, again, is largely due to the fact that there are several different methods to handle the leap second, and not all services may smear time like Google does.
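
If you want to spot-check what a host actually sees, querying an NTP server from Python is one quick option. The sketch below assumes the third-party ntplib package is installed (pip install ntplib); it is a sanity check, not a substitute for a properly configured NTP daemon.

```python
import ntplib                     # third-party: pip install ntplib
from time import ctime

client = ntplib.NTPClient()
response = client.request("time.google.com", version=3)

print("server time :", ctime(response.tx_time))
print("clock offset:", f"{response.offset:+.6f} s")   # estimated offset between this host and the server
```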

Backward jump. In Linux, some kernels may use a backward jump to set the clock back by one second. This results in the clock reading 23:59:59, then 00:00:00, then 23:59:59 again. This is, for instance, implemented by Red Hat, the Amazon Linux AMI, and a number of Amazon Web Services, including CloudSearch clusters, EC2 Container Service instances, EMR Clusters, RDS instances, and Redshift instances. (Other AWS resources may have their own clocks and may only partially be managed by AWS.)

Ignoring the leap second. Some services – such as Microsoft Azure, which relies on the Windows Time service – do not handle the leap second. Windows Time syncs with UTC but does not acknowledge the leap second, meaning that systems running on Microsoft Azure are one second ahead of UTC after the leap second occurs. Because Windows Time syncs regularly with UTC, the discrepancy is corrected at the next synchronization.

THE PROBLEM WITH THESE SOLUTIONS

Although these are all viable solutions for addressing the leap second, the problem is that there isn’t a single universal solution. Even providers who use time smearing, for instance, don’t all use the same smear: UTC-SLS uses a linear smear over 1,000 seconds before the leap; Google used a 20-hour cosine smear in 2008; Bloomberg uses a linear smear for 2,000 seconds after the leap; and Amazon, Microsoft, and Akamai all use 24-hour smears.

Therefore, companies that run their infrastructure across multiple cloud providers, for example, might encounter real issues when the systems they rely on are not all operating on the same time. Companies in this situation must pay close attention to the implementations their providers are using, and ensure they are compatible.

Google plans to adopt a proposed universal 24-hour smear for the next leap second (after this year’s) which will help keep changes in sync across systems and make the changes to each second even smaller. But for now, there remain plenty of discrepancies from provider to provider.

RECOMMENDATIONS

In order to avoid the sorts of failures and incompatibilities described above, I recommend taking the following steps in advance:

  1. Configure all of your network settings to use NTP with Google’s public NTP servers, across all of your cloud providers and local data centers; these servers will enact a 20-hour time smear to account for the leap second. Note: if you are using Google Cloud, this is already done for you, including your Kubernetes clusters. However, if you have hardcoded your NTP server, make sure to bring it back to Google’s default.

  2. For internal services, we recommend applying operating system patches that match either the Google smear algorithm or the backward jump (depending on your infrastructure requirements).

  3. If you are only using a single cloud provider such as AWS, you should not point at Google’s NTP servers. The reason is that AWS managed services such as RDS, ElastiCache, and the EC2 Linux AMI, among others, implement the backward-jump fix, which is not compatible with Google’s smearing algorithm.

  4. Make sure your On-Call person/team for the night between December 31 and January 1 is well prepared, and all escalation channels are well aware of the leap second situation. This coincides with the worst night for potential production issues, as often team members will be celebrating the New Year with limited access to their laptops.

It is, however, important to understand that this is not a huge problem if your teams are well prepared in advance. Now is the time to update your production settings and images, and to raise awareness among your teams. We all want to celebrate the end of 2016, so let’s not wait until the last second – it may be way too late this time.

Alert Fatigue - The Pain of False Positives

The consequence of letting alert levels spin out of control

There is a great case of alert fatigue documented by an institution in a completely different field. In 2013, the Boston Medical Center was experiencing a higher level of deaths due to mistakes in the medical processes followed during hospital stays and visits. Its investigation traced a number of those deaths to the desensitization of nurses to alerts. Hospitals use monitoring and alert systems similar to those in technical operations environments, and in a hospital environment nurses are the equivalent of our NOCs or SREs, depending on their level of experience. The nurses’ stations were constantly bombarded by alerts, many of them duplicates or of low urgency. Naturally, with little ability to sort a very high volume of alerts by importance or to eliminate unnecessary ones, nurses started to “suppress” most of them. Important alerts related to true medical emergencies were lost in the mix and ignored, with dire consequences. The only way to resolve this problem was to put in place a system of intelligent alert management, one able to focus only on what mattered; once it was in place, the medical issues went away. We believe that this case and the lessons learned from the fix are directly relevant to the world of technical operations.

How did we get so many alerts?

Alerts are an essential part of the site reliability engineer’s toolkit. Implemented effectively, they directly reduce the mean-time-to-remediation (MTTR) of production issues. However, too many alerts can overwhelm SREs and have the opposite effect: they start functioning as red herrings and become more of a distraction.

Alerts spike upon the implementation of any monitoring application. When a team installs, or develops internally, an application performance management system, it will typically set a number of alerts throughout that system in order to react faster to specific changes in the environment. The same process is repeated for the networking, infrastructure, security, database, ISP, CDN, cloud, user, and revenue-metrics monitoring tools. Each monitoring application will be configured individually to generate alerts focused on the part of the system it tracks. Typically, alerts will be repeated every few minutes until acknowledged. That’s a lot of alerts in and of itself, and they all relate to the same system. What happens when a problem in one part of the system reverberates through other parts, creating a chain reaction? All of a sudden, alerts related to the same issue are fired by all the individual tools, each monitoring only a portion of the system, creating a lot of noise and effectively masking the real cause of the issue.

Here is an example: let’s say you have a cluster running on AWS or Google Cloud as part of an autoscaling group. The application sits behind a load balancer and receives external HTTP requests. Suddenly, the application receives much more traffic and the load rises. Assuming you have set your scaling policy correctly, the application will scale and no issues will come of it. However, what if, due to human error, you didn’t set the scaling policy correctly? What will happen?

  1. You will start getting Low Apdex alerts from your APM system  

  2. You will start getting CPU/Memory alerts from your hosts

  3. You will receive CloudWatch alerts from the load balancer with lots of 4xx responses

  4. If you have set up an external health check on your app, such as a Pingdom check, you will receive alerts from that system about increased load times or failed responses

  5. Your application will write many more error logs as it fails to keep up with the load and the error rate climbs – this means your disk space is going to shrink, so another alert will come from hosts running low on disk space

  6. Since your system will write many more logs, you could also receive disk I/O alerts

  7. If you have custom health checks or custom monitoring checks, those will fail since the application is no longer responding – so even more alerts, this time on custom check failures

  8. If the application is connected to a database or cache layer, and that layer is not scalable either, pushing more requests without any circuit breakers will push the load onto the database layer. In that case, yet more alerts will arrive about slow queries or high load.

All of that alert activity is generated not because of a load issue, but because someone didn’t configure the scaling policy correctly…
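
One practical mitigation is to collapse that storm into a single incident before anyone gets paged. The sketch below groups alerts that fire for the same service within a short window; the feed format, window size, and grouping key are simplifying assumptions, and real correlation logic would be considerably richer.

```python
from datetime import datetime, timedelta

# Hypothetical alert feed from the cascade above: (fired_at, source, service)
alerts = [
    (datetime(2016, 12, 28, 2, 0), "apm/apdex",       "web-frontend"),
    (datetime(2016, 12, 28, 2, 1), "host/cpu",        "web-frontend"),
    (datetime(2016, 12, 28, 2, 1), "elb/4xx",         "web-frontend"),
    (datetime(2016, 12, 28, 2, 3), "pingdom/latency", "web-frontend"),
    (datetime(2016, 12, 28, 2, 4), "host/disk-space", "web-frontend"),
    (datetime(2016, 12, 28, 9, 0), "db/slow-queries", "reporting"),
]

WINDOW = timedelta(minutes=10)

def group_into_incidents(alerts):
    """Collapse alerts for the same service that fire close together in time."""
    incidents = []
    for fired_at, source, service in sorted(alerts):
        for incident in incidents:
            if incident["service"] == service and fired_at - incident["last"] <= WINDOW:
                incident["sources"].append(source)
                incident["last"] = fired_at
                break
        else:
            incidents.append({"service": service, "last": fired_at, "sources": [source]})
    return incidents

for inc in group_into_incidents(alerts):
    print(f"{inc['service']}: 1 incident covering {len(inc['sources'])} alerts -> {inc['sources']}")
```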

Alerts also have a tendency to creep in over time. A one-time production issue may lead someone to design an alert specifically to prevent that issue from happening again; the alert is based on a specific threshold or context that makes sense given the issue that just happened. After a while, though, the specific issue never comes back – it was a one-time event – but the alert is still there, flagging conditions that are perfectly normal. It has become a false positive, creating confusion because nobody remembers what it is supposed to flag.

Impact of alert overload

Fortunately, alerts will not be the cause of deaths as in the case of the Boston Medical Center. But they do have a very real, direct economic impact.

Too much noise increases the complexity of maintaining the system: MTTR typically rises because of the difficulty of finding the root cause of issues, and the mean-time-to-failure (MTTF) falls because teams spend more time in reactive mode than on proactive work. Those two metrics directly affect system availability and performance, and for every minute of downtime there is typically a direct loss of revenue.

Alert fatigue is also a very real thing. In a future post we will discuss the human impact on teams’ morale and stress, and by extension the impact on the company.

How to deal with the issue of alert overload?

Having an intelligent approach to the design of an alert system is the first step towards getting false positives back under control. Different machine learning classification techniques can help with this problem, but there is no real silver bullet. Many good statistical models and approaches exist, and they all give you a toolset for generating rules based on understanding and learning from your system’s output. At the end of the day, algorithms and approaches such as machine learning and deep learning are important, but they are only tools in a box. Curated by humans, these tools determine outcomes and can be used to train the system to generate a more accurate picture. One of the key things a machine-intelligence-based alert system does is precisely assess and classify events: flagging events as critical, but also having the flexibility to handle non-critical events that, if dealt with, can be prevented from becoming critical. All of this, of course, needs to happen within a superb UX that gives readily digestible visualizations of events and also improves response times through well-composed alert notifications. An important aspect of machine-intelligence-generated alerts is that they are flexible and can continue to be tweaked for even more precision, which matters because production environments are not static entities.
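
As a toy illustration of the classification idea, the sketch below trains a scikit-learn model to estimate whether a new alert is likely to be actionable, based on a few hand-picked features. The features, labels, and model choice are illustrative assumptions; in practice they would come from your own alert history and review process.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data derived from past alert reviews.
# Features: [fired_during_business_hours, similar_alerts_in_last_hour, duration_minutes]
X = [
    [1, 0,  2], [0, 7,  1], [1, 1, 45], [0, 0, 30],
    [1, 9,  1], [0, 2, 60], [1, 0,  5], [0, 8,  2],
]
y = [1, 0, 1, 1, 0, 1, 1, 0]    # 1 = the on-call engineer had to act, 0 = noise

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

new_alert = [[0, 6, 2]]         # 2 am, six similar alerts in the last hour, short-lived
print("probability actionable:", model.predict_proba(new_alert)[0][1])
```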

Machine-intelligence-driven smart alerting increases your ability to reduce the noise. Reduced noise allows people to focus on the root cause, saving time and creating much greater work satisfaction and control – not just in DevOps but throughout the whole organization.

Fighting Fire with Machine Intelligence

We have come to a juncture in technology development where we need a better way of managing and controlling alert systems. Fault management, the identification of and response to abnormal conditions, is a major component of the human’s role as supervisory controller of dynamic systems. A machine intelligence toolset is a leap forward in alert system design and configuration: it is a way to truly dampen down the noise while prioritizing the important alerts. Without it, as our production environments become ever more distributed and complex, we may find that we are drowning in those false positives.