How To Improve Alert Response Times

Response times are part of the critical path to a robust production system and are one of the key factors in an efficient support operation. Slow response times result in downtime, which ends up costing time, money, and often customers too.

There are a number of methods that, when combined, help you improve response times. Here we’ll take a look at some top tips for doing so through an intelligent approach to notifications.

Writing Effective Alerts – Things to Consider

Our first method for reducing response times is to write well-crafted alerts. If your alert notifications give your support team a clear understanding of an issue, the team can prioritize the important alerts and focus on them. Here are my top tips for writing alert notifications that your teams will respond to (a minimal template sketch follows the tips):

Tip #1 Clear and present danger: When creating your alert notification template, write a very clear description so that your team, even bleary-eyed in the middle of the night, can understand the importance level.

Tip #2 In summary…: Set up a focused summary of the possible causes behind the rule that fired. It needs to be precise and brief – just enough to let the reader skim it and spot whether the cause is already a known issue.

Tip #3 Do the work for them: If you know that certain alerts will require certain information to resolve the issue, add that information into the body of the alert for reference. Links to your favorite knowledge management system or runbook will be highly appreciated by the on-call person.

Tip #4 No such thing as too much info: Add additional information, such as internal wiki links and responses from related tickets, that is pertinent to the notification.
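
To make these tips concrete, here is a minimal sketch of what a well-structured alert payload might look like. The field names, example URLs, and rendering format are assumptions for illustration, not any particular tool’s schema.

    # A minimal, illustrative alert payload (field names are assumptions, not a
    # specific tool's schema).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AlertNotification:
        summary: str                  # Tip #2: brief, skimmable summary of probable causes
        description: str              # Tip #1: plain-language description of impact and severity
        severity: str                 # e.g. "critical", "warning", "info"
        runbook_url: str = ""         # Tip #3: link to the runbook or knowledge base article
        related_links: List[str] = field(default_factory=list)  # Tip #4: wiki pages, past tickets

        def render(self) -> str:
            """Render the notification body the on-call engineer will actually read."""
            lines = [f"[{self.severity.upper()}] {self.summary}", self.description]
            if self.runbook_url:
                lines.append(f"Runbook: {self.runbook_url}")
            lines.extend(f"See also: {link}" for link in self.related_links)
            return "\n".join(lines)

    alert = AlertNotification(
        summary="Checkout API error rate above 5% for 10 minutes",
        description="Customers cannot complete purchases; revenue-impacting.",
        severity="critical",
        runbook_url="https://wiki.example.com/runbooks/checkout-errors",
        related_links=["https://wiki.example.com/incidents/checkout-errors-postmortem"],
    )
    print(alert.render())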

How to Use Policies to Improve Alert Response Times

Preparing notifications with pre-completed information is one way to improve response times. Another is to apply policy settings to notification generation. Here are the top four tips for the types of policies that help cut response times (a short policy sketch follows the tips):

Tip #1 Be negative: Allow your teams to negatively acknowledge alerts. This creates a domino effect, moving an alert that the current responder cannot handle on to the next person.

Tip #2 Be a thoughtful fighter: The fighter pilot John Boyd developed a framework for decision-making known as OODA, or ‘observe, orient, decide, and act’. It is a process-based way of thinking through a problem and can be applied effectively by your on-call engineers.

Tip #3 Aggregation for information: Create incident threads from related incoming incidents by aggregating them into a single thread. It makes notifications clearer and easier to relate to one another.

Tip #4 Silence of the logs: Silence redundant alarms and log them.
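
As a rough illustration of Tips #1 and #3, the sketch below shows how a negative acknowledgment might move an alert to the next responder, and how related incidents might be grouped into a single thread. The escalation order, field names, and grouping key are assumptions for illustration, not a specific product’s API.

    # Illustrative policy helpers: negative acknowledgment and incident aggregation.
    # The data model, escalation order, and grouping key are assumptions for this sketch.
    from collections import defaultdict

    ESCALATION_ORDER = ["primary-oncall", "secondary-oncall", "team-lead"]

    def negatively_acknowledge(alert: dict) -> dict:
        """Tip #1: a 'not me / not now' response moves the alert to the next responder."""
        idx = ESCALATION_ORDER.index(alert.get("assignee", ESCALATION_ORDER[0]))
        alert["assignee"] = ESCALATION_ORDER[min(idx + 1, len(ESCALATION_ORDER) - 1)]
        return alert

    def aggregate_into_threads(incidents: list) -> dict:
        """Tip #3: group related incidents into one thread, keyed by service and symptom."""
        threads = defaultdict(list)
        for incident in incidents:
            threads[(incident["service"], incident["symptom"])].append(incident)
        return threads

    print(negatively_acknowledge({"id": 1, "assignee": "primary-oncall"})["assignee"])  # secondary-oncall
    incidents = [
        {"service": "db", "symptom": "high latency", "host": "db-1"},
        {"service": "db", "symptom": "high latency", "host": "db-2"},
        {"service": "web", "symptom": "5xx errors", "host": "web-1"},
    ]
    for key, members in aggregate_into_threads(incidents).items():
        print(key, "->", len(members), "related incident(s) in one thread")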

Protocols and Response Time Optimization

Optimizing alert notifications with clear content and sound policies can be augmented by choosing the right channel protocol through which to deliver those alerts. The channels used for communicating alerts vary in their immediacy. The channel protocol choices, in order of responsiveness (and annoyance), are:

  • Dashboard

  • Email

  • Chat (Slack/HipChat etc.)

  • SMS/Page

  • Phone call

Which protocol you choose during configuration will impact the response time. Good testing helps you balance getting the right level of response, and response time, against the burden and annoyance of a false alarm. Test against the history of the alert and start off with a more passive channel for communicating that alert – if it ‘behaves’ and doesn’t generate false positives, it can move up the protocol ladder to the next channel level.
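
That ‘protocol ladder’ can be expressed as a simple promotion rule: keep an alert on a quieter channel until its history shows it can be trusted. The thresholds and channel names below are assumptions for illustration and should be tuned to your environment.

    # Illustrative channel promotion based on an alert's false-positive history.
    # Thresholds and channel order are assumptions; tune them to your environment.
    CHANNELS = ["dashboard", "email", "chat", "sms", "phone"]

    def choose_channel(current: str, total_fired: int, false_positives: int) -> str:
        """Promote a 'well-behaved' alert one rung up the ladder; demote a noisy one."""
        if total_fired == 0:
            return current  # no history yet, stay put
        fp_ratio = false_positives / total_fired
        idx = CHANNELS.index(current)
        if fp_ratio <= 0.10 and total_fired >= 20:   # proven reliable: move up a rung
            idx = min(idx + 1, len(CHANNELS) - 1)
        elif fp_ratio > 0.50:                        # mostly noise: move back down
            idx = max(idx - 1, 0)
        return CHANNELS[idx]

    print(choose_channel("email", total_fired=40, false_positives=2))   # -> "chat"
    print(choose_channel("sms", total_fired=30, false_positives=18))    # -> "chat"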

Alert Taxonomy

Taxonomy is all about classification, and classification is all about making things easier to understand, allowing us to see patterns and shared characteristics.

Alerts can be classified into several areas:

  1. Severity levels

  2. Alert states

  3. Alert notification criticality

  4. A miscellaneous sub-set that classifies unactionable alerts

Having a coherent alert taxonomy gives you a tool for applying policies and protocols, making your overall alert system well designed and effective, with response times that optimize system uptime. It ensures teams respond to the right alert, at the right time.
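
One lightweight way to encode such a taxonomy is as a set of plain enumerations, so that policies and protocols can be attached to classes of alerts rather than to individual rules. The category names mirror the list above; the specific values and the routing function are assumptions for illustration.

    # A minimal alert taxonomy expressed as enumerations (values are illustrative).
    from enum import Enum

    class Severity(Enum):        # 1. Severity levels
        CRITICAL = 1
        WARNING = 2
        INFO = 3

    class AlertState(Enum):      # 2. Alert states
        OPEN = "open"
        ACKNOWLEDGED = "acknowledged"
        RESOLVED = "resolved"

    class Criticality(Enum):     # 3. Alert notification criticality (how loudly to notify)
        PAGE = "page"
        NOTIFY = "notify"
        LOG_ONLY = "log_only"

    class Actionability(Enum):   # 4. Catch-all for unactionable alerts
        ACTIONABLE = "actionable"
        UNACTIONABLE = "unactionable"

    # Policies can then key off classes rather than individual rules, for example:
    def notification_channel(severity: Severity, actionability: Actionability) -> str:
        if actionability is Actionability.UNACTIONABLE:
            return "log_only"
        return "phone" if severity is Severity.CRITICAL else "chat"

    print(notification_channel(Severity.CRITICAL, Actionability.ACTIONABLE))  # -> "phone"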

Testing Your Monitoring Alerts

As a baseline to work from, pages and alarms should only sound for events that are urgent, important, and actionable. By verifying the level of importance of an alert, you can eliminate alert fatigue and ensure that every page is quickly investigated, acted upon, or fine-tuned. The result is increased uptime and, ultimately, a happier on-call team.
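
That ‘urgent, important, and actionable’ baseline can be applied as a simple gate before anything is allowed to page. A minimal sketch, assuming each event record carries those three flags:

    # Only page when an event is urgent, important, AND actionable (field names assumed).
    def should_page(event: dict) -> bool:
        return all(event.get(flag, False) for flag in ("urgent", "important", "actionable"))

    print(should_page({"urgent": True, "important": True, "actionable": True}))   # True: page someone
    print(should_page({"urgent": True, "important": True, "actionable": False}))  # False: ticket or log it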

It’s All About Timing and Optimization: 5 Tips to Alert Success

An effective monitoring, metrics, and alert system is one of the fundamental tools of an efficient DevOps operation. When you are working with small, iterative, and often rapid releases to production, alerts become a key requirement for maintaining the production environment. Alert systems are the heartbeat of the entire operation; without them, downtime persists. Designing an alert system to be optimal, with minimal false positives, is the key to that effectiveness.

Here are my top 5 tips for ‘effective alerts by design’ (EAbD):

#1 The mindfulness of alerts: When an alert is pushed out to an email system or third-party platform, it can end up being missed. Instead of just passing the alert off, keeping it within workflow control will make sure it stays visible and has a valid lifecycle.

#2 Defcon 5 – Keeping it subcritical: Managing alerts will give you the control you need to focus on the important ones and not go off chasing unicorns. Write sub-critical rules for your system. These will be specific to your production environment; an example may be that your database is close to capacity. The rules can also be prioritized and alerts sent based on that priority. It means you don’t have to react to sub-critical events at 2 am.

#3 Fixing the symptom, not the cause: Keeping alerts consistent, even when the underlying architecture changes, becomes possible if you page on the symptoms rather than the cause. It is easier to capture problems using user-facing symptoms or other dependable services.

#4 Keep it Simple Simon (KISS): Create scope-aware alerts. This will allow you to combine variables, so that instead of two or more alerts for one object, you get a single alert. For example, if alerting on disk usage is split into forecast and current usage levels, combine the two into a single alert (see the sketch below).

#5 Putting false alarms to good use: When you do get false alarms, put them to work by using them as a basis for tightening up the alert condition or removing the alert from the paging list.

When designing your effective alert system, the use of an expressive language, rather than simple object/value UI widgets, is key; it provides more flexibility and reduces errors.
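
To illustrate both Tip #4 and the point about expressive rules, the sketch below combines current disk usage and a naive usage forecast into one scope-aware alert instead of two separate ones. The thresholds and the linear forecast are assumptions for illustration.

    # One scope-aware rule instead of two: current disk usage AND a naive linear forecast.
    # Thresholds and the forecasting approach are illustrative assumptions.
    from typing import List, Optional

    def forecast_usage(history: List[float], hours_ahead: int = 24) -> float:
        """Naive linear extrapolation of hourly usage samples (fraction of disk used)."""
        if len(history) < 2:
            return history[-1]
        hourly_growth = (history[-1] - history[0]) / (len(history) - 1)
        return min(history[-1] + hourly_growth * hours_ahead, 1.0)

    def disk_alert(history: List[float]) -> Optional[str]:
        current = history[-1]
        predicted = forecast_usage(history)
        # A single alert covering both conditions for the same object (Tip #4).
        if current >= 0.90 or predicted >= 0.95:
            return f"Disk pressure: currently at {current:.0%}, forecast {predicted:.0%} within 24h"
        return None

    print(disk_alert([0.70, 0.72, 0.74, 0.76, 0.78]))  # fires on the forecast, not the current level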

Extending the Structure of Alerts

The art (and science) of alerts extends to their structure too. Human-curated alert rules should be the baseline upon which your structure depends. Designing the structure of the alert comes down to some basic prerequisites, including (a short sketch follows this list):

  • Natural grouping of environmental components

  • The correct aggregation

  • The correct information attached to the alert to give the most detail

  • Combining metrics to simplify alerts, whilst maintaining maximum detail

  • Use Boolean conditions such as negative events (look for things that are NOT happening that might lead to a problem)

  • Avoid fixed alert thresholds – give yourself flexibility and build in historical context to enable predictive analysis
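
Two of the points above are sketched below: a Boolean negative-event check (alerting because something has NOT happened) and a threshold derived from recent history rather than a fixed value. The record format, the 3-sigma choice, and the backup gap are assumptions for illustration.

    # Illustrative structural rules: a negative-event check and a history-based threshold.
    import statistics
    import time

    def backup_missing(last_backup_ts: float, max_gap_hours: float = 26.0) -> bool:
        """Negative event: alert when the nightly backup has NOT happened recently."""
        return (time.time() - last_backup_ts) > max_gap_hours * 3600

    def dynamic_threshold(history: list, sigmas: float = 3.0) -> float:
        """Derive the alert threshold from recent history instead of a fixed value."""
        return statistics.mean(history) + sigmas * statistics.stdev(history)

    latency_history_ms = [110, 120, 105, 130, 115, 125, 118]
    threshold = dynamic_threshold(latency_history_ms)
    current_latency = 190
    if current_latency > threshold:
        print(f"Latency {current_latency}ms exceeds dynamic threshold {threshold:.0f}ms")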

Fine Tuning Alerts

Like all good ideas, your efficient alert system needs to be tested. Simulation is a good place to start: create simulation rules based on previous events. The goal is to reduce the noise, because reducing noise means you’re more likely to produce relevant alerts. Simulation and noise reduction are not one-off events. You need to continue carrying out these exercises, fine-tuning your alerts until you have the most meaningful ones, and reviews should be periodic as environments change. I also suggest making it a weekly habit (before the weekend starts) to review the false positive ratio for the previous week. Spending an hour on tuning before the weekend can save you and your team a serious headache during the on-call weekend shift.

Similarly, paging events should be reviewed, including those ignored by administrators – this data can help you to refine rules to prevent false positives.
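
The weekly review mentioned above can be largely automated: replay last week’s alert history and compute a false positive ratio per rule, then spend the pre-weekend hour on the worst offenders. The record format below is an assumption for illustration.

    # Illustrative weekly review: false positive ratio per rule over last week's alerts.
    # Each record is assumed to carry the rule name and whether action was actually needed.
    from collections import Counter

    def false_positive_report(alert_log: list) -> dict:
        fired, false_pos = Counter(), Counter()
        for record in alert_log:
            fired[record["rule"]] += 1
            if not record["action_needed"]:
                false_pos[record["rule"]] += 1
        return {rule: false_pos[rule] / fired[rule] for rule in fired}

    last_week = [
        {"rule": "disk-pressure", "action_needed": True},
        {"rule": "disk-pressure", "action_needed": False},
        {"rule": "api-5xx", "action_needed": True},
        {"rule": "api-5xx", "action_needed": True},
        {"rule": "api-5xx", "action_needed": False},
    ]
    for rule, ratio in sorted(false_positive_report(last_week).items(), key=lambda item: -item[1]):
        print(f"{rule}: {ratio:.0%} false positives last week")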

Fine Tuning Top Tips

Here are some tips for fine-tuning your alerts so you can make sure they’re spot on and as effective as possible (a short sketch follows the tips).

#1 A Rule of rules: Alerts that are less than 50% accurate are broken; rules that keep false positives at or below a 10% threshold are good to go.

#2 A page too far: Get rid of extraneous paging events. If a page has fired and investigation shows nothing wrong, adjust the rule.

#3 The rise of the machine: Machine learning is perfectly placed to optimize alerts. Use human-curated rules, enhanced with machine learning algorithms, to create rules and fine-tune alerts.

#4 Repeat business: Take regular events, such as backups, into account when fine-tuning rules. If you have known maintenance going on, suppress alerts associated with that.

#5 Keeping Control: Set metrics for the on-call team and cap the volume of pages they receive, reviewing regularly and differentiating between the events a system generates and the alerts those events trigger.
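
As a rough illustration of Tips #1 and #4, the sketch below grades rule accuracy against the 50% and 10% thresholds and suppresses alerts that fire inside a known maintenance window. The window format and service names are assumptions for illustration.

    # Illustrative fine-tuning helpers: grade rule accuracy (Tip #1) and suppress
    # alerts during known maintenance windows such as backups (Tip #4).
    from datetime import datetime

    def grade_rule(true_positives: int, total_fired: int) -> str:
        accuracy = true_positives / total_fired if total_fired else 1.0
        if accuracy < 0.50:
            return "broken - rewrite or retire this rule"
        if accuracy >= 0.90:     # false positives at or below the 10% threshold
            return "healthy"
        return "needs tuning"

    MAINTENANCE_WINDOWS = [  # (service, start, end) - assumed format
        ("db", datetime(2016, 12, 31, 1, 0), datetime(2016, 12, 31, 3, 0)),
    ]

    def suppressed(service: str, fired_at: datetime) -> bool:
        return any(svc == service and start <= fired_at <= end
                   for svc, start, end in MAINTENANCE_WINDOWS)

    print(grade_rule(true_positives=12, total_fired=30))         # broken - rewrite or retire this rule
    print(suppressed("db", datetime(2016, 12, 31, 2, 15)))       # True: inside the known backup window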

Why It Pays to Structure Alerts Properly

The art of creating effective alert systems is down to using an intelligent approach. Keeping things simple, combining variables, and dampening down noise, coupled with prudent and mindful testing, will naturally result in improved alerts. Adding into that mix machine learning based on human curation will allow you to develop an optimized alert system that works for you, rather than against you.

In my next post I will look at improving alert response times.

Preparation is Key: Plan for Leap Second and Prevent Downtime This New Year’s Eve

2016 is a special year because it is one second longer than last year. This is because our blue planet is slowing down, just a little: each day now runs roughly 0.002 seconds longer than the standard 86,400 seconds. When you add this small change up over several years, this seemingly insignificant slowing becomes more significant.

The International Earth Rotation and Reference Systems Service (IERS) is the body tasked with making sure that the time on our clocks matches up with the actual time taken for the earth to rotate. To make up for the difference between earth clock time and earth rotation time, IERS occasionally adds a ‘leap second’ to the year on June 30 or December 31, at 23:59:59 Coordinated Universal Time (UTC). Since this practice began in 1972, 26 leap seconds have been added to our clock time to synchronize with the earth’s rotation.

For most people, adding a second to our clocks doesn’t matter much. This single second doesn’t alter the countdown on New Year’s Eve, and it doesn’t make a difference in the general course of the day. But for the tech industry, time is everything — and every second counts.

Accurate time-keeping is important; for example, it is necessary to maintain accurate records of commands, searches, and clicks. Furthermore, tracing the timelines of operations, second by second, can make a big difference when handling data. How this is achieved by different providers varies. Some adjust for the leap second by inserting an extra second on the clock, so that 23:59:60 appears between 23:59:59 and 00:00:00; others step the clock backward so that one second effectively repeats. Some ‘smear’ time, spreading the additional second across several hours of the day.

Regardless of the method, it’s easy to see why this can cause a problem if operating systems are not equipped to handle a minute with 61 seconds. The same is true if the systems have not been synchronized to apply the same method consistently. If leap second handling is not implemented correctly, many operations that occur in that extra second could go wrong.

WHY LEAP SECONDS MATTER

Making sure that systems are synced with UTC is vital for proper system functioning and for thwarting bugs. Ensuring leap second synchronization can be a challenge, and there have been many occasions where improperly synchronized systems have resulted in disruption and downtime.

The leap year, where we have a whole additional day, is comparable to the leap second, but on a larger scale. In a leap year, the extra day, which falls on February 29, catches some systems off-guard, causing issues. Programming for a leap year isn’t as simple as adding in the facility to manage that extra day, February 29. Systems can still trip up on December 31, which becomes the 366th day of the year, because improperly coded systems assume a year has only 365 days.
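
A toy reproduction of that 366th-day mistake, for illustration only (this is not the actual code behind any of the incidents below): a table sized for 365 days breaks on December 31 of a leap year.

    # Toy illustration of the 366th-day bug: a table sized for 365 days fails on
    # December 31 of a leap year.
    from datetime import date

    daily_schedule = [None] * 365               # buggy assumption: every year has 365 days

    day_of_year = date(2016, 12, 31).timetuple().tm_yday
    print(day_of_year)                          # 366 in a leap year

    try:
        daily_schedule[day_of_year - 1]         # index 365 does not exist in a 365-item list
    except IndexError:
        print("Day 366 breaks code that hard-codes a 365-day year")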

But whether it’s an additional day or an additional second, the problem is largely the same. Programs must be equipped to maintain their functions even when time changes, adjusting not only for that particular day or second but for the surrounding context too.

Below are several examples of outages and malfunctions that have occurred over the last few years because of a system’s inability to properly sync with UTC time:

2012 Microsoft Azure outage. 2012 was a leap year, and the leap day triggered a major outage lasting several hours. The root of the problem was a Microsoft Azure security certificate software bug. Microsoft Azure employs ‘guest agents’ (GAs) that integrate its platform with applications that run in VMs. Each GA creates a transfer certificate within a server that is valid for one year from the date of its creation. In the case of certificates created on February 29, 2012, their valid-to dates were set as February 29, 2013. However, as 2013 was not a leap year, there would be no February 29. This resulted in the certificate creation process failing after several attempts. This, in turn, led to the server’s ‘host agent’ assuming the existence of a hardware problem, which automatically flags the server as being faulty and moves it to a state called ‘Human Investigate’. In the meantime, service healing automatically reincarnates the downed VMs on other servers. However, in this case, the move continued to recreate the failed certificate problem on the new servers too. This ultimately led to a cascade of servers going down for several hours. It’s easy to imagine a similar situation occurring from a leap second, whether at Microsoft or elsewhere.

2012 TomTom GPS bug. TomTom had a similar leap year bug, which caused malfunctions in some of its GPS navigation devices. A bug in the GPS firmware left TomTom devices unable to determine their locations. It’s interesting to note that GPS time does not incorporate leap seconds either, which can lead to similar problems if not properly addressed.

2010 Sony PlayStation outage. PlayStation’s internal clocks mistakenly recognized 2010 as a leap year, resulting in the clocks being out of sync with real time. This resulted in error messages for users.

2008 Zune outage. Zune’s software became stuck in a faulty loop on December 31, 2008. The year was a leap year, and while the firmware had accounted for February 29, the code was unable to recognize a 366th day of the year, causing affected devices to freeze.

2008 Microsoft Exchange outage. A system crash affected users who had tried to restart the System Attendant between 00:00 UTC on February 29, 2008, and 00:00 UTC on March 1, 2008. Because the software was not configured properly in order to account for the leap day, a reporting error caused the outage.

POTENTIAL SOLUTIONS

Luckily, most services have understood the problem of the leap year and the leap second, and have developed ways to mitigate any issues that might arise. Some of the potential solutions are explained below:

Time smearing. This is a solution that avoids the problem of whether to repeat 23:59:59 or to double up on 23:59:60 and 00:00:00, as described above. Instead, the extra second is “smeared” across a longer period of time, by almost imperceptibly lengthening seconds throughout the day. This way, clocks and network systems operate under the assumption that there are still only 86,400 seconds in a day, and are oblivious to the fact that anything is different. Google is one of the companies that employ this method across all of its services and APIs. Google’s smear period will last from 14:00:00 UTC on December 31, and end at 10:00:00 UTC on January 1. Each second during that period will be 13.9 μs longer than a standard second. The smear will continue even after the leap second is inserted, in order to offset the slight discrepancy that occurs during that time. By 10:00:00, smeared time will have realigned itself with UTC time.
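
The 13.9 μs figure follows directly from the length of the smear window: one extra second spread over 20 hours (72,000 seconds). A quick sanity check:

    # Sanity-check the smear arithmetic: 1 extra second spread over a 20-hour window.
    smear_window_seconds = 20 * 3600            # 14:00 UTC Dec 31 to 10:00 UTC Jan 1
    extra_per_second = 1.0 / smear_window_seconds
    print(f"{extra_per_second * 1e6:.1f} microseconds added to each second")  # ~13.9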

NTP servers. Using Network Time Protocol (NTP) servers allows system clocks to sync with UTC time. Google provides these services for anyone who they state “needs to keep local clocks in sync with VM instances running on Google Compute Engine, to match the time used by Google APIs, or for those who just need a reliable time service.”

However, this comes with a number of caveats. For instance, Google recommends that all Compute Engine VMs use its NTP servers only, because other NTP servers may be unpredictable when it comes to handling the leap second. Moreover, Google warns against using a combination of its own NTP service and an external one, which could cause significant and unexpected problems with time recording. This, again, is largely due to the fact that there are several different methods of handling the leap second, and not all services smear time the way Google does.
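
If you want to verify what a host’s clock is actually doing around the leap second, you can script a quick offset check against Google Public NTP. The sketch below assumes the third-party ntplib package and the time.google.com endpoint; treat it as a diagnostic sketch rather than a monitoring solution.

    # Check this host's clock offset against Google Public NTP (which smears the leap second).
    # Assumes the third-party 'ntplib' package: pip install ntplib
    import ntplib

    def clock_offset_seconds(server: str = "time.google.com") -> float:
        client = ntplib.NTPClient()
        response = client.request(server, version=3)
        return response.offset      # positive offset: the local clock is behind the server

    offset = clock_offset_seconds()
    print(f"Offset vs time.google.com: {offset:+.3f} s")
    if abs(offset) > 0.5:
        print("Warning: this host may be syncing to a source that handles the leap second differently")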

Backward jump. In Linux, some kernels may use a backward jump to set the clock back by one second. This results in the clock reading 23:59:59, then 00:00:00, then 23:59:59 again. This is, for instance, implemented by Red Hat, the Amazon Linux AMI, and a number of Amazon Web Services, including CloudSearch clusters, EC2 Container Service instances, EMR clusters, RDS instances, and Redshift instances. (Other AWS resources may have their own clocks and may only be partially managed by AWS.)

Ignoring the leap second. Some services, such as Microsoft Azure, which runs the Windows Time service, do not handle the leap second. Windows Time syncs with UTC but does not acknowledge the leap second, meaning that systems running Microsoft Azure are one second ahead of UTC after the leap second occurs. Because Windows Time syncs regularly with UTC, the discrepancy is corrected at the next synchronization.

THE PROBLEM WITH THESE SOLUTIONS

Although these are all viable solutions for addressing the leap second, the problem is that there isn’t a single universal solution. Even providers who use time smearing, for instance, don’t all use the same smear: UTC-SLS uses a linear smear over 1,000 seconds before the leap; Google used a 20-hour cosine smear in 2008; Bloomberg uses a linear smear for 2,000 seconds after the leap; and Amazon, Microsoft, and Akamai all use 24-hour smears.

Therefore, companies that run their infrastructures across multiple cloud providers, for example, might encounter some real issues when the systems they rely on are not all operating on the same time. Companies in this situation must pay close attention to the implementations their providers are using, and ensure they are compatible.

Google plans to adopt a proposed universal 24-hour smear for the next leap second (after this year’s) which will help keep changes in sync across systems and make the changes to each second even smaller. But for now, there remain plenty of discrepancies from provider to provider.

RECOMMENDATIONS

In order to avoid the sorts of failures and incompatibilities described above, I recommend taking the following steps in advance:

  1. Configure all of your network settings to use NTP with Google’s public NTP servers. Do this across all of your cloud providers or local data centers; Google’s servers will enact a 20-hour time smear to account for the leap second. Note: if you are using Google Cloud, this is already done for you, including your Kubernetes clusters. However, if you have hardcoded your NTP server, make sure to bring it back to Google’s default.

  2. For internal services, we recommend applying operating system patches that match either the Google smear algorithm or the backward jump (depending on your infrastructure requirements).

  3. If you are only using a single cloud provider such as AWS, you should not implement Google NTP servers. The reason is that AWS managed services such as RDS, ElastiCache, the EC2 Linux AMI, and others implement the backward jump fix, which is not compatible with the Google smearing algorithm.

  4. Make sure your On-Call person/team for the night between December 31 and January 1 is well prepared, and all escalation channels are well aware of the leap second situation. This coincides with the worst night for potential production issues, as often team members will be celebrating the New Year with limited access to their laptops.

It is, however, important to understand that this is not a huge problem if your teams are well prepared in advance. Now is the time to update your production settings and images, and to bring awareness to your teams. We all want to celebrate the end of 2016, so let’s not wait until the last second – it may be way too late this time.