Preparation is Key: Plan for Leap Second and Prevent Downtime This New Year’s Eve
2016 is a special year because it is one second longer than it would otherwise be. This is because our blue planet’s rotation is slowing down, just a little: a day now runs about 0.002 seconds longer than the 86,400 seconds on our clocks. Added up over several years, this insignificant difference becomes significant.
The International Earth Rotation and Reference Systems Service (IERS) is the body tasked with making sure that the time on our clocks matches up with the actual time taken for the earth to rotate. To make up for the difference between earth clock time and earth rotation time, IERS occasionally adds a ‘leap second’ to the year on June 30 or December 31, immediately after 23:59:59 Coordinated Universal Time (UTC). Since this practice began in 1972, 26 leap seconds have been added to our clock time to synchronize with the earth’s rotation.
For most people, adding a second to our clocks doesn’t matter much. This single second doesn’t alter the countdown on New Year’s Eve, and it doesn’t make a difference in the general course of the day. But for the tech industry, time is everything — and every second counts.
Accurate time-keeping is important; for example, it is necessary to maintain accurate records of commands, searches, and clicks. Furthermore, tracing the timelines of operations, second by second, can make a big difference when handling data. How providers achieve this varies. Some adjust to the leap second by adding an extra second to the clock, so that 23:59:60 appears before 00:00:00. Others step the clock backward so that 23:59:59 appears twice. Still others ‘smear’ time, spreading the additional second across several hours of the day.
Regardless of the method, it’s easy to see why this can cause a problem if operating systems are not equipped to handle a minute with 61 seconds. The same is true if the systems have not been synchronized to apply the same method consistently. If leap second handling is not implemented correctly, many operations that occur in that extra second could go wrong.
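To make the 61-second minute concrete, here is a minimal Python sketch. It shows that the standard datetime type rejects a seconds value of 60, along with one possible workaround that rolls a 23:59:60 timestamp forward to the next second. The helper names and the input format are assumptions made for illustration, and rolling forward is only one of several possible policies, not the method of any particular provider.

```python
from datetime import datetime, timedelta, timezone


def datetime_accepts_second_60() -> bool:
    """Return True if datetime can be built with second=60 (it cannot)."""
    try:
        datetime(2016, 12, 31, 23, 59, 60)
    except ValueError:
        return False
    return True


def parse_utc_timestamp(text: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SS' as UTC, tolerating a leap second (SS == 60).

    Hypothetical helper: because datetime cannot represent second 60, a leap
    second is mapped onto the first second of the following minute.
    """
    date_part, time_part = text.split("T")
    year, month, day = (int(p) for p in date_part.split("-"))
    hours, minutes, seconds = (int(p) for p in time_part.split(":"))
    is_leap_second = seconds == 60
    if is_leap_second:
        seconds = 59  # clamp so that datetime accepts the value
    parsed = datetime(year, month, day, hours, minutes, seconds,
                      tzinfo=timezone.utc)
    # Roll the clamped leap second forward by one second.
    return parsed + timedelta(seconds=1) if is_leap_second else parsed
```

A log pipeline that receives 2016-12-31T23:59:60 from an upstream system would crash on the plain constructor but survive with a tolerant parser like this one, at the cost of two distinct instants sharing one timestamp.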
WHY LEAP SECONDS MATTER
Making sure that systems are synced with UTC is vital for proper system functioning and for preventing bugs. Ensuring leap second synchronization can be a challenge, and there have been many occasions where improperly synchronized systems have caused disruption and downtime.
The leap year, where we have a whole additional day, is comparable to the leap second, but on a larger scale. In a leap year, the extra day, which falls on February 29, catches some systems off-guard, causing issues. Programming for a leap year isn’t as simple as adding the ability to handle February 29. Systems can still trip up on December 31, which becomes the 366th day of the year, because improperly coded systems assume that a year has only 365 days.
But whether it’s an additional day or an additional second, the problem is largely the same. Programs must be equipped to keep functioning through a change in time, adjusting not only for that particular day or second but for the surrounding context too.
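As an illustration of the 366th-day trap, the following Python sketch converts a running day count into a year and a day of that year without assuming 365-day years. The function names are hypothetical, and the code is a generic example rather than the logic of any product mentioned in this article.

```python
import calendar
from typing import Tuple


def days_in_year(year: int) -> int:
    """Return 366 for leap years and 365 otherwise."""
    return 366 if calendar.isleap(year) else 365


def split_day_count(days: int, start_year: int) -> Tuple[int, int]:
    """Convert a 1-based day count from Jan 1 of start_year to (year, day_of_year).

    The loop compares against the real length of each year, so day 366 of a
    leap year is a valid result instead of an unhandled case.
    """
    year = start_year
    while days > days_in_year(year):
        days -= days_in_year(year)
        year += 1
    return year, days
```

For example, split_day_count(366, 2008) stays in 2008 as day 366, while the same count starting in 2009 rolls over to day 1 of 2010.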
Below are several examples of outages and malfunctions that have occurred over the last few years because systems were unable to handle a leap adjustment properly:
2012 Microsoft Azure outage. 2012 was a leap year, and a leap-day bug caused a major outage lasting several hours. The root of the problem was a bug in Microsoft Azure’s security certificate software. Microsoft Azure employs ‘guest agents’ (GAs) that integrate its platform with applications that run in VMs. Each GA creates a transfer certificate within a server that is valid for one year from the date of its creation. For certificates created on February 29, 2012, the valid-to date was set to February 29, 2013. However, as 2013 was not a leap year, that date does not exist, and certificate creation failed. After several failed attempts, the server’s ‘host agent’ assumed a hardware problem, automatically flagged the server as faulty, and moved it to a state called ‘Human Investigate’. In the meantime, service healing automatically reincarnated the downed VMs on other servers. However, in this case, the move simply recreated the failed certificate problem on the new servers. This ultimately led to a cascade of servers going down for several hours. It’s easy to imagine a similar situation arising from a leap second, whether at Microsoft or elsewhere.
2012 TomTom GPS bug. TomTom had a similar leap year bug, which caused malfunctions in some of its GPS navigation devices. A bug in the GPS firmware left the affected devices unable to determine their locations. It’s interesting to note that GPS time does not incorporate leap seconds, which can lead to similar problems if the offset from UTC is not properly addressed.
2010 Sony PlayStation outage. PlayStation’s internal clocks mistakenly recognized 2010 as a leap year, resulting in the clocks being out of sync with real time. This resulted in error messages for users.
2008 Zune outage. Zune’s software became stuck in a faulty loop on December 31, 2008. The year was a leap year, and while the firmware had accounted for February 29, the code was unable to recognize a 366th day of the year, resulting in system-wide crashes.
2008 Microsoft Exchange outage. A system crash affected users who tried to restart the System Attendant between 00:00 UTC on February 29, 2008, and 00:00 UTC on March 1, 2008. Because the software did not account for the leap day, a reporting error caused the outage.
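The certificate failure in the Azure example above comes down to adding one year to February 29. A minimal Python sketch of a safer calculation follows. The function name is hypothetical, and falling back to February 28 is an assumed policy chosen for illustration (March 1 would be equally defensible); it is not a description of Microsoft’s actual fix.

```python
from datetime import date


def one_year_later(start: date) -> date:
    """Return the same calendar date one year on, handling February 29."""
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # February 29 has no counterpart in a non-leap year.
        # Assumed policy: fall back to February 28 of the following year.
        return start.replace(year=start.year + 1, day=28)
```

With this approach, a certificate issued on date(2012, 2, 29) receives a valid-to date of date(2013, 2, 28) instead of triggering an error.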
POTENTIAL SOLUTIONS
Luckily, most services have understood the problem of the leap year and the leap second, and have developed ways to mitigate any issues that might arise. Some of the potential solutions are explained below:
Time smearing. This solution avoids the choice between repeating 23:59:59 and inserting 23:59:60 before 00:00:00, as described above. Instead, the extra second is “smeared” across a longer period by almost imperceptibly lengthening each second in that period. This way, clocks and network systems operate under the assumption that there are still only 86,400 seconds in a day, and are oblivious to the fact that anything is different. Google is one of the companies that employ this method across all of its services and APIs. Google’s smear will run from 14:00:00 UTC on December 31 to 10:00:00 UTC on January 1. Each second during that period will be about 13.9 μs longer than a standard second (one second spread over the 72,000 seconds in 20 hours). The smear continues after the leap second is inserted, so that smeared time gradually converges on UTC; by 10:00:00, the two will have realigned.
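The 13.9 μs figure follows directly from the length of the smear window. The Python sketch below reproduces the arithmetic and shows how much of the leap second has been absorbed at a given point in the window. It assumes a linear smear, which matches the uniform per-second stretch described above; the function names are hypothetical and this is not Google’s implementation.

```python
SMEAR_WINDOW_SECONDS = 20 * 60 * 60  # 20 hours = 72,000 clock seconds


def extra_microseconds_per_second(window_seconds: int = SMEAR_WINDOW_SECONDS) -> float:
    """Extra length of each smeared second, in microseconds."""
    return 1_000_000 / window_seconds


def smear_fraction_absorbed(elapsed_clock_seconds: float,
                            window_seconds: int = SMEAR_WINDOW_SECONDS) -> float:
    """Fraction of the leap second absorbed after the given time in the window.

    Assumes a linear smear; values outside the window are clamped to 0 or 1.
    """
    clamped = min(max(elapsed_clock_seconds, 0), window_seconds)
    return clamped / window_seconds
```

Ten hours into the window, which is midnight UTC for a smear that starts at 14:00:00, half of the extra second has been absorbed.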
NTP servers. Pointing systems at Network Time Protocol (NTP) servers allows their clocks to sync with UTC. Google provides this service for anyone who, in its words, “needs to keep local clocks in sync with VM instances running on Google Compute Engine, to match the time used by Google APIs, or for those who just need a reliable time service.”
However, this comes with a number of caveats. For instance, Google recommends that all Compute Engine VMs use only Google’s NTP servers, because other NTP servers may be unpredictable when it comes to handling the leap second. Moreover, Google warns against combining its own NTP service with an external one, which could cause significant and unexpected problems with time recording. This, again, is largely because there are several different methods of handling the leap second, and not all services smear time the way Google does.
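One way to guard against accidentally mixing time sources is to check a server list before deploying it. The Python sketch below is a hypothetical configuration check: the set of hostnames treated as Google’s smearing NTP service is an assumption and should be verified against Google’s current documentation before use.

```python
from typing import Iterable

# Assumption: hostnames treated as Google's leap-smearing NTP service.
GOOGLE_NTP_HOSTS = {
    "time.google.com",
    "time1.google.com",
    "time2.google.com",
    "time3.google.com",
    "time4.google.com",
    "metadata.google.internal",
}


def mixes_time_sources(servers: Iterable[str]) -> bool:
    """Return True if the list combines Google NTP hosts with any other host."""
    normalized = {s.strip().lower().rstrip(".") for s in servers}
    google = normalized & GOOGLE_NTP_HOSTS
    other = normalized - GOOGLE_NTP_HOSTS
    return bool(google) and bool(other)
```

A check like this could run in a deployment pipeline, so that a configuration listing both time.google.com and a non-smearing pool server is rejected before the leap second arrives.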
Backward jump. Some Linux kernels step the clock backward by one second. This results in the clock reading 23:59:59, then 00:00:00, then 23:59:59 again. This is, for instance, implemented by Red Hat, the Amazon Linux AMI, and a number of Amazon Web Services, including CloudSearch clusters, EC2 Container Service instances, EMR clusters, RDS instances, and Redshift instances. (Other AWS resources may have their own clocks and may be only partially managed by AWS.)
Ignoring the leap second. Some services, such as Microsoft Azure, which runs the Windows Time service, do not handle the leap second. Windows Time syncs with UTC but does not acknowledge the leap second, meaning that systems running on Microsoft Azure are one second ahead of UTC after the leap second occurs. Because Windows Time syncs regularly with UTC, the discrepancy is corrected at the next synchronization.
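When the wall clock can step backward, any duration computed from two wall-clock readings can come out negative or too short. A common defensive measure, sketched below in Python, is to measure elapsed time with a monotonic clock. The helper name is hypothetical; time.monotonic is a standard-library call that is documented as unaffected by system clock updates.

```python
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds) using a monotonic clock.

    time.monotonic() cannot go backward, so the elapsed value stays
    non-negative even if the wall clock is stepped back during the call.
    """
    start = time.monotonic()
    result = func(*args, **kwargs)
    elapsed = time.monotonic() - start
    return result, elapsed
```

Wall-clock time is still needed for timestamps in logs and records; the monotonic clock is only a replacement for measuring intervals such as timeouts and latencies.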
THE PROBLEM WITH THESE SOLUTIONS
Although these are all viable solutions for addressing the leap second, the problem is that there isn’t a single universal solution. Even providers who use time smearing, for instance, don’t all use the same smear: UTC-SLS uses a linear smear over 1,000 seconds before the leap; Google used a 20-hour cosine smear in 2008; Bloomberg uses a linear smear for 2,000 seconds after the leap; and Amazon, Microsoft, and Akamai all use 24-hour smears.
Therefore, companies that run their infrastructure across multiple cloud providers, for example, might encounter some real issues when the systems they rely on are not all operating on the same time. Companies in this situation must pay close attention to the implementations their providers are using, and ensure they are compatible.
Google plans to adopt a proposed universal 24-hour smear for the next leap second (after this year’s), which will help keep changes in sync across systems and make the changes to each second even smaller. But for now, there remain plenty of discrepancies from provider to provider.
RECOMMENDATIONS
In order to avoid the sorts of failures and incompatibilities described above, I recommend taking the following steps in advance:
Configure NTP on all of your systems to use Google’s public NTP servers. Do this across all of your cloud providers and local data centers; this will apply a 20-hour time smear to account for the leap second. Note: if you are using Google Cloud, this is already done for you, including on your Kubernetes clusters. However, if you have hardcoded your NTP server, make sure to revert it to Google’s default.
For internal services, I recommend patching your operating systems to match either the Google smear algorithm or the backward jump (depending on your infrastructure requirements).
If you are using only a single cloud provider such as AWS, you should not use Google’s NTP servers. This is because AWS managed services such as RDS, ElastiCache, EC2 Linux AMI, and others will implement the backward jump, which is not compatible with the Google smearing algorithm.
Make sure your On-Call person/team for the night between December 31 and January 1 is well prepared, and all escalation channels are well aware of the leap second situation. This coincides with the worst night for potential production issues, as often team members will be celebrating the New Year with limited access to their laptops.
It is, however, important to understand that this is not a huge problem if your teams are well prepared in advance. Now is the time to update your production settings and images, and to raise awareness among your teams. We all want to celebrate the end of 2016, so let’s not wait until the last second, as it may be way too late this time.