Guy Fighel

Multiple DNS Providers: How to Avoid a Single Point of Failure

Last month, the largest distributed denial of service (DDoS) attack on record rocked the Internet, taking down many major websites such as Twitter, Amazon, The New York Times, and PayPal. Attributed to prominent hacker groups, the three attack waves targeted Dyn, a popular domain name system (DNS) provider that much of the Internet relies on to reach websites. And while it may be easy to blame Dyn’s security measures or other weaknesses for the outages, the truth is that the October 21 attack wasn’t a problem specific to Dyn or any of the affected websites.

Over the last decade or so, every major DNS provider has experienced DDoS attacks and periods of downtime. These attacks are nearly inevitable, and yet most companies haven’t acknowledged that they’re likely to be attacked again, nor have they properly prepared or learned from previous experiences. DNS itself is not the root cause of the outages, though the way we rely on DNS providers is a contributing factor. Rather, the main problem is that we design systems that are not resilient.

One solution for preventing future DNS outages (especially important during the peak holiday season) is for companies to build redundancy into their DNS configurations by relying on more than one provider. Establishing this as a default would help keep websites up even while under attack and maintain continuous service for users.

What went wrong on October 21

A DDoS attack is usually orchestrated by a “botnet,” a group of malware-infected machines that repeatedly attempt to access a server with large amounts of traffic until the server is overloaded and unable to function. This particular Dyn attack was notable for two reasons. First, rather than using a typical botnet consisting of infected computers, this attack employed the Mirai botnet, which uses other Internet-connected devices like digital cameras, baby monitors, printers, and DVRs to flood the target. Second, because the Mirai botnet had access to such a large number of devices, the attack could be much faster and much larger than a typical DDoS attack, which usually targets individual websites rather than entire swaths of the Internet. This attack is said to have been twice as large as any previous attack of this kind.

Over the course of three separate, sustained attacks throughout the day—at 7:00 am, 11:52 am, and 4:00 pm EDT—at least 70 major websites experienced prolonged outages. Affected websites included news outlets (e.g., The New York Times, The Wall Street Journal, Vox Media, Fox News, and The Guardian), entertainment services (e.g., Netflix, HBO, Spotify, and Xbox Live), social media sites (e.g., Twitter, Pinterest, Reddit, and Tumblr), financial services (e.g., Visa and PayPal), and other major websites like Amazon, Airbnb, and Verizon, among dozens of others.

On all three occasions, the botnet flooded Dyn’s DNS servers with malicious lookup requests. This, compounded with legitimate retry activity as resolvers kept trying to connect, led to up to 20 times the normal amount of traffic. Dyn estimates that the attack involved up to 100,000 malicious endpoints.

The attack was traced to the groups Anonymous and New World Hackers, which claimed to be retaliating for Ecuador’s denial of Internet service to WikiLeaks founder Julian Assange. Dyn finally mitigated the damage and restored service across all affected websites within a couple of hours of the third wave. Dyn also enabled protective measures immediately thereafter and said this attack helps propel an important discussion on security.

But security itself may not be the real issue at hand. Moreover, though this may have been the largest DDoS attack to date, it was by no means the first. In other words, we’ve seen attacks like this over and over again in recent years. Why haven’t we learned enough from them to protect ourselves?

History repeats itself

The last decade has seen a number of orchestrated DDoS attacks on many of the major DNS providers. This just goes to show that no provider is immune—but also that no provider is directly at fault. This section outlines some of the major DDoS attacks that have occurred over the last several years.

UltraDNS/Neustar, 10/15/16. Less than a week before the Dyn outage, servers connected to the UltraDNS provider became unreachable, taking down numerous popular websites, including Netflix, Expedia, and bank websites. The outage lasted from 1:25 to 3:45 pm PDT and affected users in both the United States and Europe. At the peak of the outage, the servers were clearly incapable of processing the overwhelming number of requests they were receiving. UltraDNS uses Anycast, which means that many servers answer on a single shared IP address; each domain hosted by UltraDNS had six name servers responding to queries. Even though it turned out that this outage was caused by internal errors and not a DDoS attack, the problem was the same one that Dyn experienced: the DNS name servers were unable to successfully resolve queries, resulting in the website outages.

NS1, 5/16/16. NS1 suffered a sustained DDoS attack that crashed various websites because the attackers combined a number of strategies, including high-volume traffic, malicious direct DNS queries, random label attacks, and malformed packet attacks. The attack also included broadly sourced queries for real customer domains and their variations, along with specific attacks further upstream. In its postmortem, NS1 determined that the provider itself was the target of the attack, rather than any of its customers. The DDoS attack moved between NS1’s networks across continents, and this evolving mobility made the outages more difficult to mitigate. Moreover, malicious traffic disguised as legitimate traffic increased the load that the delivery system had to bear, and frequent retry attempts for DNS lookups exacerbated the issues the servers were experiencing. Even though NS1, like many DNS providers, already had measures in place to minimize the impact of DDoS attacks (and has successfully deflected smaller-scale attacks with no effect on users), an attack this large, which multiplied traffic by factors of thousands or tens of thousands, still resulted in some packet loss and service failure.

Amazon Route 53/CloudFront, 11/26/14. Route 53, the DNS service for Amazon Web Services (AWS), experienced connection failures with AWS’s CDN service, CloudFront. Users reported errors when they made DNS queries for any CloudFront sites over a period of a couple of hours. The problem occurred when the two services were used together: queries for records unrelated to CloudFront were returned without a problem, while any CloudFront queries, or those with CloudFront aliases, could not be answered by the Route 53 servers.

DNSimple, 9/19/14. This attack on DNSimple began slowly, with attacks on individual customers’ domain names. Then, however, the attack escalated to target multiple DNSimple data centers more broadly. This led to an outage of many customer websites, and the attack subsided once DNSimple asked a targeted customer to change the delegation for its domain to another provider. Nonetheless, effects of the outage persisted because DNSimple’s own name servers could not keep up with the backlog of queries that had accumulated during the peak of the attack. Other name servers continued responding, but slowly. After determining which data centers were or were not responding to queries, DNSimple was able to pull the failed name servers from the network, repair them, and restore service to normal working conditions.

DNSimple, easyDNS, and TPP Wholesale, 6/3/13. A series of attacks that may or may not have been related took down three separate DNS providers and lasted multiple days. As in the previous examples, these companies’ name servers were overwhelmed with malicious traffic that was difficult to distinguish from legitimate traffic (which itself increased as clients retried requests against downed sites). These attacks, too, targeted the DNS providers themselves on a large scale rather than homing in on particular websites or customers. As a result, customers experienced long outages across the network. DNSimple noted that its servers were used as amplifiers for the attack, a technique in which numerous queries are sent to DNS servers with the victim’s spoofed IP address in a short period of time. Amplification allows large-scale, longer attacks to be carried out with ease. Controlling these attacks was not simple: easyDNS was only able to partially mitigate the problem right away, and TPP Wholesale was forced to rate-limit its DNS queries, a risky choice that can result in a denial of service for real customers.

CloudFlare, 3/3/13. After CloudFlare detected a DDoS attack targeting one of its customers, a profile of the attack was sent to CloudFlare’s Juniper edge routers via the Flowspec protocol in order to keep the attack from spreading by rerouting traffic as necessary. However, this action caused an outage of its own: the routers crashed when they encountered the new rule.

The solution is redundancy

Though many of the outages described above were caused by targeted DDoS attacks, focusing on security as a solution is misguided. No matter how robust a company’s security measures, future attacks are inevitable. Rather, the solution to thwarting attacks like these and maintaining continuous service for customers is to work with multiple DNS providers to build up resiliency and redundancy.

The simplest step toward establishing redundancy is using multiple name servers. Many DNS providers already encourage this, usually suggesting that four to six separate name servers be configured for each domain name. This ensures that if one of the name servers fails, the next one on the list is tried, and so on until a connection can be made.

However, though this kind of name server redundancy is vital, a company is far more likely to survive an external attack if it uses not only redundant name servers but also redundant DNS providers. Setting up a secondary DNS provider improves the likelihood of a functioning domain name, specifically in situations like DDoS attacks when entire DNS providers are targeted indiscriminately, shutting down all of their name servers. For example, if a company’s domain name is configured to four different DNS name servers, but all from the same managed provider, and that DNS provider is subjected to a major DDoS attack, then the queries cycle through servers 1-4 with no luck, because all those servers are down. However, if the company also has that same domain name pointing to another provider’s name servers, then after the first four servers fail to return a query, the next server on the list has a much higher likelihood of success. It belongs to an entirely different provider, which is not likely to be under duress at the same time. Simply put, providing a larger pool of unrelated server options increases the chances of surviving an attack.
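To make that fallback behavior concrete, here is a minimal sketch (using the dnspython library and the hypothetical domain example.com) of what a resolver effectively does: look up the domain’s NS records, then try each name server in order until one of them answers.

import dns.exception
import dns.message
import dns.query
import dns.resolver


def resolve_with_fallback(domain, record_type='A', timeout=2.0):
    """Try each of the domain's name servers in turn; return the first answer."""
    for ns in dns.resolver.query(domain, 'NS'):
        ns_host = ns.to_text()
        try:
            # Resolve the name server's own address, then query it directly.
            ns_ip = list(dns.resolver.query(ns_host, 'A'))[0].to_text()
            request = dns.message.make_query(domain, record_type)
            response = dns.query.udp(request, ns_ip, timeout=timeout)
            if response.answer:
                return ns_host, response.answer
        except (dns.exception.DNSException, OSError):
            # This server (or its whole provider) is down -- try the next one.
            continue
    return None, []


if __name__ == '__main__':
    server, answer = resolve_with_fallback('example.com')  # hypothetical domain
    print('Answered by {}: {}'.format(server, answer))

If every name server in the NS set belongs to a single provider that is under attack, every iteration of this loop fails; spreading the set across two providers is what gives the loop a server that can still answer.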

This strategy worked for some websites during the October 21 Dyn attack. By configuring their domain names with servers from two different providers, these companies’ websites were able to stay up and running while others crashed: resolvers that could not reach Dyn’s name servers simply fell back to the other provider’s functioning servers instead.

The trick, however, to maintaining primary and secondary DNS providers is making sure that the name server records match across both providers. Every time a name server responds to a lookup, it also sends records that authenticate its response, and these records must be consistent across all name servers for the domain name to function properly. In practice, this means that a company’s zone records must exist at every managed provider it uses and must be kept in sync. However, not all providers allow their name server records to be edited, which can make it difficult to establish the kind of redundancy needed. If records aren’t editable, an administrator may have to push changes to both providers manually, which is inconvenient, time-consuming, and likely to introduce inconsistencies. Route 53 does not allow these records to be edited, and neither does DNSimple, although after October 21, DNSimple is reconsidering this policy and working to add the necessary support for record editing. However, since Route 53, Google Cloud DNS, and many other providers expose APIs, it’s possible to automate record replication by updating the records in a single source of truth and pushing the changes to the other providers via their create/update APIs.
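As a rough illustration of that single-source-of-truth approach, here is a hedged sketch built on boto3’s Route 53 API; the hosted-zone ID, record values, and the push_to_secondary_provider helper are hypothetical placeholders, and a real tool would call the second provider’s own API at that point.

import boto3


def upsert_route53_record(zone_id, name, rtype, ttl, values):
    """Create or update a record in Route 53 through its public API."""
    client = boto3.client('route53')
    client.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            'Changes': [{
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': name,
                    'Type': rtype,
                    'TTL': ttl,
                    'ResourceRecords': [{'Value': v} for v in values],
                },
            }],
        },
    )


def push_to_secondary_provider(name, rtype, ttl, values):
    """Placeholder: call the second provider's create/update API here
    (for example, a Google Cloud DNS change) so both zones stay in sync."""
    raise NotImplementedError


if __name__ == '__main__':
    # Hypothetical record and zone ID, for illustration only.
    record = ('www.example.com.', 'A', 3600, ['203.0.113.10'])
    upsert_route53_record('Z123EXAMPLE', *record)
    push_to_secondary_provider(*record)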

Another consideration to keep in mind is that not all providers operate in the same way or offer the same functionality. For example, some of NS1’s advanced features, such as ASN and IP Prefix Fencing, Sticky Shuffle, and Pulsar, might not be available from other providers, meaning that the shared records have to rely on both providers’ lowest common denominator. Interoperability among DNS providers is a must in order to stop the damaging effects of DDoS attacks.

So did companies take action after last month’s failure? Sadly, not really. We analyzed the top 500 most-visited websites according to Moz and found that 77% are still running their main DNS service with no redundancy. Even if you exclude sites such as Facebook, Google, YouTube, Apple, and variations of those (since they run their own dedicated networks with very high levels of redundancy), we found that over 50% of the most-visited sites are still running their entire infrastructure with a single point of failure. The simple script below is what we used to check each domain’s NS records.

import csv
import re

import dns.resolver

# Second-level labels (e.g. the "co" in example.co.uk) that should be
# skipped when extracting the provider name from an NS hostname.
TOP_LEVEL_NAMES = ['co', 'net', 'org', 'edu', 'gov', 'att', 'go', 'ad']


def is_different_dns(domain):
    """Return (True, ns_list) if the domain's NS records span more than one
    provider, (False, ns_list) if they all belong to one provider, or
    (None, []) if the lookup fails."""
    dns_list = []
    try:
        answers = dns.resolver.query(domain, 'NS')
    except (dns.resolver.NoAnswer, dns.resolver.NoNameservers):
        return None, []

    first_server_name = ''
    for rdata in answers:
        dns_list.append(rdata.to_text())
    for rdata in answers:
        labels = rdata.to_text().split('.')
        if labels[-3] in TOP_LEVEL_NAMES:
            server_name = labels[-4]
        else:
            server_name = labels[-3]
        server_name = re.sub(r'\d+', '', server_name)  # ns1/ns2 etc. are the same provider
        if first_server_name == '':
            first_server_name = server_name
        elif first_server_name != server_name:
            return True, dns_list
    return False, dns_list


if __name__ == '__main__':
    single_provider_counter = 0
    none_counter = 0
    with open('links.csv', 'r') as input_file, open('domain_status.csv', 'w') as output_file:
        reader = csv.reader(input_file)
        writer = csv.writer(output_file)
        for line in reader:
            has_more_than_one, dns_list = is_different_dns(line[1].replace('/', ''))
            writer.writerow([line[1], has_more_than_one, dns_list])
            if has_more_than_one is None:
                none_counter += 1
            elif not has_more_than_one:
                single_provider_counter += 1
        writer.writerow(['It seems that {} out of 500 websites checked rely on a single '
                         'DNS provider. {} websites did not respond.'.format(
                             single_provider_counter, none_counter)])

Practical recommendations for redundant DNS configuration

1. My personal preference is to consolidate all domain records under a single robust registrar. That registrar, in my opinion, should be somewhat neutral (meaning its main line of business is not selling you domain names); I picked Google Domains. There are some major benefits to using Google:

  • Private registration is completely free.

  • Migration from your current registrar is automatic, easy, and fast.

  • It is backed by Google DNS infrastructure.

  • It is highly available and secure.

  • Most importantly, it supports a very large number of NS records so you can set up multiple different managed providers as well as your own name servers.

Please note that moving your registrar to Google doesn’t mean you have to use Google as your name server. It’s just an easy way to manage everything in a single reliable place.

2. Use at least two major managed DNS providers. You should always confirm that the provider’s network supports Anycast for higher reliability. My personal preference is to use at least four different NS records, mixing the two managed providers and setting up two servers from each (an example layout follows this list). Note that the order of the servers matters: if one NS fails, requests will still be resolved by the others, but if the first two servers belong to the same provider and that provider fails, the request will eventually be answered by the second provider, only after a longer delay.

3. Set the right TTL. I usually set a fairly short TTL on any record (other than SOA and NS) that has the potential to be changed or updated. I find 3,600 seconds practical as a default TTL: it is not extremely low, but it lets you make fast changes without records being cached for too long. For the SOA record, I typically use a seven-day TTL, as there are no caching implications for the secondary, which reads the SOA on receipt of a NOTIFY message or when the refresh value is reached. For NS records, I recommend a 14,400-second TTL (four hours). By setting it to four hours and combining this configuration with multiple providers and name servers, you are nearly guaranteed to keep answering DNS requests even in the event of a failure: the other provider’s name servers keep responding, and during those four hours you can replace the unavailable provider while continuing to serve requests.

4. Test your infrastructure. Don’t wait for a failure; produce one. Always test your DNS infrastructure, and keep monitoring it for failures and changes in response times. This can be done using the dig command (see the examples below).

5. Build tools. As mentioned before, in order to support a multi-provider strategy, you must keep records synchronized across all providers. The fact that not all providers support secondary zones doesn’t mean you cannot build simple scripts against the providers’ APIs: once you push a change, the script calls the public API of each provider and creates, updates, or deletes the record everywhere, as sketched earlier in this post. For example, both Google Cloud and AWS offer public APIs that support any DNS record manipulation.
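To make recommendations 2 and 3 concrete, here is a hypothetical record layout shown in the same dig-style notation used later in this post; the provider names, address, and serial number are placeholders. The NS set mixes two servers from each of two managed providers, the NS records carry the four-hour TTL, the SOA carries the seven-day TTL, and an ordinary A record sits at the 3,600-second default.

example.com. 3600 IN A 203.0.113.10
example.com. 14400 IN NS ns1.provider-a.example.
example.com. 14400 IN NS ns2.provider-a.example.
example.com. 14400 IN NS ns1.provider-b.example.
example.com. 14400 IN NS ns2.provider-b.example.
example.com. 604800 IN SOA ns1.provider-a.example. hostmaster.example.com. (
2016112101 ; serial
7200 ; refresh (2 hours)
900 ; retry (15 minutes)
1209600 ; expire (2 weeks)
86400 ; minimum (1 day)
)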

Tools and how to test your DNS

I’m a big fan of the dig command as a DNS information and troubleshooting tool. It can be used to get all of the information related to a single DNS record or an entire zone.
It lets you perform any valid DNS query; the most common are A (the IP address), TXT (text annotations), MX (mail exchangers), NS (name servers), or ANY.

# get the address(es) for cnn.com
dig cnn.com A +noall +answer

# get a list of CNN’s mail servers
dig cnn.com MX +noall +answer

# get a list of DNS servers authoritative for cnn.com
dig cnn.com NS +noall +answer

# get all of the above
dig cnn.com ANY +noall +answer

Another very useful option is +multiline, which returns the answer (including the SOA record) in a verbose multi-line format with human-readable comments.
$ dig +nocmd cnn.com any +multiline +noall +answer
;; Truncated, retrying in TCP mode.
cnn.com. 59 IN A 151.101.128.73
cnn.com. 59 IN A 151.101.64.73
cnn.com. 59 IN A 151.101.192.73
cnn.com. 59 IN A 151.101.0.73
cnn.com. 3599 IN NS ns-1086.awsdns-07.org.
cnn.com. 3599 IN NS ns-1630.awsdns-11.co.uk.
cnn.com. 3599 IN NS ns-47.awsdns-05.com.
cnn.com. 3599 IN NS ns-576.awsdns-08.net.
cnn.com. 3599 IN NS pdns1.ultradns.net.
cnn.com. 3599 IN NS pdns2.ultradns.net.
cnn.com. 3599 IN NS pdns3.ultradns.org.
cnn.com. 3599 IN NS pdns4.ultradns.org.
cnn.com. 3599 IN NS pdns5.ultradns.info.
cnn.com. 3599 IN NS pdns6.ultradns.co.uk.
cnn.com. 899 IN SOA ns-47.awsdns-05.com. awsdns-hostmaster.amazon.com. (
1 ; serial
7200 ; refresh (2 hours)
900 ; retry (15 minutes)
1209600 ; expire (2 weeks)
86400 ; minimum (1 day)
)
As we can see from this example, cnn.com is using two separate authoritative providers (AWS Route 53 and UltraDNS). Since I’m interested in their NS record configuration, I’ll run the same command again, but query only the NS records:
$ dig +nocmd cnn.com NS +multiline +noall +answer
cnn.com. 3599 IN NS pdns2.ultradns.net.
cnn.com. 3599 IN NS ns-576.awsdns-08.net.
cnn.com. 3599 IN NS pdns1.ultradns.net.
cnn.com. 3599 IN NS pdns3.ultradns.org.
cnn.com. 3599 IN NS ns-1086.awsdns-07.org.
cnn.com. 3599 IN NS ns-47.awsdns-05.com.
cnn.com. 3599 IN NS pdns5.ultradns.info.
cnn.com. 3599 IN NS ns-1630.awsdns-11.co.uk.
cnn.com. 3599 IN NS pdns6.ultradns.co.uk.
cnn.com. 3599 IN NS pdns4.ultradns.org.

So as you can see, cnn.com is pointed to two different providers, with redundancy configured at each, and the NS TTL is one hour.
Another great tool I’m constantly using (which doesn’t require any command line) is the Google Admin Toolbox dig tool. It basically exposes the dig command from a web browser, running on Google’s infrastructure. It is very useful if you want to test your site from a different network.

NS Propagation

To find out whether a specific DNS server has been updated with your new DNS settings, dig lets you specify which DNS server you’d like to run your queries against instead of the default local resolver. To do so, just add the server after the @ sign in the dig command, for example:
$ dig @8.8.8.8 +nocmd cnn.com NS +multiline +noall +answer
To get a broad, real-world picture of your DNS changes and their propagation, I recommend using OpenDNS, which checks many of the top DNS servers in different countries; it lets you know if something is wrong and which areas of the world are not yet up to date with your desired configuration.
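If you prefer to script the check, here is a minimal sketch in the same dnspython style as the script earlier in this post; the resolver IPs are Google Public DNS and OpenDNS, and example.com is a placeholder domain. It asks each public resolver for your NS records and prints what it sees, so stale or mismatched answers stand out immediately.

import dns.exception
import dns.resolver

# Well-known public resolvers to compare against (Google Public DNS and OpenDNS).
PUBLIC_RESOLVERS = {
    'Google 8.8.8.8': '8.8.8.8',
    'Google 8.8.4.4': '8.8.4.4',
    'OpenDNS 208.67.222.222': '208.67.222.222',
}


def check_propagation(domain):
    for label, ip in PUBLIC_RESOLVERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        try:
            answers = sorted(r.to_text() for r in resolver.query(domain, 'NS'))
            print('{}: {}'.format(label, answers))
        except dns.exception.DNSException as exc:
            print('{}: query failed ({})'.format(label, exc))


if __name__ == '__main__':
    check_propagation('example.com')  # hypothetical domain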

Looking beyond DNS

It’s clear that companies are not yet thinking enough about resiliency in DNS, even though doing so can help solve some of the problems we face on the Internet today. Taking it one step further, there are many other areas that affect people’s day-to-day lives that could also benefit from resiliency.

For example, establishing redundant content delivery networks (CDNs) can vastly improve performance and provide more flexibility. Etsy is an example of a company that has done this by implementing CDNControl. Not only does running multiple CDNs eliminate a single point of failure, thus ensuring consistent service to consumers, but it also allows for balanced traffic and lower costs.

Redundancy is also vital in cloud providers like AWS, Microsoft Azure, or Google Cloud and in third-party APIs that users rely on, like Stripe or PayPal for making payments or Twilio, which connects Uber drivers with passengers. System failure should not be an option for these essential services, and bolstering effectiveness with resiliency would be a step in the right direction.

I hope that many companies have learned something from the massive Dyn attack last month, and that solutions for future attacks will focus less on security as the root cause of the failure and more on resiliency. Having a variety of server options available is crucial to maintaining consistent and reliable service.