Facebook has blamed a configuration mistake for the world-wide outage of its Facebook, Instagram, WhatsApp and other services on Monday.
“Our engineering teams have learned that configuration changes on the backbone routers that co-ordinate network traffic between our data centers caused issues that interrupted this communication,” the company said in a statement late Monday. “This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”
“Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear at this time we believe the root cause of this outage was a faulty configuration change. We also have no evidence that user data was compromised as a result of this downtime.”
The outage reduced the company to using Twitter to communicate with users.
“We’re aware that some people are having trouble accessing our apps and products,” Facebook said in a tweet Monday around 1 pm Eastern. “We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience.”
Facebook CTO Mike Schroepfer tweeted around 4 pm Eastern that “we are experiencing networking issues and teams are working as fast as possible to debug and restore as fast as possible.”
Around 7 pm Eastern the company tweeted that services were coming back online. That was reflected in a graphic from Downdetector.com, which showed significantly fewer reports of outages after that time.
This map from the site Downdetector.com shows service started to plunge around noon Eastern time on Monday.
There was immediate speculation online that the outage was somehow related to news in the past 24 hours of a whistleblower alleging on 60 Minutes that Facebook’s own research shows that it amplifies hate, misinformation, and political unrest to maximize profits over the good of the public.
The Bleeping Computer news service reported that the DNS servers for the three services were not responding, which would suggest a DNS configuration or server issue.
Johannes Ullrich, dean of research at the SANS Institute, said the routes directing traffic to Facebook’s IP address space have “disappeared. It is not clear why. But as a result, the Internet doesn’t know where to find Facebook’s IP address space. The most obvious symptom of this is that the Facebook DNS servers are no longer reachable. But even if they would work (or if you still have the addresses cached locally), there would be no way to reach the servers.
“It is not clear why the routes have disappeared. Often this is caused by configuration issues, and it can be difficult to fix them as the routers that are misconfigured may no longer be reachable remotely.
“At this point, there is no indication that this is part of an attack, but we do not know until we hear from Facebook as to what the root cause is.”
It appears that DNS and BGP (border gateway protocol) are the cause, said Mark Nunnikhoven, the distinguished cloud strategist at Lacework.
“The primary A and AAAA records for Facebook, Instagram, and WhatsApp have been removed from the system and aren’t responding to queries. This means that devices and apps can’t convert facebook.com to the appropriate IP address online.”
“Making things worse (and more interesting), there are reports that Facebook’s BGP (border gateway protocol) routes have been withdrawn. This is the protocol that helps match requests to the best path to the destination services.”
“I’m seeing some DNS resolution come back now,” he said around 2:30 pm Eastern, “but there’s still a lot broken here.”
While some think this is a DNS problem, it appears to be a BGP routing issue, said Andrew Wertkin, chief strategy officer of Canadian DNS solutions provider BlueCat Networks. The routes to Facebook’s networks “have been withdrawn.”
Interviewed shortly before 3 p.m. Eastern, Wertkin said there are four possible explanations:
–Facebook deliberately withdrew the BGP routes because of a massive cyberattack, so for protection took itself off the internet;
–Facebook suffered a catastrophic system or internal network failure;
–“someone” withdrew the routes on Facebook’s behalf, though Wertkin thinks that’s unlikely;
–or someone at Facebook made a configuration mistake.
“It looks like a DNS issue because the first thing that happens when you try to reach facebook.com or their other servers is the internet does a DNS lookup. But their DNS is failing because there’s no route to their DNS servers. Their DNS servers may be completely healthy,” he said.
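Wertkin’s point can be illustrated with a toy model. The Python sketch below is not how a real resolver or router works; the function names, the zone data and the addresses (taken from ranges reserved for documentation) are all assumptions made for illustration. It only shows that a name lookup fails when there is no route to the name server, even though the server itself still holds the right answer.

```python
import ipaddress


def has_route(routing_table, address):
    """Return True if any prefix in the routing table covers the address."""
    addr = ipaddress.ip_address(address)
    return any(addr in ipaddress.ip_network(prefix) for prefix in routing_table)


def resolve(name, routing_table, nameserver_ip, zone):
    """Toy lookup: the name server can only answer if it is reachable."""
    if not has_route(routing_table, nameserver_ip):
        return None  # the query never arrives, so no answer comes back
    return zone.get(name)


# Hypothetical data: a healthy name server with one record.
NAMESERVER_IP = "192.0.2.53"
ZONE = {"example.com": "198.51.100.10"}

routes_before = ["192.0.2.0/24", "198.51.100.0/24"]  # routes are announced
routes_after = []                                     # routes are withdrawn

print(resolve("example.com", routes_before, NAMESERVER_IP, ZONE))
print(resolve("example.com", routes_after, NAMESERVER_IP, ZONE))
```

In the second call the zone data is unchanged, yet the lookup returns nothing, which is why the failure looks like a DNS problem from the outside.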
Border Gateway Protocol (BGP) is the routing protocol for the Internet. It picks the most efficient routes for delivering internet traffic. The internet itself is made up of hundreds of thousands of networks known as autonomous systems (AS). As Cloudflare explains, each of these networks is essentially a large pool of routers run by a single organization. These belong to ISPs or other large organizations, such as tech companies (Facebook, for example), universities, government agencies, and scientific institutions. Each autonomous system wishing to exchange routing information must have a registered autonomous system number (ASN). Every autonomous system must be kept up to date with information regarding new routes as well as obsolete routes. This is done through peering sessions, in which each AS connects to neighboring ASes over a TCP/IP connection to share routing information. If that information is corrupted or withdrawn, the networks it describes can’t be found.
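The effect of announcing and withdrawing routes can be sketched in a few lines of Python. This is a simplified illustration, not an implementation of BGP: the `RouteTable` class and its methods are hypothetical, the prefix and AS number come from ranges reserved for documentation, and real routers also weigh path length and policy, which is left out here.

```python
import ipaddress


class RouteTable:
    """Toy table mapping announced prefixes to the AS that originates them."""

    def __init__(self):
        self.routes = {}  # ip_network -> origin ASN

    def announce(self, prefix, origin_asn):
        """Record that an AS has announced a route to this prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_asn

    def withdraw(self, prefix):
        """Remove a previously announced prefix, if present."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin ASN of the most specific matching prefix."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the destination cannot be found
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RouteTable()
table.announce("192.0.2.0/24", 64500)   # hypothetical network and ASN
reachable = table.lookup("192.0.2.1")   # origin AS is known

table.withdraw("192.0.2.0/24")          # the route is withdrawn
unreachable = table.lookup("192.0.2.1") # nothing left to match
```

After the withdrawal the address itself has not changed, but no neighbor knows which network to hand the traffic to.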
BGP hijacking can happen. Cloudflare notes that in April of 2018 attackers deliberately created bad BGP routes to redirect traffic that was meant for Amazon’s DNS service. The attackers were able to steal over $100,000 worth of cryptocurrency by redirecting this traffic to themselves.
More likely are accidents. In 2008 a Pakistani ISP attempted to use a BGP route to block Pakistani users from visiting YouTube. The ISP then accidentally advertised these routes to its neighboring autonomous systems and the route quickly spread across the Internet’s BGP network. This route sent users trying to access YouTube to a dead end, which resulted in YouTube being inaccessible for several hours.
UPDATE, 6:33 PM ET: Facebook services began to come back online after more than six hours. The company tweeted:
To the huge community of people and businesses around the world who depend on us: we're sorry. We’ve been working hard to restore access to our apps and services and are happy to report they are coming back online now. Thank you for bearing with us.
— Meta (@Meta) October 4, 2021
Updated from the original with comments from Johannes Ullrich, Andrew Wertkin and Mark Nunnikhoven. Stay tuned for updates as more information becomes available.