Cloudflare, Microsoft 365 suffer major outages

Two major IT providers suffered service problems this morning, causing CIOs and CISOs hours of grief.

A huge outage hit more than a dozen data centers run by content delivery provider Cloudflare, disrupting a large number of major websites. It began around 2:34 a.m. Eastern time, and the company reported it resolved about an hour and a half later.

Ironically, the problem was caused by Cloudflare making a change to increase its resiliency.

Meanwhile, the cloud-based Microsoft 365 service also reported outages. Around 6 a.m. Eastern the company tweeted that it was investigating complaints that some users were experiencing delays or connection issues when accessing the Exchange Online service. That expanded to the realization that multiple Microsoft 365 services were experiencing delays, connection problems and search issues. The fault was in the traffic management infrastructure “not working as expected,” the company said around 8 a.m. Eastern. “We’ve successfully rerouted traffic, and we’re seeing an improvement in service availability.”

In a blog post this morning Cloudflare officials said traffic in 19 of its data centers was affected. Unfortunately, those locations handle a significant proportion of the company’s global traffic.

“This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations,” officials said. “A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data center was brought back online and by 07:42 UTC all data centers were online and working correctly.”

“We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.”

Over the last 18 months Cloudflare has been trying to convert all of its busiest locations to a more flexible and resilient architecture, the company said. A critical part of this new architecture, which is designed as a Clos network, is an added layer of routing that creates a mesh of connections. This mesh allows Cloudflare to easily disable and enable parts of the internal network in a data center for maintenance or to deal with a problem.
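To illustrate why a mesh makes it easy to take part of a network out of service, here is a minimal sketch of a two-tier leaf-spine layout, a common simple form of Clos network. The node names, the fabric size and the helper functions are hypothetical and chosen only for illustration; nothing here describes Cloudflare's actual topology.

```python
from typing import Dict, List, Set


def build_fabric(leaves: List[str], spines: List[str]) -> Dict[str, Set[str]]:
    """Connect every leaf to every spine (hypothetical two-tier Clos fabric)."""
    links: Dict[str, Set[str]] = {node: set() for node in leaves + spines}
    for leaf in leaves:
        for spine in spines:
            links[leaf].add(spine)
            links[spine].add(leaf)
    return links


def reachable(links: Dict[str, Set[str]], start: str, disabled: Set[str]) -> Set[str]:
    """Return all nodes reachable from start while skipping disabled nodes."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in links[node]:
            if neighbour not in seen and neighbour not in disabled:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


# Illustrative names only.
LEAVES = ["leaf1", "leaf2", "leaf3"]
SPINES = ["spine1", "spine2"]
FABRIC = build_fabric(LEAVES, SPINES)

# With one spine disabled for maintenance, every leaf can still reach the others.
one_spine_down = reachable(FABRIC, "leaf1", disabled={"spine1"})

# With every spine disabled, leaf1 is isolated.
all_spines_down = reachable(FABRIC, "leaf1", disabled=set(SPINES))
```

In this toy fabric, disabling a single spine leaves all leaves connected through the remaining one, which is the property the mesh is meant to provide.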

Like other networks on the Internet, Cloudflare uses the Border Gateway Protocol (BGP). As part of this protocol, operators define policies that decide which prefixes (a collection of adjacent IP addresses) are advertised to peers (the other networks they connect to), or accepted from peers.

These policies have individual components, which are evaluated sequentially. The end result is that any given prefix will either be advertised or not advertised. A change in policy can mean a previously advertised prefix is no longer advertised, known as being “withdrawn”, and those IP addresses will no longer be reachable on the Internet.

While deploying a change to Cloudflare’s prefix advertisement policies, a re-ordering of terms caused the withdrawal of a critical subset of prefixes, which took the affected locations offline.
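The effect of term order can be sketched in a few lines of Python. This is a simplified model that assumes the first matching term decides the outcome and that unmatched prefixes are not advertised; the term names, the policy contents and the prefixes (taken from the address ranges reserved for documentation) are hypothetical and are not Cloudflare's real configuration.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, ip_network
from typing import Callable, List, Set


@dataclass(frozen=True)
class Term:
    """One policy component: a match test plus an action (hypothetical model)."""
    name: str
    matches: Callable[[IPv4Network], bool]
    action: str  # "accept" (advertise) or "reject" (do not advertise)


def is_advertised(policy: List[Term], prefix: IPv4Network) -> bool:
    """Evaluate terms in order; the first term that matches decides."""
    for term in policy:
        if term.matches(prefix):
            return term.action == "accept"
    return False  # assumed default: not advertised


def advertised(policy: List[Term], prefixes: Set[IPv4Network]) -> Set[IPv4Network]:
    """Return the subset of prefixes the policy would advertise."""
    return {p for p in prefixes if is_advertised(policy, p)}


# Documentation-only prefixes standing in for real ones.
CRITICAL = {ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")}
ALL_PREFIXES = CRITICAL | {ip_network("203.0.113.0/24")}

accept_critical = Term("accept-critical", lambda p: p in CRITICAL, "accept")
reject_rest = Term("reject-rest", lambda p: True, "reject")

original_policy = [accept_critical, reject_rest]
reordered_policy = [reject_rest, accept_critical]  # same terms, new order

# Prefixes advertised before the change but not after it are withdrawn.
withdrawn = advertised(original_policy, ALL_PREFIXES) - advertised(reordered_policy, ALL_PREFIXES)
```

Both policies contain exactly the same terms. Only their order differs, yet in this model the reordered policy rejects every prefix before the accepting term is ever reached, so the critical prefixes end up in `withdrawn`.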


Howard Solomon
Currently a freelance writer, I'm the former editor of ITWorldCanada.com and Computing Canada. An IT journalist since 1997, I've written for several of ITWC's sister publications including ITBusiness.ca and Computer Dealer News. Before that I was a staff reporter at the Calgary Herald and the Brampton (Ont.) Daily Times. I can be reached at hsolomon [@] soloreporter.com
