The 10 worst cloud outages (and what we can learn from them)

Sending your IT business to the cloud comes with risk

As a concept, there’s a lot to like about the cloud. Drop those bulky servers and get yourself a big, white hard drive in the sky. Someone else handles the upkeep and lets you put your data where you want it. Even the word “cloud” itself brings to mind a heavenly (if slightly fluffy) fantasy.

The reality is, of course, a mixed bag. What you gain in avoiding upkeep, you lose in control. And the security concerns are considerable. But nowhere is the nightmare as vivid as it is when your cloud service goes down.

To help keep your business pain-free in the cloud, we offer these hard-earned lessons from 10 of the worst cloud storms the Web has weathered.

by JR Raphael, InfoWorld

Colossal cloud outage No. 1: Amazon Web Services goes poof

Freeing yourself from network maintenance gruntwork is a chief selling point for doing business in the cloud. The downside? Standing by helplessly when your cloud vendor’s routine configuration change grinds your business to a halt.

That is what many AWS customers experienced this past April, when Amazon’s Northern Virginia data center suffered a glitch and — to use the technical term — went totally nutso.

The error started during a network upgrade, when a misrouted traffic shift sent a cluster of Amazon EBS (Elastic Block Store) volumes into a remirroring storm, as they sought out available boxes into which they could insert backups of themselves — perverse, I know. That set off a series of events that ultimately took down much of the company’s U.S. East Region.

The problems persisted for about four days. But while many businesses struggled, others such as Netflix took the storm in stride. The key to survival? Designing your systems with these types of failures in mind.

Colossal cloud outage No. 2: The Sidekick shutdown

Smartphones make it easy to access your data on the go, but just because something has “smart” in its name doesn’t mean it can’t be dumb. Case in point: the T-Mobile Sidekick screwup.

Remember this fiasco? The Microsoft-owned Sidekick suffered a nearly week-long service outage that left users without access to email, calendar info, and other personal data. Then, adding insult to injury, Microsoft confessed it had completely lost the cloud-stored bits and wouldn’t be able to restore them. Evidently, the good ol’ gang from Redmond had forgotten to make backups.

The technology may have evolved since then, but the lesson remains the same: When it comes to crucial data, never assume someone else is automatically protecting you. Make sure you understand your cloud provider’s disaster recovery setup — better yet, make your own arrangements to back up your important data independently.

“The same operational rules apply even in the cloud,” says Ken Godskind, vice president of monitoring products for AlertSite. “(You) can’t just assume that because it’s in the cloud, all the responsibility for business continuity planning has somehow been transferred to the provider.”
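
For readers who want to see what "make your own arrangements" might look like in practice, here is a minimal Python sketch of an independent backup step. It copies a file to a location you control and confirms the copy matches the original by comparing SHA-256 checksums. The function names, the file name, and the "offsite" folder are hypothetical, chosen only for illustration; a real setup would likely write to separate storage and run on a schedule.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copies contents and file metadata
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} failed verification")
    return target


if __name__ == "__main__":
    # Hypothetical demo: back up a small exported file to an "offsite" folder.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "contacts.csv"
        source.write_text("name,email\n")
        print(backup_and_verify(source, Path(tmp) / "offsite"))
```

The checksum only proves the copy matches what you exported at that moment. It says nothing about whether the provider's own backups exist, which is exactly why the independent copy is worth having.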

Colossal cloud outage No. 3: Gmail fail

Of all cloud services, Google’s Gmail presents one of the more likely threats to Microsoft’s on-premises stranglehold on the enterprise. Replace your high-maintenance Exchange servers with a cheap, dependable email service backed by Postini. What’s not to like?

A rash of irksome outages, the most recent of which had 150,000 Gmail users signing into their accounts only to find blank slates — no emails, no folders, nothing that indicated they were actually looking at their own inboxes. To Google’s credit, it provided regular updates and promised a quick fix. But repairs took as long as four days for some of the affected users.

Google ended up having to turn to actual physical tape backups in order to restore the data. Ultimately, the company’s multilayered data protection did work, but not without leaving thousands of users locked out of their email for days.

Is that a reason to run, arms flailing, away from anything cloud-connected? Probably not. But it is a reason to look carefully at your own data safeguards and think about setting up a backup or offline-access solution now, before an urgent need arises.

Colossal cloud outage No. 4: Hotmail’s hot mess

Of course, Microsoft hasn’t always provided the greatest advertisement for its big push for the cloud, either. Witness Microsoft’s Hotmail service, which experienced database errors of its own at the end of 2010, resulting in tens of thousands of empty inboxes at the turn of the new year.

The error, according to Microsoft, stemmed from a script that was meant to delete dummy accounts created for automated testing. The script mistakenly targeted 17,000 real accounts instead.

It took Microsoft three days to restore service for most of those users. An unlucky 8 percent of affected emailers had to wait an extra three days before their data was back where it belonged.

Even Clippy couldn’t smile through a headache like that.

Colossal cloud outage No. 5: The Intuit double-down

Intuit hit a rough patch last year when its cloud-connected services, including popular platforms like TurboTax, Quicken, and QuickBooks, went offline twice within a single month. The worst case was a 36-hour outage in June. A power failure evidently caused things to go haywire, with the company’s primary and backup systems getting knocked completely off the grid.

It only added insult to injury, then, when another apparent power failure hit Intuit weeks later. Among other issues, the second outage appeared to cause an abnormally high rate of obscenity-laden shouting.

“Twenty-five hours downtime is hard to swallow,” one user tweeted at the time. “Passive, opaque and stiff communication from Intuit didn’t help.”

Ouch. “The truth is, there are better solutions than a single cloud if you need absolute availability,” says Chris Whitener, chief strategist of HP’s Secure Advantage program. “It’s not necessarily that you have to duplicate everything, but even putting one extra step in there — maybe backing up crucial data yourself — can make all the difference.”

Colossal cloud outage No. 6: Microsoft’s BPOS oops

It’s hard to be productive when your cloud-based productivity suite bites the virtual dust. That’s what happened to organizations relying on Microsoft’s business cloud offering just weeks ago: The service, named — in true Microsoft style — Microsoft Business Productivity Online Standard Suite, started to stutter around May 10. Paying customers’ email was delayed by as much as nine hours as a result.

Two days later, just when it looked like BPOS was in the clear, the delay returned and outgoing messages started getting stuck in the pipeline, too. If that weren’t enough, Microsoft experienced a separate issue that prevented users from logging into its Web-based Outlook portal as well.

“I’d like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused,” Dave Thompson, corporate vice president for Microsoft Online Services, wrote in a blog post.

“I’d also like to apologize for the obvious inconvenience of having to speak 15 syllables every time you say our service’s ridiculous name,” he probably should have added.

Colossal cloud outage No. 7: The Salesforce slipup

An hour of downtime may not sound like much, but when your company holds the keys to the customer service operations of tens of thousands of businesses, more than a few of those organizations are bound to view those 60 minutes as a lifetime. Salesforce learned this the hard way when its data center shut down last January. Just four days into the new year, Salesforce reported a full-on failure: services, backups, the whole nine yards were kaput.

“The reality is that cloud-based data centers — guess what? — they go down, too,” says Tim Crawford, chief information officer of All Covered, a division of Konica Minolta.

It’s up to you, he suggests, to decide whether your business’s data can endure occasional downtime — and if not, to make sure your configuration has the resiliency needed to avoid it.

“When you pick a cloud provider, you need to do your homework to understand how they’re providing those services and if they’re able to build a level of redundancy as good or better than what you’re able to do on your own,” Crawford says. “If the answer is no, then why are you using them?”

Colossal cloud outage No. 8: Terremark’s terrible day

These days, Terremark may be making headlines for its billion-dollar Verizon deal, but in early 2010, an extended outage dominated the cloud provider’s coverage.

Terremark’s luck turned sour on St. Patrick’s Day, March 17, 2010. The company’s vCloud Express service took a nosedive that day, with a Miami-based data center going offline for about seven hours. Users were unable to access data stored in the center for the entire period.

Not to get overly redundant, but this brings up the value of redundancy — having your crucial data available on multiple servers in different data centers or, even better, different regions. You could also take the extra step of spreading it among different providers as a failsafe.

“You can pick a series of vendors to host a workload — one as a backup or two as a backup, and then another as your primary,” suggests Harold Moss, chief technology officer of IBM’s Cloud Security Strategy program. “You can then implement your workload there in a secure manner, with the appropriate security, and start to introduce your resiliency capabilities.”
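
The primary-plus-backups arrangement Moss describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor's API: each provider is represented by a plain function that either returns the data or raises an error, and the provider names and the two stand-in functions are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a provider is a (name, fetch function) pair.
Provider = Tuple[str, Callable[[str], bytes]]


class AllProvidersFailed(Exception):
    """Raised when the primary and every backup provider fail."""


def fetch_with_failover(providers: Sequence[Provider], key: str) -> Tuple[str, bytes]:
    """Try each provider in order and return the first successful result."""
    errors: List[str] = []
    for name, fetch in providers:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure means move on to the next vendor
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real vendor clients.
def primary_down(key: str) -> bytes:
    raise ConnectionError("data center offline")


def backup_ok(key: str) -> bytes:
    return b"payload for " + key.encode()


if __name__ == "__main__":
    providers = [("primary", primary_down), ("backup", backup_ok)]
    print(fetch_with_failover(providers, "orders"))
```

Ordering the list from primary to last-resort backup keeps the policy in one place. The sketch assumes the data has already been replicated to every provider in the list, which is the harder part of the job.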

Colossal cloud outage No. 9: The PayPal fall-down

Want a cloud outage with some seriously wide-reaching impact? Try taking PayPal offline for a few hours.

This is no hypothetical exercise: PayPal fell for real in the summer of 2009, leaving millions of merchants around the world with no way to sell their stuff. The service was completely unavailable for about an hour and remained spotty for several more. PayPal said hardware failure was to blame.

It’s a rare kind of outage, no doubt — but with all the sales lost, this unfortunate interruption easily earns a spot in cloud computing’s hall of shame.

Colossal cloud outage No. 10: Rackspace’s rough year

When you provide cloud services to Web presences like TechCrunch and Justin Timberlake, you’d better believe people are going to notice when your servers stop working.

Rackspace learned that lesson a few times in 2009. The cloud provider suffered four high-profile failures throughout the year, adding up to hours of offline time for the company’s customers. One blip was bad enough that Rackspace had to pay out nearly $3 million in service credits to its users.

Rackspace called the incidents “painful and very disappointing” and promised to “execute at a high level for a long time” after. Today, the company continues to focus on uptime but also works to help users plan for the inevitable turbulence that comes with life in the cloud.

“If you want to cluster a server or build geographical redundancy, it’s easier to do now than it ever was before, but you have to actually take those steps,” says Rackspace’s Lew Moorman. “The cloud doesn’t bring inherent weaknesses that weren’t present if you did things in-house before.”
