Amazon Web Services brought down by a typo

After hours of disruptions for services such as project management tool Trello, news website Business Insider and image hoster Giphy, it turns out Amazon Web Services’ (AWS) outage on Tuesday was caused by the simplest of errors: a typo.

S3, Amazon’s popular web hosting and storage platform, crashed on Feb. 28 due to what the company called “high error rates,” but according to new information, an Amazon employee accidentally input the wrong command and took a larger number of servers offline than was intended.

“An authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” explains a Mar. 1 post on the AWS website. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

The removed servers supported two other S3 subsystems: the index subsystem, which manages the metadata and location information of all S3 objects in the region and is necessary for processing GET, LIST, PUT, and DELETE requests, and the placement subsystem, which relies on the index subsystem to allocate new storage.

Because these subsystems had to restart and took longer than expected to come back online, S3 in the affected eastern U.S. region could not serve requests in the meantime.

Amazon says it will make several changes to its operations as a result of the incident, including limiting how much capacity the tool that took down the servers can remove and improving the recovery time of key S3 subsystems.

“We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level,” the company says. “This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks.”
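The kind of safeguard Amazon describes can be illustrated with a minimal Python sketch. This is not Amazon's tool: the `Subsystem` class, the `remove_capacity` function, the `max_per_step` parameter and all numbers are hypothetical, chosen only to show the two ideas in the quote, a minimum-capacity check before anything is removed and removal in small batches.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal would breach a subsystem's minimum capacity."""


@dataclass
class Subsystem:
    # All fields are hypothetical; they only model the idea in the quote.
    name: str
    active_servers: int
    min_required: int  # floor below which the subsystem must not drop


def remove_capacity(subsystem, count, max_per_step=2):
    """Remove `count` servers in small batches, refusing unsafe requests.

    Returns the list of batch sizes that were removed.
    """
    if count <= 0:
        raise ValueError("count must be positive")

    # Safeguard: check the whole request before touching any server.
    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"removing {count} servers would leave {subsystem.name} with "
            f"{remaining}, below its minimum of {subsystem.min_required}"
        )

    # Rate limit: remove capacity slowly, at most max_per_step at a time.
    batches = []
    while count > 0:
        step = min(count, max_per_step)
        subsystem.active_servers -= step
        count -= step
        batches.append(step)
    return batches
```

Under these assumptions, a mistyped input such as 20 instead of 2 is rejected before any server is taken offline, because the check covers the whole request rather than each batch.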

Amazon apologizes for the inconvenience the outage caused for its customers, adding “we will do everything we can to learn from this event and use it to improve our availability even further.”


Mandy Kovacs
Mandy is a lineup editor at CTV News. A former staffer at IT World Canada, she's now contributing as a part-time podcast host on Hashtag Trending. She is a Carleton University journalism graduate with extensive experience in the B2B market. When not writing about tech, you can find her active on Twitter following political news and sports, and preparing for her future as a cat lady.
