Companies around the world rely on Amazon Web Services (AWS) to form the backbone of their presence on the Internet. So when AWS suffers a problem, everybody notices.
On Tuesday, February 28, the Amazon Simple Storage Service (S3), which is the cloud storage part of AWS, was disrupted. Websites and online services started disappearing offline and spewing out errors to visitors. It took Amazon several hours to get a handle on the problem, but we now know the cause: a typo.
In its summary of the disruption, Amazon explained that the S3 engineering team was investigating an issue that was causing the S3 billing system to run slowly. To fix the problem, a small number of servers for one S3 subsystem needed to be taken offline. However, the command to take them offline was entered incorrectly, and far more servers were taken down than intended.
That in itself shouldn't have caused a major outage, but some of these additional servers supported two other S3 subsystems. One was the index subsystem, which handles metadata and location information for all S3 objects in the US-EAST-1 region. The other was the placement subsystem, which handles storage allocation for new S3 objects.
Both subsystems had to be restarted, and while that was happening the S3 APIs couldn't be accessed and other parts of AWS started to fail, including the Amazon S3 console, Elastic Compute Cloud (EC2), Elastic Block Store (EBS), and AWS Lambda. In short, it was a meltdown that took several hours to fix, all because of a mistyped command.
It should come as no surprise that Amazon is now going to make several changes to the way AWS operates to avoid this ever happening again. But it goes to show that no matter how big and robust a service becomes, it only takes one human with admin privileges to bring it all crashing down.