Amazon Web Services (AWS) suffered a major outage of its S3 storage service on February 28th. The company has today issued an official statement apologizing for the incident and attributing it to human error. According to the statement, the hours-long outage, which knocked many web services offline or left them behaving erratically, was caused by a mistyped command entered while debugging the service.
Let’s start from the beginning and trace how the outage unfolded. The S3 billing system was running more slowly than expected, so an S3 team member set out to debug it by taking a small number of servers offline in one of the S3 subsystems used by the billing process. Amazon says this step came straight from an “established playbook,” meaning it is standard procedure for such an issue.
This is where the employee made the error. One of the inputs to the command was entered incorrectly, so it removed a larger set of servers than intended. That made matters worse by setting off a chain reaction.
The mistake took two more important S3 subsystems offline in US-EAST-1, AWS’ oldest region. This in turn affected other services that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2), AWS Lambda and others. Amazon explains in a blog post:
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process.
Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.
To get the systems back up and running, the company had to perform a full restart of the affected subsystems. AWS notes, however, that these systems in its larger and older regions had not been completely restarted in years. The company designs its infrastructure to tolerate the removal or failure of significant capacity with little or no customer impact, so a full restart is rarely needed. As a result, the restart and its necessary safety checks took longer than expected. AWS adds that the index subsystem was fully recovered by 1:18 p.m. PT and that S3 was operating normally by 1:54 p.m. PT.
Because its tooling had no safeguard against an error of this kind in its core cloud service, AWS has decided to make ‘several changes’ to the platform. It said that removing capacity is a standard operational practice, but that the tool in question allowed too much capacity to be removed too quickly.
The tool has now been modified to remove capacity more slowly and to block any removal that would take a subsystem below its minimum required capacity. This should prevent an incorrect input from triggering changes on such a scale in the future. The blog post describes the change as follows:
While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.
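To make the two safeguards concrete, here is a minimal Python sketch of how a capacity-removal tool might enforce them. It is an illustration under stated assumptions, not Amazon's actual tooling: the names `Subsystem`, `plan_removal`, `min_required` and `max_per_step`, and all the numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


class CapacityError(Exception):
    """Raised when a removal request would breach the capacity floor."""


@dataclass
class Subsystem:
    # Hypothetical model of a subsystem; real thresholds are not public.
    name: str
    active_servers: int
    min_required: int


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> List[int]:
    """Return the batch sizes in which `requested` servers would be removed.

    Safeguard 1: reject the whole request if it would leave the subsystem
    below its minimum required capacity.
    Safeguard 2: split an accepted request into small batches so that
    capacity is removed gradually rather than all at once.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if max_per_step <= 0:
        raise ValueError("max_per_step must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers would leave {subsystem.name} with "
            f"{remaining}, below its minimum of {subsystem.min_required}"
        )

    batches: List[int] = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches
```

With a hypothetical subsystem of 100 servers and a floor of 90, a request to remove 5 servers would be planned as batches of 2, 2 and 1, while a mistyped request for 50 would be rejected outright instead of being executed.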
As a result of the outage, a number of prominent web platforms went down or ran slower than usual. These included big names such as Quora, Twitch, Kickstarter, Slack, Business Insider, Expedia, and Atlassian’s Bitbucket and HipChat, among others.