After hours of disruptions for services such as project management tool Trello, news website Business Insider and image hoster Giphy, it turns out Amazon Web Services' (AWS) outage on Tuesday was caused by the simplest of errors: a typo.
AWS experienced a massive outage of its S3 storage service on February 28th.
Furthermore, one of the affected services was Amazon's own status page, which for most of the outage showed that everything was running smoothly, even though around 20% of all Internet sites were affected, according to an estimate by Shawn Moore, CTO at Solodev.
In a statement Thursday, Amazon said that an authorized employee incorrectly entered a command meant to take a small number of servers offline. One of the parameters for the command was entered incorrectly, and it took down a large number of servers that support a pair of critical S3 subsystems.
The index subsystem manages the metadata and location information of all S3 objects in the region, and the placement subsystem "manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate". Engadget reports it took Amazon "longer than expected" to get the problem fixed because it was the first time some servers had been restarted in "many years".
After the servers were accidentally taken offline, the various systems had to do "a full restart", which apparently takes longer than it does on your laptop. This poor guy or gal probably made an errant keystroke that crippled AWS for at least four hours.
This caused a cascade of failures that forced the entire S3 system to reboot... and cost companies in the S&P 500 index $150m.
To help diagnose an issue that was being debugged, the engineer tried to take a small number of S3 servers offline, but the typo resulted in a far larger shutdown.
"We want to apologize for the impact this event caused for our customers", the company said.
Looking to avoid a similar snafu, AWS said Thursday it is adding safety checks and ways to improve recovery times. The tool will now remove capacity more slowly, with safeguards in place to prevent capacity from being removed when doing so would take any subsystem below its minimum required level.
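To make the two safeguards concrete, here is a minimal sketch in Python of how such a check might look. It is not Amazon's tool: the names Subsystem, remove_capacity, min_required and max_per_step are all hypothetical, and the numbers are invented for illustration. The idea is simply that a request is refused outright if it would leave a subsystem below its floor, and that an accepted request is carried out in small batches rather than all at once.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a removal request would breach the safety floor."""


@dataclass
class Subsystem:
    # All fields are hypothetical; they are not taken from any AWS tooling.
    name: str
    active_servers: int
    min_required: int  # floor below which the subsystem cannot operate


def remove_capacity(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> list[int]:
    """Remove `requested` servers in batches of at most `max_per_step`.

    Returns the list of batch sizes that were applied. Nothing is removed
    if the full request would take the subsystem below `min_required`.
    """
    if requested <= 0 or max_per_step <= 0:
        raise ValueError("requested and max_per_step must be positive")

    # Safeguard 1: check the whole request against the floor before acting,
    # so a mistyped (too large) number is rejected instead of executed.
    if subsystem.active_servers - requested < subsystem.min_required:
        raise CapacityRemovalError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, a small batch at a time.
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(max_per_step, remaining)
        subsystem.active_servers -= step
        batches.append(step)
        remaining -= step
    return batches


index = Subsystem(name="index", active_servers=10, min_required=6)
print(remove_capacity(index, 3))  # [2, 1]
```

With these made-up figures, asking to remove 3 servers succeeds in two batches, while a mistyped request for 30 would raise CapacityRemovalError and leave every server running. A real tool would presumably also pause and check subsystem health between batches, which this sketch leaves out.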