Amazon today announced the reason behind the huge internet outage that occurred as a result of the failure of its S3 web service. It seems that the well-publicised turmoil in its service provision was caused by a simple typo by one of Amazon's engineers.

The S3 service is part of the broader Amazon Web Services (AWS) group, a hosting service used by many of the world's leading websites, such as Giphy, Quora and Slack, alongside many thousands more. Engineers had noticed that the S3 servers had been operating rather sluggishly in recent times and assigned relevant staff members to identify and correct the problem. An attempt was made to correct the issue by taking a number of servers offline, an action that Amazon described as part of its 'established playbook'.

A turn for the worse

Sadly for Amazon and its customers, a typing error in entering the relevant commands meant that a larger number of servers were taken down than had been required, resulting in a service outage on such a large scale. Amazon stated that the extra servers taken out of action actually supported other parts of the S3 infrastructure.

Procedures at Amazon and AWS are naturally designed with redundancy in mind, allowing faulty servers to be taken offline without affecting the performance of the system as a whole. However, in this case, the mistake that caused a greater number of servers to be disabled led to major issues further along the line. Correcting the problem required all the S3 systems to be rebooted.

Reasons for delay

Attempting to reboot the affected servers exposed a further issue, which explains why it took Amazon so long to correct the outage.

Some of Amazon's servers had not been restarted for a number of years and, as a result, Amazon explained that 'the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected'. Ironically, one of the affected websites was Down Detector, a site that itself tracks downtime at major websites around the world.

The fundamental fact is that the S3 service has expanded at a rapid rate and procedures to manage any potential problems seem to be lagging behind. Over 150,000 websites use AWS overall and Amazon issued an apology to them all, promising to do better in the future.

The company went further, announcing that it had put measures in place to ensure human error would not be able to cause such wide-scale problems in future.
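
The announcement as reported here does not spell out how those measures work. Purely as an illustration of the general idea, a safeguard against a mistyped server count might look something like the sketch below. Every name and number in it (Subsystem, check_removal, the per-command limit of 5) is hypothetical and is not taken from Amazon's tooling.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would cut too deep into capacity."""


@dataclass
class Subsystem:
    # All fields are hypothetical; they only illustrate the idea.
    name: str            # label for the group of servers
    active_servers: int  # servers currently in service
    min_required: int    # floor below which the subsystem cannot operate


def check_removal(subsystem: Subsystem, requested: int, max_per_command: int = 5) -> int:
    """Return the servers left after removal, or raise if the request is unsafe."""
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Guard 1: cap how many servers a single command may remove,
    # so a mistyped count cannot take out a large block at once.
    if requested > max_per_command:
        raise UnsafeRemovalError(
            f"{requested} servers requested, limit per command is {max_per_command}"
        )

    # Guard 2: refuse any removal that would leave the subsystem
    # below its assumed minimum capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise UnsafeRemovalError(
            f"{subsystem.name} would drop to {remaining}, minimum is {subsystem.min_required}"
        )

    return remaining
```

Under these assumptions, an operator who meant to remove 5 servers but typed 50 would have the command rejected rather than executed, and a correctly sized request would still be refused if it left too few servers running.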