A flawed performance update caused many Microsoft services powered by its Azure cloud servers to go down for almost 11 hours earlier this week. Microsoft has now formally apologized for the outage, which occurred between Tuesday evening and Wednesday morning.
The outage hit services such as Xbox Live, leaving some Xbox One and Xbox 360 owners unable to play online. It also affected Visual Studio Online, MSN.com, and other services, and even reached Microsoft Band owners. Microsoft Azure Corporate Vice President Jason Zander stated that the problems began when the company released what was supposed to be a performance update for Azure Storage. He wrote:
"During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues. Once we detected this issue, the change was rolled back promptly, but a restart of the storage front ends was required in order to fully undo the update. Once the mitigation steps were deployed, most of our customers started seeing the availability improvement across the affected regions."
He added that a "limited subset of customers" might still be experiencing problems, and that Microsoft is working to fix the remaining issues for those users. The company has also pledged to make changes so that outages this long don't happen again. Those changes include making sure "deployment tools enforce the standard protocol of applying production changes in incremental batches is always followed."
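As a rough illustration of what "applying production changes in incremental batches" can mean in practice, here is a minimal Python sketch. It is not Microsoft's deployment tooling: the function names (`batched`, `rollout`) and the `apply`, `healthy`, and `revert` callbacks are hypothetical stand-ins, and a real system would also wait between batches and monitor far more than one health check.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def rollout(
    nodes: Sequence[str],
    batch_size: int,
    apply: Callable[[str], None],     # hypothetical: push the update to one node
    healthy: Callable[[str], bool],   # hypothetical: health probe for one node
    revert: Callable[[str], None],    # hypothetical: undo the update on one node
) -> bool:
    """Update nodes one batch at a time; stop and revert on the first failure.

    Returns True if every batch passed its health check, False otherwise.
    """
    updated: List[str] = []
    for batch in batched(nodes, batch_size):
        for node in batch:
            apply(node)
            updated.append(node)
        # Check the batch before touching any further nodes.
        if not all(healthy(node) for node in batch):
            # Undo everything applied so far, most recent first.
            for node in reversed(updated):
                revert(node)
            return False
    return True
```

The point of the batching is that a bad change only ever reaches the nodes in the batches processed so far; the rest of the fleet keeps serving traffic while the change is reverted.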
Do you think we depend too much on cloud services like Azure?