Microsoft says a performance update caused Azure to go down for 11 hours this week

A flawed performance update was the reason why many Microsoft services powered by its Azure cloud servers went down for almost 11 hours earlier this week. Microsoft has now formally apologized for the outage, which occurred between Tuesday evening and Wednesday morning.

The outage hit services such as Xbox Live, leaving some Xbox One and Xbox 360 owners unable to play online. It also affected Visual Studio Online and other services, and even Microsoft Band owners. Microsoft Azure Corporate Vice President Jason Zander stated that the problems began when the company released what was supposed to be a performance update for Azure Storage. He wrote:

During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues. Once we detected this issue, the change was rolled back promptly, but a restart of the storage front ends was required in order to fully undo the update. Once the mitigation steps were deployed, most of our customers started seeing the availability improvement across the affected regions.

He added that a "limited subset of customers" might still be experiencing problems but that Microsoft is working to fix these issues for those users. The company has now pledged to make some changes so these long outages don't happen again. Those changes include making sure "deployment tools enforce the standard protocol of applying production changes in incremental batches is always followed."
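The idea of applying production changes in incremental batches can be illustrated with a minimal sketch. This is not Microsoft's deployment tooling; the function name `rollout_in_batches` and the `apply_update`, `is_healthy`, and `rollback` callbacks are hypothetical, chosen only to show how a rollout might stop at the first unhealthy batch instead of reaching every front end at once.

```python
from typing import Callable, List, Sequence


def rollout_in_batches(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],   # hypothetical: deploys the change to one host
    is_healthy: Callable[[str], bool],     # hypothetical: health probe for one host
    rollback: Callable[[str], None],       # hypothetical: reverts the change on one host
    batch_size: int = 2,
) -> List[str]:
    """Update hosts one batch at a time.

    After each batch, every host in that batch is health-checked. If any
    host is unhealthy, all hosts updated so far are rolled back and the
    rollout stops, so later batches are never touched.
    Returns the list of hosts left running the update (empty on failure).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    updated: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            apply_update(host)
            updated.append(host)

        # Gate the next batch on the health of the one just updated.
        if not all(is_healthy(host) for host in batch):
            for host in reversed(updated):
                rollback(host)
            return []

    return updated
```

With six front ends and a batch size of two, a fault that only shows up on the third host would halt the rollout after four hosts are touched and rolled back, leaving the last two untouched. A real system would also need to handle the case Zander describes, where rolling back required restarting the front ends.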

Source: Microsoft via ZDNet

John Callaham
  • I think this is a very positive and productive approach. Yes, every article has a question at the end to direct the discussion. Yes, there are those who find it condescending. I think it adds value by directing the conversation into something productive as opposed to the typical rants about why MS is failing by not providing a high-end Lumia. I do not think it is an indictment on us as readers.
    It isn't a big deal. Just a polite nudge of the conversation in a positive direction. I wasn't even affected by this. My Xbox One had issues in the past, like the controller failing to connect or Live signing in and out every 10 seconds, but I haven't seen that since the latest updates. This Azure outage didn't seem to hit me though.
  • It's services like Azure that help us to scale more effectively. My company moved to Azure in June. And we are quite happy with it.
  • Why wasn't this update applied incrementally?
  • Ya... They stated they would do this going forward. Not sure why it took this incident to realize that.
  • Yeah, not the first time they've had a big outage (Amazon and Google and every cloud provider has had similar if not bigger outages).  I was at a show, supposed to demo our software when they had a major outage.  I couldn't demo a damn thing, it was the worst show ever.
  • We have to depend on these services. Can someone list any viable alternative? The question is, whose services are the most reliable? I'm sure all of the other CSPs are happy to point this giant fumble out and try to push MS out of the top 5.
  • I don't depend on anything on the Internet. Outages like these are inconvenient, but I live most of my life in the physical world.
  • Ron Swanson? Is that you?
  • My life revolves around the internet and yes it affects me.
  • all the big cloud providers have had big outages.  Amazon was down for almost 3 days two years ago.
  • Ya. That's interesting. Expectations are getting higher and higher...meaning minor outages today are becoming as painful as larger outages even just a year ago. With more and more people and devices connecting to the internet and these services each day, they must become freaky-reliable.
  • Cloud providers are freaky-reliable. The problem is that most apps don't really rely on all the reliability tools at their disposal because it costs money. Brazil and Australia were OK. If your app falls back there, you're OK. Overall the cloud gives a false sense of security. That 0.001% downtime is there for a reason. When an app that relies on, say, East US and East US 2 goes offline because nobody had a backup plan for the backup plan, it shows the app creator either overpromised reliability to its customers or simply needs to do more to ensure a reliable service. The best advice I've heard on cloud utilization is to, well, utilize it. And yes, it will be more expensive the more you do this. And yes, reliability is not cheap. If it was, cloud providers wouldn't be in business.
  • This totally reminds me I need to backup my hard drive. I have the storage but it's literally been 2 years. One hiccup and nearly a terabyte of music could be gone.
  • For most music, there is a great recovery tool called BitTorrent ;)
  • Both AWS and Azure advertise 99.99% availability (uptime).
  • I think this was just a fluke and that people should not expect networks of this size to be perfect. It takes time.
  • I guess you could say that it had... performance anxiety. Where are my sunglasses?
  • Oh noes, affected Band users!!! How were they able to sleep???
  • hahaha. I know right?! You aren't sleeping unless you know what your resting heart rate is! I have a band and like seeing the statistics on my "actual sleep time".. but I'm really looking forward to intelligent alarm settings to wake me when I'm most ready. #FirstWorldProblems
  • Not a big deal to honker about I guess... These kinds of outages should be expected with every network provider on some rare occasions...
  • My server still down since failure :(
  • I'm sure somebody is looking for a job today :)
  • Perhaps this explains all the errors I've been getting with my Xbox One.
  • My Band worked just fine in Texas. Also, I'm an IT manager and this is no different than when unexpected things happen within the organization, so I'm good with it.
  • My Band sync to the Health service failed Tuesday at 11 pm EST.
  • Sounds like ops messed up, flighting beta not matching prod... When will we learn!!!!
  • I guess testing was never part of the plan. 
  • They stated they tested first even on live sites with no issues.
  • Well, then, it was poor testing.
  • Given the 11 hours cuts into the 99.99% reliability, are they auto crediting everyone's account this month?
  • MS. Another day.. Another problem :P
  • Someone got fired. That's for sure.
  • Well, there goes the 99.999999999999% availability SLA.. /s
  • What in the world is "flighting", as in "which had gone undetected during flighting"?
  • Got a bit screwed over with Azure. My Windows and Windows Phone game was down for a day and is still experiencing some issues. Almost no ad revenue for 2 days.