Microsoft's Azure (and other cloud platforms) are always reliable ... until they aren't

Gone are the days when personal computing revolved around our PC and its local hard drive. The internet, a multitude of interconnected personal computing devices and the cloud have made trusting our data to that "ephemeral space" between our devices much easier.

In the early days of personal computing, we hoarded our personal data on a variety of storage media, from hard drives and floppies to jump drives and optical discs. Much like a miser secured his money beneath a mattress in his home, we (individuals and businesses) secured our data the best we could. And just as misers began trusting banks with their money with the expectation that it would be there when they wanted it, we eventually began trusting our data to the cloud with the same expectation.

We expect to be able to access our data whenever, and from virtually wherever, we want. It's always supposed to be there. And most of the time it is. It's when we can't get to it, however, that really matters.

Lightning strikes Microsoft

Microsoft Logo (Image credit: Windows Central)

Earlier this month a lightning strike at one of Microsoft's Azure data centers in southern Texas caused a portion of the data center to transfer from utility to generator power. Consequently, the data center's cooling systems eventually failed. This caused excessive temperatures that ultimately damaged storage servers, network devices, and power units. As recovery workers strove to maintain customers' data integrity and to recover systems directly impacted by the storms, secondary effects of this particular outage affected a wide range of other cloud services in other regions. It took days to recover from the impact of this single incident.

The combined influence of the usual reliability of cloud services, the marketed assurance of their integrity and the technical infrastructure of various hardware and software redundancies and safeguards leaves many of us feeling that the cloud is infallible. In fact, companies like Microsoft, Amazon and Google communicate such a degree of confidence in their cloud offerings that infallibility seems inherent to the message. Realistically, though, we know that it is not. In practice, however, we live as if we believe that it is.

Many individuals, businesses and schools almost flippantly convey personal data and trust critical systems to the cloud (AKA computers that are miles away from them), where a third party maintains that information and hardware. Our only connection to that data is a vulnerable internet connection that is also maintained by someone other than ourselves. It's humbling and perhaps frightening to think that access to our data and some systems is as fragile as an internet connection.

Heads out of the cloud

Satya Nadella (Image credit: Windows Central)

Embracing an increasingly cloud-centric personal computing model requires multiple layers of trust in cloud providers like Microsoft and Google, and network providers like carriers and cable companies. And though the impact to customers was significant when Microsoft's cloud failed, it seems those invested in Google's ecosystem would be more severely affected if its cloud failed. With a model centered around cloud-based tools and a browser-based Chrome OS, Google seems particularly vulnerable (though limited offline work is possible) to a major impact if its cloud servers failed or connections to users were lost.

The cloud is fallible.

Still, Microsoft's model, and the industry's with 5G and edge computing, is becoming increasingly cloud-centric. And we're blissfully being swept along for the ride with all the accompanying risks and benefits. I am not arguing that the cloud is terrible. Nor am I advising that individuals and entities move away from the cloud in a full embrace of the old way of doing things.

I know that Microsoft has learned from this outage and is building additional safeguards, but any human-made system has vulnerabilities to environmental and human threats and is subject to failure. The recent outage, which was caused by a natural weather event and had an impact spanning multiple days, is a reminder to all of us that the cloud is, indeed, fallible.

What will you do?

We've all jumped on the cloud computing train and most of our experiences have been smooth and uneventful. This may be lulling us into a false sense of the cloud's reliability, however. Perhaps we should temper our embrace of the cloud with some commonsense notions of its susceptibility to failure. The events impacting Microsoft's Azure data center in Texas suggest that indeed we should.

Though it's said that lightning doesn't strike the same place twice, the truth is that it does. Thus, if a portion of Microsoft's data center can go down once, any cloud data center anywhere can be impacted by some form of outage. When that happens, what will you do?

Jason L Ward is a columnist at Windows Central. He provides unique big picture analysis of the complex world of Microsoft. Jason takes the small clues and gives you an insightful big picture perspective through storytelling that you won't find *anywhere* else. Seriously, this dude thinks outside the box. Follow him on Twitter at @JLTechWord. He's doing the "write" thing!

  • I guess that in the future there will be an always on, always connected Windows 10 Azure Edition. I'd love to see what happens then when there's a lightning strike at a data centre.
  • About the same as what'll happen if your power goes out for a few hours or days. While there's the potential for more people to be affected by a single outage for cloud services, that's an eminently solvable engineering problem, as you really just need enough redundancy built into the system. Power outages are still more common than widescale cloud service disruption, and are going to remain a bigger issue for access to services and the internet. About a decade ago, some tree branches falling on some power lines and some bad luck caused a cascade failure that took out the electrical grid for the entire north east US for several days. No part of the system is completely resilient. It's going to go down from time to time. Being able to make do until service is restored is just part of life. As long as outages are rare and fairly short lived it's really not an issue.
  • Yea. No different than your on-premises enterprise servers going down either. I have more faith in MS getting things back up than I ever had in my IT dept's abilities. They were great folks, but when you don't see weird problems all the time, you just don't have the depth of expertise that MS has.
  • Indeed. On-prem and Hybrid also have disadvantages but with the size and availability of Azure's global infrastructure compared to its competition, Azure is going to learn and advance further moving forward.
  • It's not so far away in the future, in fact it's a thing of the past already. You can already get thin clients that'll offer up an RDP connection to an Azure-hosted Windows 10 session if that's what you want to run with.
  • This is not the worst cause for cloud unreliability. Shutting down services due to whatever reasons you don't care about is.
  • I am keen on hearing from the voters how many lightning strikes they have experienced so far in their cloud services, and how many of the services they had been using have been shut down.
  • My favorite saying: "There is no cloud, it's just someone else's computer." I use OneDrive as a backup, and nothing else. I would never use cloud services as my only place to store anything. This is why.
  • You shouldn't use anything as the only place to store everything. All that has happened so far with cloud storage is a temporary inability to access it. Trusting just local drives can lead to a permanent inability to access it.
  • Exactly what I said, lol
  • They specifically offer hosting options where your data is stored at different locations. So although it sucks that they had these issues, they do offer options to safe-guard your data from events like this. It does seem like there are hosting facility design issues if their generator power isn't working.
  • I recall a similar blog entry several years back, it highlighted that the "BIG" 3 (Amazon, Microsoft & Google) all experienced outages in the same week & the gist was more that if they can spend billions & still not get it right, what are they expecting the regular IT shops to do? There is never "one way / only way" methodology. That said, having some items local for performance (large data sets, file services that tier to the cloud), split emergency services & DR between two cloud providers, leverage services like Amazon & Microsoft that allow for encryption & ownership of AES key (sorry Google you fail), these safeguards will permit continued best practice & financial realities of maintaining I.T. in this always on "usually" connected world.
  • I mean, if you store stuff in the cloud, there's bound to be some lightning eventually. That's where the lightning lives. :-)
  • Ok that was very corny...but it caused a chuckle nonetheless. 😃lol
  • If you aren't making hard backups of your data stored in the cloud, you have a terrible IT policy frankly. You definitely shouldn't be relying 100% on internet storage.
  • The outage wasn't so much about storage as it was services and apps that run from the cloud. We had PowerBI reports and dashboards that were unavailable. We have all the data on site as well, but people rely on being able to open an app on their phone or tablet and check a report - that service was not available for most of a day. There were other impacts as well, like our Office365 Admin portal being unavailable. The more critical services were back online pretty quickly though.
  • I would trust Microsoft's cloud storage much more than any local storage. You should definitely have both though and another off site backup.
  • I would have thought if the cooling system failed then the servers would shut down. That is not very good planning.
    Not that I would rely on the cloud for storage and thankfully I only have to use MS cloud once a week and that is too much.
  • While this was an unfortunate outage, if this had happened at my company's data center, we would probably still be down and rebuilding/recovering. MS had most services back up by the end of the day.
  • Wow, thanks for the info. Now I know cloud services are not 100% reliable.
  • For people like yourself and others that frequent tech sites, the vulnerabilities of cloud services may be something you are more cognizant of. For the millions of "regular" people for whom cloud-based services, products, apps, and AI are being seamlessly woven into new devices, OS updates and more, they may not realize these "features" of the products they use are not as reliable as they may appear. 😉
  • Why do MSFT or any company for that matter need your data? All of it?
  • Anyone exclusively relying on the cloud for systems critical to their business is a fool. Even non-critical systems should have local backups. That's just smart business; even those who maintain their own servers should have backups, obviously.
  • It's all about how you architect the cloud. We do H/A across a DC pair and then DR in a different region. Most businesses could not do that for themselves. While the IT folks noticed this outage, our users carried on uninterrupted. We are a Fortune 500 and have lots of mission critical stuff in the cloud. The plan is to put it all there.
  • Hey moron that wrote the article, stop writing 'us' as if it includes me. I use onedrive to store my data. I also store every bit on 2 different pcs. If that is not enough, I store that same data on 2 separate and local NAS. And in the event of an electromagnetic pulse, that data is also stored in hard drives I update once a month, stored in protective enclosures immune from water, fire, and EMPs. 'Us' means you and any dumbass that trusts the cloud.
  • Hi max. 'Us' is actually a general term that, in the context of the article, is simply inclusive of modern humans to whom the term applies. It is not meant, in the strict literal sense you seem to have received it in, to include "everyone." For instance, I wrote the article, and I store data on multiple external hard drives, multiple PCs, tablets and phones, and multiple jump drives and SD cards, as well as the cloud. So of course when I say 'us' in relation to individuals with heavy dependence on the cloud, I don't mean me😉 Also, intelligent discourse absent name calling and presumptions is encouraged 😉
  • Maxweltd, I hate to say it, but the World doesn't revolve around you, no matter how big your ego is.
  • Typical comment of someone who has no idea on how to actually use a cloud system properly and all failsafe mechanisms in place to keep things going real smooth. #Azure
  • Well, this is why for an HA system, you always have to use at least 2 geographically separated physical locations. Whoever does not do so does not care about high availability or is simply a design amateur.
  • AWS also had an outage that lasted most of a day not that long ago and took down a lot of sites, so this is not just an Azure issue. I wonder why the services for both these outages were down for so long. I thought the promise of the cloud was that these data centers were mirrored in real time to other data centers? Why would it take getting the down data center back online to restore services? I would think they would just point traffic to the mirrored data center.
  • It's funny how the world "stops turning" or is noticeably "turning slower" when stuff we use off the internet or cloud suddenly is not there or broken. I suppose it could have been a lot worse. I once saw a data center lose street power, and then its backup UPS failed before it could hand everything off to the generators. That right there was a mess.