What you need to know
- Microsoft published a preliminary root cause analysis of the major Azure outage earlier this week.
- The outage affected several Microsoft services, including Teams, Office 365, and Xbox Live.
- An issue related to the rotation of digital keys caused the outage.
Earlier this week, an Azure Active Directory outage caused several major Microsoft services to go down, including Teams, Xbox Live, and Office 365. The outage impacted hundreds of millions of people who rely on the services for work, education, and entertainment. Microsoft recently published a preliminary root cause analysis for the outage (via ZDNet).
Here is Microsoft's summary of the cause of the outage:
Preliminary Root Cause: The preliminary analysis of this incident shows that an error occurred in the rotation of keys used to support Azure AD's use of OpenID, and other, Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as "retain" for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that "retain" state, leading it to remove that particular key.
Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end users were no longer able to access those applications.
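To make the trust relationship in that statement concrete, here is a minimal Python sketch of how an application might decide whether to accept a token based on the published key metadata. It is an illustration only: the `Token` class, the `is_trusted` function, and the key IDs are hypothetical, and real applications verify a cryptographic signature rather than just comparing key IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a signed token or assertion."""
    subject: str
    kid: str  # ID of the key that signed the token


def is_trusted(token: Token, published_key_ids: set[str]) -> bool:
    # Simplification: a real application would also verify the signature.
    # Here we model only the "is the signing key still published?" check.
    return token.kid in published_key_ids


# Key IDs below are made up for illustration.
published = {"key-A", "key-B"}
token = Token(subject="alice", kid="key-B")

before = is_trusted(token, published)

# The metadata changes: key-B is no longer listed.
published_after = published - {"key-B"}
after = is_trusted(token, published_after)

print(before, after)
```

In this sketch the token itself never changes. It stops being accepted only because the key that signed it disappears from the published list, which mirrors the behavior Microsoft describes.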
The crux of the issue was a mishandled signing key. A key had been marked as "retain" for longer than normal, which exposed a bug that caused the automated cleanup to ignore the "retain" state. The key was removed when it shouldn't have been, and applications stopped trusting tokens signed with it.
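The class of bug is easy to picture with a short Python sketch. This is not Microsoft's code: the `SigningKey` class, the idle-time rule used to decide that a key is "no longer in use", and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SigningKey:
    """Hypothetical record for a signing key."""
    kid: str
    last_used: datetime
    retain: bool = False  # set when a key must be kept despite being idle


def keys_to_remove_buggy(keys, now, max_idle):
    # Bug being illustrated: the retain flag is never consulted.
    return [k.kid for k in keys if now - k.last_used > max_idle]


def keys_to_remove_fixed(keys, now, max_idle):
    # Idle keys are removed only when they are not marked as retained.
    return [k.kid for k in keys
            if now - k.last_used > max_idle and not k.retain]


# Arbitrary timestamp and made-up keys for the example.
now = datetime(2021, 1, 1, tzinfo=timezone.utc)
max_idle = timedelta(days=30)
keys = [
    SigningKey("current", now - timedelta(days=1)),
    SigningKey("stale", now - timedelta(days=90)),
    SigningKey("migration", now - timedelta(days=90), retain=True),
]

buggy = keys_to_remove_buggy(keys, now, max_idle)
fixed = keys_to_remove_fixed(keys, now, max_idle)
print(buggy)
print(fixed)
```

Under these assumptions, the buggy version selects the retained "migration" key for removal alongside the genuinely stale one, while the fixed version leaves it in place.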
Microsoft is working on a multi-phase effort to prevent these types of issues. Right now, Microsoft is in the second phase of that process. Microsoft explains the effort in the same post:
Azure AD is in a multi-phase effort to apply additional protections to the backend Safe Deployment Process (SDP) system to prevent a class of risks including this problem. The first phase does provide protections for adding a new key, but the remove key component is in the second phase which is scheduled to be finished by mid-year. A previous Azure AD incident occurred on September 28th, 2020 and both incidents are in the class of risks that will be prevented once the multi-phase SDP effort is completed.
Microsoft's post outlines the remaining steps, which should prevent this class of outage once finished. In the meantime, the company has put additional safeguards in place to help prevent outages until the second phase of its effort is complete.
Microsoft will publish a full root cause analysis of the outage once its investigation is complete.