Image: Microsoft logo. Source: Daniel Rubino / Windows Central

What you need to know

  • Microsoft's AI infrastructure, which is codenamed Singularity, aims to reduce the cost of artificial intelligence.
  • Singularity allows hundreds of thousands of GPUs and AI accelerators to work together, which reduces wasted effort.
  • Microsoft has invested heavily in AI, including a $1 billion investment in OpenAI in 2019.

Microsoft is working to reduce the cost of artificial intelligence (AI) and the effort wasted when computing at a global scale. A recently published paper by Microsoft's Azure and Research teams discusses the company's AI service, which is codenamed Singularity. The paper, titled "Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads," breaks down Microsoft's work at a technical level.

"Singularity is a fully managed, globally distributed infrastructure service for AI workloads at Microsoft, with support for diverse hardware accelerators. Singularity is designed from the ground up to scale across a global fleet of hundreds of thousands of GPUs and other AI accelerators," explain Microsoft's Azure and Research teams in their paper. "Singularity is built with one key goal: driving down the cost of AI by maximizing the aggregate useful throughput on a given fixed pool of capacity of accelerators at planet scale, while providing stringent SLAs for multiple pricing tiers."

In layman's terms, Microsoft's Singularity lets hundreds of thousands of GPUs and AI accelerators work together. Singularity is a global infrastructure service designed to reduce wasted effort. It treats all devices within the infrastructure as a single cluster, which helps ensure that the devices are used to their full potential.

Singularity can also adapt to prioritize different workloads. "While opportunistically using spare capacity, Singularity simultaneously provides isolation by respecting job-level SLAs," says Microsoft. "For example, Singularity adapts to increasing load on an inference job, freeing up capacity by elastically scaling down or preempting training jobs."
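To make that behavior concrete, here is a minimal Python sketch of the idea Microsoft describes: when an inference job needs more capacity, training jobs are shrunk first and preempted only if shrinking is not enough. This is not Singularity's actual scheduler. The names TrainingJob, min_devices, and free_capacity are hypothetical, and the simple first-come ordering is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TrainingJob:
    name: str
    devices: int        # accelerators currently assigned (hypothetical field)
    min_devices: int    # smallest size the job can shrink to and keep running
    preempted: bool = False


def free_capacity(jobs, needed):
    """Free `needed` devices for an inference job; return how many were freed.

    Pass 1 scales training jobs down to their minimum size.
    Pass 2 preempts whole jobs, only if pass 1 did not free enough.
    Preempting a whole job may free more devices than requested.
    """
    freed = 0

    # Pass 1: elastic scale-down, taking only the spare devices above the minimum.
    for job in jobs:
        if freed >= needed:
            break
        if job.preempted:
            continue
        take = min(job.devices - job.min_devices, needed - freed)
        job.devices -= take
        freed += take

    # Pass 2: preemption, releasing everything a job still holds.
    for job in jobs:
        if freed >= needed:
            break
        if job.preempted or job.devices == 0:
            continue
        freed += job.devices
        job.devices = 0
        job.preempted = True

    return freed


jobs = [
    TrainingJob("train-a", devices=8, min_devices=4),
    TrainingJob("train-b", devices=4, min_devices=2),
]
freed = free_capacity(jobs, needed=5)
print(freed, [(j.name, j.devices, j.preempted) for j in jobs])
```

In this example the request for five devices is met by scaling down alone, so neither training job is preempted. A real scheduler would also weigh priorities and SLA tiers, which this sketch leaves out.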

In contrast to some other systems that require restarting from scratch following a failure, Singularity can resume a job from the point where it was interrupted. This greatly reduces wasted effort, as deep neural network (DNN) training jobs can take several weeks.
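The general technique behind resuming interrupted work is checkpointing, which the toy Python loop below illustrates. It is a sketch under stated assumptions, not a description of how Singularity captures job state, which this article does not cover. The train function, the JSON checkpoint file, and the fail_at parameter are all hypothetical and exist only to simulate an interruption.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_path, fail_at=None):
    """Run a toy training loop, resuming from ckpt_path if a checkpoint exists.

    Returns the final state and the number of steps executed in this run.
    """
    state = {"step": 0, "loss": 100.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # pick up where the last run stopped

    executed = 0
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated interruption")

        # Stand-in for one real training step.
        state["step"] += 1
        state["loss"] *= 0.9
        executed += 1

        # Write to a temporary file and rename it, so an interruption
        # mid-write cannot leave a corrupt checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)

    return state, executed


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "ckpt.json")
    try:
        train(10, path, fail_at=6)   # first run is cut off after 6 steps
    except RuntimeError:
        pass
    state, executed = train(10, path)  # second run resumes from step 6
    print(state["step"], executed)
```

The second run executes only the four remaining steps instead of all ten. For a job that runs for weeks, that difference is the wasted effort being avoided.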

Microsoft has invested heavily in AI over the years, including a $1 billion investment in OpenAI in 2019. An Azure system is one of the ten most powerful supercomputers in the world, as of November 2021. Azure systems are used for large-scale computing and machine learning.

As noted by ZDNet, Microsoft used the Singularity codename for an unrelated project in the past. That Singularity was a microkernel operating system.