Microsoft's new Text-to-Speech voices are more 'realistic, lifelike, and engaging'

Microsoft Azure servers
(Image credit: Microsoft)

What you need to know

  • Microsoft recently introduced four "super realistic" Text-to-Speech voices designed for conversational scenarios.
  • They are en-US-AndrewNeural, en-US-BrianNeural, en-US-EmmaNeural, and zh-CN-YunjieNeural, and are available in public preview across three regions: East US, Southeast Asia, and West Europe.
  • Microsoft boasts that the new voices will complement "any application necessitating lifelike speech interactions."
  • The new voices are meant to make interactions more realistic and engaging.

With the exponential growth of AI and its capabilities across the world, there's a rise in the demand for "naturalness and expressiveness in Text-to-Speech voices," according to Microsoft. The company recently announced four new voices: en-US-AndrewNeural, en-US-BrianNeural, en-US-EmmaNeural, and zh-CN-YunjieNeural.

The tech giant indicated that the new voices are designed for conversational scenarios to ensure user interactions are "more realistic, lifelike, and engaging." The four new voices are available in public preview in three regions: East US, Southeast Asia, and West Europe. 

To demystify the difference between existing voices designed for general purposes and the new voices optimized for conversations, Microsoft also included several demos showcasing the different flavors of the newly incorporated voices. 

Microsoft explained that the voices can be integrated into existing applications in several ways: via Azure OpenAI, through the Azure Speech SDK or REST API, or by using the Azure Bot Framework to develop intelligent bots that speak with the new Text-to-Speech (TTS) voices.
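For a rough idea of what the REST API route might look like, here is a minimal Python sketch that only prepares a synthesis request for one of the new voices without sending it. The helper names (build_ssml, build_tts_request), the region identifiers (eastus, southeastasia, westeurope), the output format, and the placeholder key are assumptions made for illustration and are not taken from Microsoft's announcement; check the Azure Speech documentation before relying on any of them.

```python
import urllib.request
from xml.sax.saxutils import escape

# Assumed Azure region identifiers for East US, Southeast Asia, and West Europe.
PREVIEW_REGIONS = ("eastus", "southeastasia", "westeurope")


def build_ssml(text, voice="en-US-AndrewNeural"):
    """Wrap plain text in a minimal SSML document for the given voice."""
    # Derive the locale from the voice name, e.g. "en-US" from "en-US-AndrewNeural".
    locale = "-".join(voice.split("-")[:2])
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{locale}">'
        f'<voice name="{voice}">{escape(text)}</voice>'
        "</speak>"
    )


def build_tts_request(text, voice, region, key):
    """Prepare (but do not send) a POST request to the TTS REST endpoint."""
    if region not in PREVIEW_REGIONS:
        raise ValueError(f"{region!r} is not one of the assumed preview regions")
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # placeholder; use your own Speech key
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",  # assumed format
        "User-Agent": "tts-preview-sketch",  # hypothetical client name
    }
    return urllib.request.Request(
        url,
        data=build_ssml(text, voice).encode("utf-8"),
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    req = build_tts_request(
        "Hi there, how can I help?", "en-US-EmmaNeural", "eastus", "YOUR_KEY"
    )
    print(req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the prepared request to urllib.request.urlopen with a valid key should return audio bytes, but the sketch stops short of the network call so it can be inspected safely.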

Microsoft described its approach to building the voices: "We began by crafting the persona of each voice as if it were a real person who is friendly and optimistic about life, always eager to assist others and share intriguing or practical knowledge. The speaking style of the voice resembles a conversation with an acquaintance over a cup of tea, maintaining a natural and unexaggerated tone. Furthermore, we continuously enhance our Text-to-Speech (TTS) modeling techniques to improve the quality of our AI voices. Our most recent projects, such as DelightfulTTS 2, and MuLanTTS, have significantly narrowed the quality gap between AI voices and professional human recordings, producing more natural and realistic voices than ever before. These technological advancements serve as the foundation upon which these new AI voices are built."


Adding a natural and expressive touch

AI has seen both wins and setbacks, with the balance tilting toward the latter. Several reports have indicated that chatbots are getting dumber and are also experiencing a decline in accuracy and user base.

Perhaps the debut of the new voices will positively impact this trend. Microsoft "offers over 400 neural voices covering more than 140 languages and locales," and those figures seem likely to expand over time.

Kevin Okemwa

Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. You'll also catch him occasionally contributing at iMore about Apple and AI. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.