Microsoft AI creates podcasts from text


What's happening today with Microsoft and AI, then? For once, it's not Copilot being stuffed into something. Instead, it's an interesting new open-source project called VibeVoice.

VibeVoice is an entirely different proposition, focusing on text-to-speech. That is, you enter a chunk of text, the model processes it, and outputs it in audio form with a human-sounding voice.

Video: "Microsoft VibeVoice TTS LOCAL Testing – A Multi-Speaker Podcast TTS!" (YouTube)

I also recommend watching the video above from Bijan Bowen, which really dives into VibeVoice and its capabilities.

There are multiple versions, with two available to test right now: a 1.5 billion parameter model and a 7 billion parameter model. The former can generate up to 90 minutes of audio with a 64k context window, while the latter has a smaller 32k context window and tops out at 45 minutes of audio. The output is presumably higher quality, though, given the larger model.
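To put those numbers side by side, here is a rough back-of-the-envelope sketch. It assumes that "64k" and "32k" mean 64 × 1024 and 32 × 1024 tokens, and that the whole context window is spent on audio, which ignores the text itself. Neither assumption comes from Microsoft, so treat the result as a ceiling rather than a spec.

```python
# Figures quoted in the article. The token counts assume "k" means 1024.
MODELS = {
    "1.5B": {"context_tokens": 64 * 1024, "max_minutes": 90},
    "7B": {"context_tokens": 32 * 1024, "max_minutes": 45},
}


def tokens_per_second(context_tokens: int, max_minutes: int) -> float:
    """Upper bound on tokens available per second of generated audio,
    assuming the entire context window is used for audio."""
    return context_tokens / (max_minutes * 60)


rates = {name: tokens_per_second(**spec) for name, spec in MODELS.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:.1f} tokens per second of audio")
```

Under those assumptions both models land on the same budget of roughly 12 tokens per second, so the 7B model's shorter 45-minute limit lines up with its context window being half the size.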

There will also be a third, lighter version at 0.5 billion parameters, which is designed for real-time audio generation.

If you're using these locally, per the video above, you can expect to use around 7GB of VRAM on the smaller model, and up to 18GB on the larger one. So, you could run the smaller one on plenty of GPUs. There's no need to have a monstrous local AI rig set up.
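If you want to sanity-check your own GPU against those figures, a minimal sketch might look like this. The helper name is hypothetical and the VRAM numbers are the approximate ones reported in the video, so real-world usage may differ.

```python
from typing import Optional

# Approximate VRAM use in GB, as reported in the video. Not official figures.
VRAM_NEEDED_GB = {"1.5B": 7, "7B": 18}


def largest_model_that_fits(available_vram_gb: float) -> Optional[str]:
    """Return the biggest model whose reported VRAM use fits, or None."""
    fitting = [
        name for name, needed in VRAM_NEEDED_GB.items()
        if needed <= available_vram_gb
    ]
    if not fitting:
        return None
    # Pick the candidate with the highest VRAM requirement, i.e. the larger model.
    return max(fitting, key=VRAM_NEEDED_GB.get)


for vram in (6, 8, 12, 24):
    print(f"{vram}GB card: {largest_model_that_fits(vram)}")
```

By this rough measure, an 8GB or 12GB card handles the 1.5B model, while the 7B model wants something in the 24GB class.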

Right now, VibeVoice is only trained on English and Chinese, but naturally you'd expect other languages to follow in future refinements.

So, how does it work? You can read the project page to drill down into the meat and potatoes, but the simple version is that you type, and it creates the speech.

VibeVoice can generate multi-speaker audio files, so it can create a full conversation. It can try to sing, but it's pretty hilariously bad at that right now. The voices included so far sound pretty good and fairly natural, though still clearly AI. There are plans to eventually allow for voice cloning, too.

It can also handle emotion and speak in multiple languages, though for now that only means English and Mandarin. The upper limit is a 90-minute podcast with four different AI speakers.
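As a rough illustration of those limits, here is a sketch that checks a multi-speaker script before you hand it to the model. The "Speaker N:" line format, the helper names, and the 150 words per minute speaking rate are all assumptions made for this example, not something taken from Microsoft's documentation, so check the project page for the real input format.

```python
import re
from typing import List, Tuple

MAX_SPEAKERS = 4        # limit quoted in the article
MAX_MINUTES = 90        # limit quoted in the article
WORDS_PER_MINUTE = 150  # assumed speaking rate, for a rough estimate only

# Assumed script format: one turn per line, e.g. "Speaker 1: Hello there."
LINE_PATTERN = re.compile(r"^Speaker (\d+):\s*(.+)$")


def parse_script(script: str) -> List[Tuple[int, str]]:
    """Split a script into (speaker number, text) turns."""
    turns = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between turns
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"Line has no speaker label: {line!r}")
        turns.append((int(match.group(1)), match.group(2)))
    return turns


def check_limits(turns: List[Tuple[int, str]]) -> float:
    """Raise if the script exceeds the quoted limits; return estimated minutes."""
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers, but the limit is {MAX_SPEAKERS}")
    words = sum(len(text.split()) for _, text in turns)
    minutes = words / WORDS_PER_MINUTE
    if minutes > MAX_MINUTES:
        raise ValueError(f"About {minutes:.0f} minutes, but the limit is {MAX_MINUTES}")
    return minutes


script = """
Speaker 1: Welcome back to the show.
Speaker 2: Thanks for having me. Shall we talk about text-to-speech?
"""
turns = parse_script(script)
print(f"{len(turns)} turns, roughly {check_limits(turns):.2f} minutes of audio")
```

The duration estimate is only as good as the assumed speaking rate, but it's enough to catch a script that's obviously too long or has too many voices.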

Of course, there are more uses for text-to-speech than throwing out 'AI slop' podcasts and video voiceovers, even if it's inevitable that will happen. Text-to-speech is a useful accessibility tool, for one.

To give it a try, I fed VibeVoice the opening snippet of text from this article, and asked for an output with a single speaker only. Aside from some random pings that I don't understand, it's decent enough. You can hear it for yourself in the embedded clip above.

It's a very basic test, and the VibeVoice project page has some more advanced examples, with more speakers and showing off both its English and Mandarin capabilities.

The potential is there, though. When the streaming audio version is up and running, it could be a worthwhile addition to chat assistants as well, without the need to rely on someone else's servers.

Find out more on VibeVoice, including how to set it up to run locally, at its GitHub repository or over on Hugging Face.

Richard Devine is a Managing Editor at Windows Central with over a decade of experience. A former Project Manager and long-term tech addict, he joined Mobile Nations in 2011 and has been found on Android Central and iMore as well as Windows Central. Currently, you'll find him steering the site's coverage of all manner of PC hardware and reviews. Find him on Mastodon at mstdn.social/@richdevine
