Microsoft VALL-E shows a future of AI-powered speech that's fantastic and terrifying

Artificial intelligence promises to squeeze into the limelight in 2023. The vague term has spread across various forms of hardware and software shown at the recent CES convention in Las Vegas, with even more anticipated in the coming months.

Refreshed high-end gaming laptops such as the Lenovo Legion Pro boast the ability to intelligently manage internal components for maximum performance using machine learning powered by its Lenovo LA chip. Similar auto-adjustment tech is already present in other devices, making it a relatively unexciting prospect with an extravagant title.

Nevertheless, there's a different side to the pseudo-sentience of AI creeping over the horizon, one that can serve almost equal parts of usefulness and misbehavior if opened to the general public.

Until now, unconvincing and robotic

Using AI to synthesize human speech from training data is nothing new. Various companies have wrestled with the technology for years to develop something that sounds more natural and convincing to everyday consumers. Still, for the most part, the results wallow in the uncanny valley.

Perfectly exhibiting subtle nuances in our speech is complicated, no matter how fantastical your underlying tech is. We all speak in various languages divided into subtle accents and even differ in our cadence, and it's unlikely that two people would ever talk precisely in the same way.

It's part of why we've only heard synthesized speech used for entertainment, whether altering recorded voices in short-form videos or emulating a particular famous bodybuilder's accent for comedy dubs of popular movie scenes.

More focused applications in health and medicine afford a more profound use of this cutting-edge tech, helping those who have lost the use of their voice to speak naturally once again. Professor Stephen Hawking famously passed up the chance to replace his robotic voice with another, since the default setting in his DECtalk-based synthesis tech had already become a recognizable part of his identity.

If more comprehensive recordings of his younger, natural-speaking voice were available, he might have taken advantage of recent AI advancements, but nobody could say for sure besides the man himself.

Microsoft unveils VALL-E

Diagram of the VALL-E speech synthesizer — VALL-E (Image credit: Cornell University)

Trained on 60,000 hours of English speech data, a new AI synthesis tool named VALL-E was detailed in a research paper by Microsoft researchers, published on arXiv, the preprint server hosted by Cornell University. Its existence isn't particularly alarming, given that AI has become a significant focus for the company in recent years. The real eyebrow-raiser this time is how little input the system needs to produce surprisingly convincing results: a recording as short as three seconds is enough to generate entirely new messages, utterly unrelated to the original.

A demonstration of VALL-E on GitHub includes a wealth of audio samples for anyone to hear, ranging from stiff and unnatural to bordering on perfection. The machine-learning engine is currently unavailable to the general public, as opposed to comparatively rudimentary alternatives such as Uberduck, which realistically extends no further than acting as a fun plaything in its current state.

A single-paragraph ethics statement sits at the foot of the demo to explain that everyone involved in the experiment was willing and approved of the results, followed by an implied caution that this kind of technology should always be accompanied by the consent of all parties. Given that VALL-E generated such fascinating results with barely a sliver of reference data, the implications for its uses in the open world are complex.

The implication of impersonation

Daniel Rubino using a Lumia phone — (Image credit: Daniel Rubino | Windows Central)

My fascination with AI mimicking real-life humans tends to have me first imagining how it could enrich humanity. A more natural speech pattern could ease some aversion to robotic call operators or breathe new life into information panels in public spaces. Offloading general information provision jobs to humanized machines could mean skipping small talk for consumers if it can progress past the current state of shouting keywords at some basic software.

As much as I'd prefer speech synthesis to remain in the creative and humanitarian fields, the reality of it applying exclusively to generating audiobooks and comical meme content is overwhelmingly unlikely.

Even if Microsoft never publicly released the underlying workings of VALL-E, a competitor would undoubtedly build an equivalent given enough time.

Sadly, voice actors from my favorite childhood video games and TV shows continue to pass away, leaving a bleak realization that I'll never hear them perform in their iconic roles again. If creative talents agree to preserve their voices in the future, this kind of tech could see exciting applications, but the potential for misuse always lingers. Without strict guidelines and control, the probability of nefarious impersonations grows stronger with each iteration of voice synthesis.

This kind of back-and-forth consideration keeps me ambivalent about AI, always wondering how long it might take before generated voices become so convincing that they pose a real problem. Deepfake videos have already stirred up similar controversy, and an accompanying voice is practically the only missing piece to impersonate a person convincingly.

Treading carefully

Microsoft at MWC (Image credit: Future)

Again, Microsoft is no stranger to the possibilities of AI. With reported plans to augment Bing search results and the entire Office suite, it makes sense for the company to invest in developing tech and get a head start. It is exciting to see how it might grow within a company that produces the hardware and software I use daily. Still, there's always a lingering thought that it could eventually be adopted for unsettling purposes by nefarious individuals or groups.

I'm still a starry-eyed fan of technology, and advancements like these will always have me picturing how they can improve our everyday lives. Nonetheless, I have spent what feels like every single day of my adult life using the Internet and have seen how the intent of new software sometimes fails to align with the eventual usage.

Perhaps one day, my disembodied voice will read all my articles aloud, but I'll see what Microsoft has planned to make my Excel spreadsheets fancier for now. Maybe Cortana could even make a more talkative comeback; who knows?

Ben is a Senior Editor at Windows Central, covering everything related to technology hardware and software. He regularly goes hands-on with the latest Windows laptops, components inside custom gaming desktops, and any accessory compatible with PC and Xbox. His lifelong obsession with dismantling gadgets to see how they work led him to pursue a career in tech-centric journalism after a decade of experience in electronics retail and tech support.