Microsoft unveils VALL-E, an advanced text-to-speech AI that can speak in anyone's voice based on a 3-second sample

By Sofia Elizabella Wyciślik-Wilson
Published 2 years ago

Microsoft has revealed details of its latest foray into the world of artificial intelligence. Billed as a "neural codec language model", VALL-E is an advanced AI-driven text-to-speech (TTS) system that the developers say can be trained to speak like anyone based on just a three-second sample of their voice.

The result is an incredibly natural-sounding TTS system that takes an entirely different approach from existing systems. Able to convey tone and emotion better than earlier systems, VALL-E sounds realistically human, but there are concerns that it could be used for audio deepfakes.

The AI has been built and trained using 60,000 hours of audio input from thousands of individuals, including public domain audio books. Working with a short sample, VALL-E is able to closely mimic the tone and timbre of a voice in a way that has simply not been possible previously.

Writing about VALL-E, a team of Microsoft researchers say:

"We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt."

The team goes on to say: "Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis".
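To make the researchers' description a little more concrete, here is a minimal toy sketch of what "TTS as a conditional language modeling task" over discrete codes might look like in outline. It is not Microsoft's code and contains no trained model: `toy_next_code`, `synthesize_codes`, the codebook size, the frame rate, and the single code per frame are all assumptions made for illustration. The only point is the data flow: text plus a short acoustic prompt go in, and codec codes come out one at a time, each conditioned on everything before it.

```python
import random

CODEBOOK_SIZE = 1024     # assumption: number of entries in the codec codebook
FRAMES_PER_SECOND = 75   # assumption: codec frames per second of audio


def toy_next_code(text_tokens, context_codes):
    """Stand-in for a trained model: pick the next codec code.

    A real system would run a neural network here. This toy seeds a
    pseudo-random generator from the conditioning, so the output is
    repeatable and depends on both the text and the recent audio context.
    """
    seed = repr((text_tokens, context_codes[-8:]))
    return random.Random(seed).randrange(CODEBOOK_SIZE)


def synthesize_codes(text, prompt_codes, num_frames):
    """Autoregressively extend the acoustic prompt, conditioned on the text."""
    text_tokens = [ord(ch) for ch in text]  # stand-in for phoneme tokens
    context = list(prompt_codes)            # starts as the speaker's prompt
    generated = []
    for _ in range(num_frames):
        code = toy_next_code(text_tokens, context)
        context.append(code)                # each new code joins the context
        generated.append(code)
    return generated


# A made-up 3-second "enrolled recording", already encoded as discrete codes.
prompt = [i % CODEBOOK_SIZE for i in range(3 * FRAMES_PER_SECOND)]
codes = synthesize_codes("Hello there", prompt, num_frames=FRAMES_PER_SECOND)
print(len(prompt), len(codes))  # prints: 225 75
```

In a real pipeline, a neural audio codec would presumably produce the prompt codes from the three-second recording and turn the generated codes back into a waveform; both steps are left out here.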

You can find out more over on the VALL-E demo page, where there are numerous samples of how it sounds based on various voice prompts.

Image credit: ra2studio / depositphotos
