Have you ever used a text-to-speech (TTS) system and been disappointed by the unnatural-sounding voice?

VALL-E, a new approach to TTS introduced in a recent research paper from Microsoft, promises to deliver much more realistic and personalized speech.

What makes it different from existing TTS systems? For one thing, VALL-E is a neural codec language model: it treats TTS as a conditional language modeling task rather than as continuous signal regression. It’s trained on discrete codes derived from a neural audio codec, which helps it generate more natural-sounding speech.
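To make "discrete codes" a little more concrete, here is a toy sketch of residual quantization, one common way a codec can turn a continuous value into a handful of integers. Everything in it is an assumption for illustration: the three tiny codebooks are made-up numbers, and a real neural codec learns far larger codebooks over vectors rather than single values.

```python
# Toy residual quantization: each stage picks the nearest codebook entry
# for whatever error the previous stages left behind.
# The codebooks are made-up scalars, not learned values from a real codec.
CODEBOOKS = [
    [-0.5, 0.0, 0.5],      # coarse stage
    [-0.2, 0.0, 0.2],      # finer stage
    [-0.05, 0.0, 0.05],    # finest stage
]


def encode(value):
    """Turn one continuous value into a list of discrete indices."""
    codes = []
    residual = value
    for book in CODEBOOKS:
        # Index of the entry closest to the remaining error.
        index = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        codes.append(index)
        residual -= book[index]
    return codes


def decode(codes):
    """Rebuild an approximate value by summing the chosen entries."""
    return sum(book[i] for book, i in zip(CODEBOOKS, codes))


frame = 0.37
codes = encode(frame)      # a short list of integers, one per stage
approx = decode(codes)     # close to, but not exactly, the original value
print(codes, round(approx, 2))
```

Once audio is represented as integers like these, predicting the next one looks a lot like predicting the next word in a sentence, which is why a language model can be used for the job.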

During pre-training, the training data is scaled up to an impressive 60K hours of English speech. That is hundreds of times more than existing TTS systems use, and it helps give VALL-E its in-context learning capabilities. With just a 3-second recording of an unseen speaker as an acoustic prompt, it can synthesize high-quality personalized speech. This is called zero-shot TTS, because the system can generate speech for an unseen speaker without any additional training.
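A quick back-of-the-envelope calculation shows how small that prompt is next to the training data. The frame rate and codebook count below are illustrative assumptions, not figures taken from this article; only the 3 seconds and the 60K hours come from the text above.

```python
FRAMES_PER_SECOND = 75   # assumed codec frame rate (illustrative)
NUM_CODEBOOKS = 8        # assumed number of quantizer stages (illustrative)
PROMPT_SECONDS = 3       # from the article
TRAINING_HOURS = 60_000  # from the article

# How many discrete tokens the acoustic prompt would contain.
prompt_frames = PROMPT_SECONDS * FRAMES_PER_SECOND
prompt_tokens = prompt_frames * NUM_CODEBOOKS

# How the prompt compares with the pre-training audio.
training_seconds = TRAINING_HOURS * 3600
prompt_share = PROMPT_SECONDS / training_seconds

print(prompt_frames, prompt_tokens, training_seconds)
```

Under these assumptions the prompt is a couple of thousand tokens, set against hundreds of millions of seconds of training audio. The speaker's voice is picked up from the prompt alone, with no retraining.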

The pipeline is different from previous TTS systems, too. Instead of going from phoneme to mel-spectrogram to waveform, VALL-E goes from phoneme to discrete code to waveform. This enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation in combination with other generative AI models like GPT-3.
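The phoneme to discrete code to waveform flow can be sketched as three small steps. This is only a data-flow illustration under loud assumptions: the phoneme table, codebook size, and every function name here are hypothetical, and the "model" is a simple deterministic placeholder rather than a trained network.

```python
PHONEME_IDS = {"HH": 0, "AH": 1, "L": 2, "OW": 3}  # made-up phoneme inventory
CODEBOOK_SIZE = 16      # made-up; real codecs use far larger codebooks
SAMPLES_PER_CODE = 4    # made-up upsampling factor


def phonemes_to_ids(phonemes):
    """Step 1: text side, phoneme symbols become integer ids."""
    return [PHONEME_IDS[p] for p in phonemes]


def generate_codes(phoneme_ids, prompt_codes, num_frames):
    """Step 2: stand-in for the language model.

    Emits one code at a time, each conditioned on the phonemes, the
    speaker prompt, and all codes generated so far.
    """
    context = list(phoneme_ids) + list(prompt_codes)
    generated = []
    for _ in range(num_frames):
        # Placeholder "prediction": a fixed function of the context.
        next_code = (sum(context) * 7 + len(context)) % CODEBOOK_SIZE
        generated.append(next_code)
        context.append(next_code)
    return generated


def decode_codes(codes):
    """Step 3: stand-in for the codec decoder, codes become samples."""
    waveform = []
    for code in codes:
        level = code / (CODEBOOK_SIZE - 1) * 2 - 1  # map code to [-1, 1]
        waveform.extend([level] * SAMPLES_PER_CODE)
    return waveform


phoneme_ids = phonemes_to_ids(["HH", "AH", "L", "OW"])
prompt_codes = [3, 9, 12]   # pretend codes from a short speaker recording
codes = generate_codes(phoneme_ids, prompt_codes, num_frames=6)
waveform = decode_codes(codes)
print(len(codes), len(waveform))
```

The point of the sketch is the shape of the pipeline: no mel-spectrogram appears anywhere, and the speaker prompt enters as a few extra tokens in the model's context.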

But the technology carries potential risks, such as spoofing voice identification or impersonating a specific speaker. To prevent misuse, the paper recommends that, if the model is generalized to unseen speakers in the real world, speakers approve the use of their voice and a synthesized speech detection model be included.

Overall, VALL-E shows great promise in delivering natural-sounding and personalized speech. As this technology advances, we can expect to see even more realistic TTS systems that can mimic human speech in all its nuances.

