
Microsoft’s VALL-E Generates Speech From Just 3 Seconds of Audio

But it could lead to a proliferation of deepfake voices.

Microsoft has unveiled VALL-E, an AI model that can generate speech audio from a sample just three seconds long.

VALL-E is capable of text-to-speech (TTS) synthesis from very little prior data and could be used for tasks such as speech editing and content creation when combined with other generative AI models like GPT-3.

Trained on 60,000 hours of English language speech from Meta’s LibriLight audio library, VALL-E mimics how the target speaker would sound when speaking a given text input. It can also maintain the emotion of the speaker in the sample audio.

VALL-E can be demoed via GitHub. According to the Microsoft researchers behind it, the model “significantly outperforms” other zero-shot TTS systems in terms of speech naturalness and speaker similarity.

One possible use for VALL-E could be to narrate audiobooks. Just last week, Apple published a series of audiobooks narrated by an AI voice via its Books app.

For Microsoft, VALL-E represents its latest foray into generative AI. The tech giant is already exploring ways to incorporate OpenAI’s ChatGPT into its Bing search engine and Office line of products.

VALL-E: How Does It Work?

Microsoft describes VALL-E as a neural codec language model. The model was trained on discrete codes derived from the LibriLight library.
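
To give a feel for what "discrete codes" means, here is a minimal toy sketch in Python. It is not Microsoft's codec or VALL-E's method: it assumes a made-up random codebook and simple nearest-vector lookup, and every name and number in it (the 8 kHz sample rate, the frame length, the codebook size) is hypothetical. The point is only that a waveform can be turned into a sequence of integers that a language model could then treat like text tokens.

```python
import math
import random


def make_frames(samples, frame_len):
    """Split a waveform into non-overlapping fixed-length frames."""
    n = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n)]


def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to the frame."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))


def encode(samples, codebook, frame_len):
    """Map each frame of audio to one integer code."""
    return [nearest_code(f, codebook) for f in make_frames(samples, frame_len)]


def decode(codes, codebook):
    """Rebuild an approximate waveform by concatenating codebook vectors."""
    out = []
    for c in codes:
        out.extend(codebook[c])
    return out


# Hypothetical setup: three seconds of a 220 Hz tone at 8 kHz, and a random
# codebook. A real neural codec learns its codebooks from data instead.
rng = random.Random(0)
SAMPLE_RATE = 8000
FRAME_LEN = 8
CODEBOOK_SIZE = 64

samples = [math.sin(2 * math.pi * 220 * t / SAMPLE_RATE)
           for t in range(3 * SAMPLE_RATE)]
codebook = [[rng.uniform(-1, 1) for _ in range(FRAME_LEN)]
            for _ in range(CODEBOOK_SIZE)]

codes = encode(samples, codebook, FRAME_LEN)   # list of integers
recon = decode(codes, codebook)                # rough reconstruction
```

In this toy version the three-second clip becomes 3,000 integers, one per frame. With a random codebook the reconstruction is poor; the sketch only shows the shape of the idea, not the audio quality a trained codec would reach.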

During the pre-training stage, the researchers scaled up VALL-E’s training data to make it “hundreds of times larger than existing (TTS) systems” such as CereProc’s CereVoice or ReadSpeaker, according to the research team behind the model.

“While advanced TTS systems can synthesize high-quality speech from single or multiple speakers, it still requires high-quality clean data from the recording studio. Large-scale data crawled from the Internet cannot meet the requirement, and always lead to performance degradation,” according to the paper’s authors.

“Because the training data is relatively small, current TTS systems still suffer from poor generalization. Speaker similarity and speech naturalness decline dramatically for unseen speakers in the zero-shot scenario.”

Continue reading this article on AI Business
