Microsoft’s VALL-E Generates Speech From Just 3 Seconds of Audio

But it could lead to a proliferation of deepfake voices.

January 11, 2023

2 Min Read

Microsoft’s VALL-E Generates Speech From Just 3 Seconds of Audio

Microsoft has unveiled VALL-E: an AI model that can generate speech audio from just three-second samples.

VALL-E is capable of text-to-speech synthesis (TTS) off little prior data and could be used for tasks such as speech editing and content creation when combined with other generative AI models like GPT-3.

Trained on 60,000 hours of English language speech from Meta’s LibriLight audio library, VALL-E essentially mimics the target speaker and what they would sound like when speaking a desired text input. It can also maintain the emotion of the speaker in the sample audio.

VALL-E can be demoed via GitHub. According to the Microsoft researchers behind it, the model “significantly outperforms” other zero-shot TTS systems in terms of speech naturalness and speaker similarity.

One possible use for VALL-E could be to narrate audiobooks. Just last week, Apple published a series of audiobooks narrated by an AI voice via its Books app.

For Microsoft, VALL-E represents its latest foray into generative AI. The tech giant is already exploring ways to incorporate OpenAI’s ChatGPT into its Bing search engine and Office line of products.

VALL-E: How Does It Work?

Microsoft describes VALL-E as a neural codec language model. The model was trained on discrete codes derived from the LibriLight library.

During the pre-training stage, the training data used to build VALL-E was scaled up to make it “hundreds of times larger than existing (TTS) systems” like CereProc’s CereVoice or ReadSpeaker, according to the research team behind the model.

“While advanced TTS systems can synthesize high-quality speech from single or multiple speakers, it still requires high-quality clean data from the recording studio. Large-scale data crawled from the Internet cannot meet the requirement, and always lead to performance degradation,” according to the paper’s authors.

“Because the training data is relatively small, current TTS systems still suffer from poor generalization. Speaker similarity and speech naturalness decline dramatically for unseen speakers in the zero-shot scenario.”

Continue reading this article on AI Business

About the Author(s)

Ben Wodecki

Assistant Editor, AI Business

Ben Wodecki is assistant editor at AI Business, a publication dedicated to the latest trends in artificial intelligence.

See more from Ben Wodecki

AI Business

AI Business, an ITPro Today sister site, is the leading content portal for artificial intelligence and its real-world applications. With its exclusive access to the global c-suite and the trendsetters of the technology world, it brings readers up-to-the-minute insights into how AI technologies are transforming the global economy - and societies - today.

See more from AI Business

Related Topics

Recent in Cloud

Related Topics

Recent in OS

Related Topics

Recent in IT Mgmt

Related Topics

Recent in Career

Related Topics

Recent in Storage

Related Topics

Recent in Security

Related Topics

Recent in Dev

Related Topics

Recent in DX

Related Topics

Recent in Infrastructure

Related Topics

Microsoft’s VALL-E Generates Speech From Just 3 Seconds of Audio

VALL-E: How Does It Work?

About the Author(s)

Editor's Choice

Featured How Tos

Recent What Is

Related Topics

Recent in Cloud

Related Topics

Recent in OS

Related Topics

Recent in IT Mgmt

Related Topics

Recent in Career

Related Topics

Recent in Storage

Related Topics

Recent in Security

Related Topics

Recent in Dev

Related Topics

Recent in DX

Related Topics

Recent in Infrastructure

Related Topics

<span class="ArticleBase-LargeTitle">Microsoft’s VALL-E Generates Speech From Just 3 Seconds of Audio</span>Microsoft’s VALL-E Generates Speech From Just 3 Seconds of Audio

VALL-E: How Does It Work?

About the Author(s)

Editor's Choice

Featured How Tos

Recent What Is

Microsoft’s VALL-E Generates Speech From Just 3 Seconds of Audio