Meta Platforms' AI research arm has introduced Voicebox, a text-to-speech model with unusually broad capabilities. It can perform tasks such as editing, noise removal, and style transfer, even though it was not specifically trained for them.
Voicebox, trained using Meta's "Flow Matching" technique, can synthesize speech in six languages. The researchers' goal is to build a versatile model that handles text-guided speech generation tasks through in-context learning.
Trained on 50,000 hours of speech and transcripts from audiobooks, Voicebox learns to generate speech from text by predicting masked segments, using the surrounding audio and the transcript as context. The result is natural-sounding speech from a model that generalizes to tasks it was not explicitly trained on.
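The masked-prediction setup described above can be illustrated with a small sketch. This is not Meta's implementation: the function names, the feature shapes, and the choice to zero out masked frames are all hypothetical, and the noise-interpolation step assumes the optimal-transport path from the general Flow Matching literature rather than anything confirmed about Voicebox itself.

```python
import numpy as np

def make_infilling_example(features, transcript, mask_start, mask_end, rng, sigma_min=1e-5):
    """Build one toy training example for masked speech infilling.

    features: (frames, dims) array standing in for audio features (hypothetical).
    transcript: text for the whole utterance, passed through as context.
    """
    frames = features.shape[0]
    mask = np.zeros(frames, dtype=bool)
    mask[mask_start:mask_end] = True  # frames the model must predict

    # Context: the surrounding audio, with the masked segment zeroed out.
    context = np.where(mask[:, None], 0.0, features)

    # Flow matching step (assumed optimal-transport path): blend noise and
    # data at a random time t, and regress the velocity between them.
    t = rng.uniform()
    noise = rng.standard_normal(features.shape)
    x_t = (1.0 - (1.0 - sigma_min) * t) * noise + t * features
    target = features - (1.0 - sigma_min) * noise

    return {"x_t": x_t, "t": t, "context": context, "mask": mask,
            "transcript": transcript, "target": target}

def masked_loss(prediction, example):
    """Mean squared error computed only on the masked frames."""
    m = example["mask"]
    return float(np.mean((prediction[m] - example["target"][m]) ** 2))

rng = np.random.default_rng(0)
features = rng.standard_normal((100, 80))  # 100 frames, 80 dims (made up)
example = make_infilling_example(features, "hello world", 30, 60, rng)
```

A real model would take `x_t`, `t`, `context`, and the transcript as input and be trained to minimize `masked_loss`; at inference time, the same masking lets the model fill in, replace, or clean up a chosen segment while keeping the rest of the recording.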
Voicebox's versatility shines as it tackles various tasks beyond its training scope. It can bring speech to those who can't speak, customize voices for virtual assistants or game characters, perform style transfer, and even edit out mistakes or background noise.
An exciting application of Voicebox is voice sampling. It generates multiple speech samples from a single text sequence, allowing for synthetic data generation to train other speech processing models effectively.
While Voicebox shows considerable potential, ethical concerns about AI-generated content have led Meta to withhold its release. Meta acknowledges the risk of misuse and unintended harm, and says it is publishing technical details, rather than the model itself, to help address these concerns and mitigate the associated risks.
Voicebox has its limits, struggling with conversational speech and fine-grained control over attributes like voice style and emotion. However, Meta's research team is actively exploring ways to overcome these limitations in future iterations.
As AI-generated content raises concerns about impersonation and manipulation, Meta is committed to responsible innovation. Meta shares technical insights on Voicebox's architecture and training process, along with a classifier model to detect AI-generated speech and audio.