VoiceCraft's Breakthrough in Speech Editing and Synthesis

The introduction of textless natural language processing (NLP) shifted the emphasis to training language models on sequences of learnable, discrete units rather than standard text transcripts. This strategy sought to apply NLP tasks directly to spoken language. In speech editing, such a model is expected to change words or phrases to match a target transcript while preserving the unedited portions of the original speech. The research community is presently working on a unified model that excels at both zero-shot text-to-speech (TTS) and speech editing, which would represent a substantial leap in the field.

A team from the University of Texas at Austin and Rembrand presents VOICECRAFT, a transformer-based Neural Codec Language Model (NCLM). VOICECRAFT produces neural speech codec tokens for infilling by conditioning autoregressively on bidirectional context, achieving state-of-the-art results in zero-shot TTS and speech editing. The model incorporates a novel two-stage token rearrangement procedure, consisting of causal masking and delayed stacking, that enables autoregressive generation with bidirectional context over speech codec sequences. This method is inspired by the causal masking mechanism employed in successful joint text-image models.
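The causal-masking idea can be illustrated with a small sketch: the span to be edited is replaced in place by a mask token, and its original tokens are moved to the tail of the sequence, so an autoregressive model observes both the left and right context before generating the infill. The token names and the exact arrangement below are illustrative assumptions, not VOICECRAFT's actual vocabulary or implementation.

```python
# Minimal sketch of causal-masking rearrangement for infilling.
# MASK marks the edited span in its original position and again before
# the relocated span; EOS terminates the sequence. All names are
# hypothetical placeholders for the real special tokens.
MASK, EOS = "<m>", "<eos>"

def rearrange(tokens, start, end):
    """Move tokens[start:end] to the tail, leaving MASK in their place,
    so generation of the span conditions on both surrounding contexts."""
    context = tokens[:start] + [MASK] + tokens[end:]
    return context + [MASK] + tokens[start:end] + [EOS]

seq = ["a", "b", "c", "d", "e"]
print(rearrange(seq, 1, 3))
# ['a', '<m>', 'd', 'e', '<m>', 'b', 'c', '<eos>']
```

At training time the model learns to continue the sequence after the second mask token; at inference time the relocated span is what gets generated, conditioned on everything before it, which already includes the right-hand context.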

To improve multi-codebook modeling, VOICECRAFT combines causal masking with delayed stacking. The model was evaluated on REALEDIT, a demanding and diverse dataset constructed by the researchers that includes real-world speech editing examples from audiobooks, YouTube videos, and Spotify podcasts. REALEDIT tests the model under a variety of editing scenarios, such as insertions, deletions, substitutions, and edits spanning multiple text spans. The dataset's variety of material, accents, speaking styles, and environmental noise makes it an effective tool for assessing the practicality of speech editing algorithms.
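Delayed stacking addresses the fact that a neural codec emits several codebook tokens per frame. A common formulation, sketched below under the assumption that codebook k is shifted right by k frames, lets the model predict codebook k's token for frame t-k at decoding step t, so later codebooks can condition on earlier ones for the same frame. The EMPTY padding value is a hypothetical placeholder.

```python
# Sketch of delayed stacking: given T frames of K codebook tokens each,
# shift codebook k right by k steps, yielding T + K - 1 stacked frames.
EMPTY = -1  # hypothetical padding token id

def delay_stack(codes):
    """codes: list of T frames, each a list of K codebook tokens.
    Returns a list of T + K - 1 frames with codebook k delayed by k."""
    T, K = len(codes), len(codes[0])
    out = [[EMPTY] * K for _ in range(T + K - 1)]
    for t in range(T):
        for k in range(K):
            out[t + k][k] = codes[t][k]
    return out

codes = [[1, 2], [3, 4], [5, 6]]  # T=3 frames, K=2 codebooks
print(delay_stack(codes))
# [[1, -1], [3, 2], [5, 4], [-1, 6]]
```

The transform is trivially invertible, so the original frame-aligned codec tokens can be recovered after generation by undoing the per-codebook shifts.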

In subjective human listening tests, VOICECRAFT surpassed previous models on both zero-shot TTS and speech editing without any fine-tuning, outperforming strong baselines including a replicated VALL-E and the commercial model XTTS v2. The model's edited speech closely resembles the original recordings, demonstrating its effectiveness.

However, the team acknowledges VOICECRAFT's limitations, which include occasional long silences followed by scratching sounds in generated speech. Furthermore, the arrival of capable models such as VOICECRAFT raises new challenges for AI safety, particularly in watermarking and detecting synthetic speech. The researchers have made their code and model weights publicly available to facilitate future research in AI safety and speech synthesis.

