VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
Abstract
With the aid of discrete neural audio codecs, large language models (LLMs) have emerged as a promising approach for zero-shot text-to-speech (TTS) synthesis. However, sampling-based decoding strategies, while offering high diversity, often suffer from robustness issues such as typos, omissions, and repetitions. To address these challenges, we propose VALL-E R, a robust and efficient zero-shot TTS system built upon the VALL-E framework. Specifically, we introduce a monotonic phoneme alignment strategy that reinforces the correspondence between phonemes and acoustic sequences by constraining each acoustic token to its associated phoneme. Additionally, we propose a codec-merging technique that downsamples the discrete codes in the shallow quantization layer, significantly accelerating decoding without compromising speech quality. These enhancements grant VALL-E R improved phoneme-level controllability and robustness, achieving word error rates (WER) close to those of the ground truth. Furthermore, by reducing the number of autoregressive decoding steps, the model achieves over a 60% reduction in inference time.
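To make the codec-merging idea concrete, below is a minimal sketch, not the authors' released implementation, of one way adjacent codes in a shallow quantizer layer could be downsampled: the codebook embeddings of neighboring frames are averaged and re-quantized to the nearest codebook entry. The function name `merge_codec_codes`, the merge window, and the PyTorch framing are illustrative assumptions.

```python
import torch

def merge_codec_codes(codes: torch.Tensor, codebook: torch.Tensor,
                      window: int = 2) -> torch.Tensor:
    """Downsample one quantizer layer's discrete codes by a factor of `window`.

    codes:    LongTensor [T], code indices for the shallow quantizer layer
    codebook: FloatTensor [K, D], that layer's embedding table
    Returns a LongTensor [T // window] of merged code indices.
    (Illustrative sketch; the paper's exact merging rule may differ.)
    """
    T = codes.shape[0] - codes.shape[0] % window  # drop a ragged tail frame, if any
    emb = codebook[codes[:T]]                     # [T, D] embedding lookup
    merged = emb.view(T // window, window, -1).mean(dim=1)  # average adjacent frames -> [T//window, D]
    dists = torch.cdist(merged, codebook)         # [T//window, K] distances to codebook entries
    return dists.argmin(dim=-1)                   # re-quantize to nearest code index
```

Under this reading, merging in the shallowest layer is what shortens autoregressive decoding: with `window=2`, the first-layer code sequence is halved, so the AR stage takes roughly half as many steps, consistent with the inference-time reduction claimed above.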