HAN Bing, QIAN Yanmin. VALL-E R: Robust and efficient zero-shot text-to-speech synthesis via monotonic alignment[J]. Journal of Signal Processing, 2025, 41(9): 1537-1546. DOI: 10.12466/xhcl.2025.09.007.

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

  • With the aid of discrete neural audio codecs, large language models have emerged as a promising approach for zero-shot text-to-speech (TTS) synthesis. However, sampling-based decoding strategies, while offering high diversity, often suffer from robustness issues such as mispronunciations, omissions, and repetitions. To address these challenges, we propose VALL-E R, a robust and efficient zero-shot TTS system built upon the VALL-E framework. Specifically, we introduce a monotonic phoneme alignment strategy that reinforces the correspondence between phonemes and acoustic sequences by constraining acoustic tokens to their associated phonemes. Additionally, we propose a codec-merging technique that downsamples the discrete codes in the shallow quantization layer, significantly accelerating decoding without compromising speech quality. These enhancements grant VALL-E R improved phoneme-level controllability and robustness, achieving word error rates (WER) close to those of the ground truth. Furthermore, the model reduces the number of autoregressive steps, cutting inference time by more than 60%.
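The codec-merging idea described above can be sketched as follows: codes in a shallow quantization layer are downsampled by average-pooling their codebook embeddings over a small window and snapping each pooled vector back to the nearest codebook entry, halving the autoregressive sequence length for a window of 2. This is a minimal illustration, not the paper's implementation; the function name, interface, and toy codebook are assumptions.

```python
import numpy as np

def merge_codec_codes(codes, embeddings, window=2):
    """Hypothetical codec-merging sketch.

    codes:      1-D array of discrete code indices from a shallow quantizer.
    embeddings: (codebook_size, dim) codebook embedding table.
    Returns a downsampled code sequence of length len(codes) // window.
    """
    T = (len(codes) // window) * window              # drop any ragged tail
    vecs = embeddings[codes[:T]]                     # (T, dim) embedding lookup
    pooled = vecs.reshape(-1, window, vecs.shape[-1]).mean(axis=1)
    # Re-quantize: nearest codebook entry for each pooled vector.
    dists = ((pooled[:, None, :] - embeddings[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)                      # (T // window,) merged codes

# Toy example with a 4-entry one-hot codebook.
codebook = np.eye(4)
merged = merge_codec_codes(np.array([0, 0, 1, 1, 2, 3]), codebook)
print(merged.shape[0])  # sequence length halved: 3
```

Because only the shallow layer is decoded autoregressively in VALL-E-style models, shortening that sequence directly reduces the number of autoregressive steps, which is the source of the reported inference-time savings.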
