VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
Abstract
With the aid of discrete neural audio codecs, large language models (LLMs) have emerged as a promising approach for zero-shot text-to-speech (TTS) synthesis. However, sampling-based decoding strategies, while offering high diversity, often suffer from robustness issues such as typos, omissions, and repetitions. To address these challenges, we propose VALL-E R, a robust and efficient zero-shot TTS system built upon the VALL-E framework. Specifically, we introduce a monotonic phoneme alignment strategy that reinforces the correspondence between phonemes and acoustic sequences by constraining each acoustic token to its associated phoneme. Additionally, we propose a codec-merging technique that downsamples the discrete codes in the shallow quantization layer, significantly accelerating decoding without compromising speech quality. These enhancements grant VALL-E R improved phoneme-level controllability and robustness, achieving word error rates (WER) close to those of the ground truth. Furthermore, by reducing the number of autoregressive decoding steps, the model achieves over a 60% reduction in inference time.
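To make the codec-merging idea concrete, below is a minimal sketch, not the authors' released implementation, of one way adjacent codes in a shallow quantizer layer could be downsampled: the codebook embeddings of neighboring frames are averaged and re-quantized to the nearest codebook entry. The function name `merge_codec_codes`, the merge window, and the PyTorch framing are illustrative assumptions.

```python
import torch

def merge_codec_codes(codes: torch.Tensor, codebook: torch.Tensor,
                      window: int = 2) -> torch.Tensor:
    """Downsample one quantizer layer's discrete codes by a factor of `window`.

    codes:    LongTensor [T], code indices for the shallow quantizer layer
    codebook: FloatTensor [K, D], that layer's embedding table
    Returns a LongTensor [T // window] of merged code indices.
    (Illustrative sketch; the paper's exact merging rule may differ.)
    """
    T = codes.shape[0] - codes.shape[0] % window  # drop a ragged tail frame, if any
    emb = codebook[codes[:T]]                     # [T, D] embedding lookup
    merged = emb.view(T // window, window, -1).mean(dim=1)  # average adjacent frames -> [T//window, D]
    dists = torch.cdist(merged, codebook)         # [T//window, K] distances to codebook entries
    return dists.argmin(dim=-1)                   # re-quantize to nearest code index
```

Under this reading, merging in the shallowest layer is what shortens autoregressive decoding: with `window=2`, the first-layer code sequence is halved, so the AR stage takes roughly half as many steps, consistent with the inference-time reduction claimed above.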