VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

  • Abstract: With the aid of discrete neural audio codecs, large language models have emerged as a promising approach for zero-shot text-to-speech (TTS) synthesis. However, sampling-based decoding strategies, while bringing rich diversity to speech generation, often suffer from robustness issues such as mispronunciations, omissions, and repetitions. To address these challenges, we propose VALL-E R, a robust and efficient zero-shot TTS system built upon the VALL-E framework. Specifically, we introduce a monotonic phoneme alignment strategy that strengthens the mapping between phoneme and acoustic sequences by constraining each acoustic token to its associated phoneme, ensuring more precise alignment. Additionally, we propose a codec-merging technique that downsamples the discrete codes in the shallow quantization layer, significantly accelerating decoding without compromising speech quality. Benefiting from these strategies, VALL-E R achieves markedly improved phoneme-level controllability and demonstrates strong robustness, with a word error rate (WER) approaching that of ground-truth speech. Furthermore, it requires fewer autoregressive inference steps, reducing inference time by more than 60%.
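
To make the monotonic alignment constraint concrete, the sketch below shows a toy greedy decoder in which a pointer into the phoneme sequence may only stay in place or advance by one position per generated acoustic frame, so no phoneme can be skipped (causing omissions) or revisited (causing repetitions). This is a minimal illustration under assumed interfaces, not the paper's implementation: `dummy_ar_step`, the codec vocabulary size, and the `max_frames_per_phoneme` duration cap are hypothetical stand-ins for the actual VALL-E R decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEC_VOCAB = 1024  # assumed codec vocabulary size


def dummy_ar_step(phoneme, acoustic_history):
    """Hypothetical stand-in for the AR decoder: returns logits over the
    codec vocabulary and a scalar score meaning "advance to the next
    phoneme". A real model would be a Transformer conditioned on both."""
    logits = rng.normal(size=CODEC_VOCAB)
    advance_score = rng.normal()
    return logits, advance_score


def monotonic_decode(phonemes, max_frames_per_phoneme=20):
    """Greedy decoding under a monotonic phoneme-alignment constraint:
    the phoneme pointer may only stay or move forward by one, so the
    acoustic sequence cannot skip or repeat phonemes."""
    tokens, pointer, frames_here = [], 0, 0
    while pointer < len(phonemes):
        logits, advance = dummy_ar_step(phonemes[pointer], tokens)
        tokens.append(int(np.argmax(logits)))  # acoustic token for this frame
        frames_here += 1
        # Move on when the model signals it, or force an advance after a
        # duration cap so decoding is guaranteed to terminate.
        if advance > 0 or frames_here >= max_frames_per_phoneme:
            pointer, frames_here = pointer + 1, 0
    return tokens


print(len(monotonic_decode(["HH", "AH", "L", "OW"])), "acoustic tokens")
```

Because the pointer must traverse every phoneme exactly once and in order, decoding length is bounded and each output frame is attributable to a specific phoneme, which is what the phoneme-level controllability claim above refers to.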
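
The codec-merging idea can likewise be sketched under an assumed mechanism: average-pool the first-quantizer embeddings over adjacent frames and re-quantize the pooled vectors to their nearest codebook entries, halving the number of first-layer codes the autoregressive stage must generate. The codebook size, embedding dimension, and merge factor below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed first-quantizer codebook: 1024 entries of dimension 128.
codebook = rng.normal(size=(1024, 128))


def merge_codes(codes, factor=2):
    """Downsample first-layer codec codes by `factor`: embed, average
    each group of adjacent frames, then re-quantize the pooled vector
    to its nearest codebook entry."""
    T = len(codes) - len(codes) % factor       # drop any ragged tail frames
    emb = codebook[codes[:T]]                  # (T, D) frame embeddings
    pooled = emb.reshape(T // factor, factor, -1).mean(axis=1)
    # Nearest-neighbour re-quantization against the same codebook.
    dists = ((pooled[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)                # (T // factor,) merged codes


codes = rng.integers(0, 1024, size=300)
print(len(codes), "->", len(merge_codes(codes)))  # 300 -> 150
```

With a merge factor of 2, the AR stage emits half as many first-layer tokens while the non-autoregressive layers can still refine at the original frame rate, which is consistent with the reported inference-time reduction without a loss of speech quality.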