VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

  • Abstract: With the aid of discrete neural audio codecs, large language models have emerged as a promising approach for zero-shot text-to-speech (TTS) synthesis. However, sampling-based decoding strategies, while bringing rich diversity to speech generation, often suffer from robustness issues such as mispronunciations, omissions, and repetitions. To address these challenges, we propose VALL-E R, a robust and efficient zero-shot TTS system built upon the VALL-E framework. Specifically, we introduce a monotonic phoneme alignment strategy that strengthens the mapping between phoneme and acoustic sequences by constraining each acoustic token to its associated phoneme, ensuring more precise alignment. Additionally, we propose a codec-merging technique that downsamples the discrete codes in the shallow quantization layer, significantly accelerating decoding without compromising speech quality. Benefiting from these strategies, VALL-E R achieves markedly improved phoneme-level controllability and demonstrates strong robustness, with a word error rate (WER) approaching that of ground-truth speech. Furthermore, it requires fewer autoregressive inference steps, reducing inference time by more than 60%.
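
To make the monotonic alignment constraint concrete, the sketch below shows a toy greedy decoder in which a pointer into the phoneme sequence may only stay in place or advance by one position per generated acoustic frame, so no phoneme can be skipped (causing omissions) or revisited (causing repetitions). This is a minimal illustration under assumed interfaces, not the paper's implementation: `dummy_ar_step`, the codec vocabulary size, and the `max_frames_per_phoneme` duration cap are hypothetical stand-ins for the actual VALL-E R decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEC_VOCAB = 1024  # assumed codec vocabulary size


def dummy_ar_step(phoneme, acoustic_history):
    """Hypothetical stand-in for the AR decoder: returns logits over the
    codec vocabulary and a scalar score meaning "advance to the next
    phoneme". A real model would be a Transformer conditioned on both."""
    logits = rng.normal(size=CODEC_VOCAB)
    advance_score = rng.normal()
    return logits, advance_score


def monotonic_decode(phonemes, max_frames_per_phoneme=20):
    """Greedy decoding under a monotonic phoneme-alignment constraint:
    the phoneme pointer may only stay or move forward by one, so the
    acoustic sequence cannot skip or repeat phonemes."""
    tokens, pointer, frames_here = [], 0, 0
    while pointer < len(phonemes):
        logits, advance = dummy_ar_step(phonemes[pointer], tokens)
        tokens.append(int(np.argmax(logits)))  # acoustic token for this frame
        frames_here += 1
        # Move on when the model signals it, or force an advance after a
        # duration cap so decoding is guaranteed to terminate.
        if advance > 0 or frames_here >= max_frames_per_phoneme:
            pointer, frames_here = pointer + 1, 0
    return tokens


print(len(monotonic_decode(["HH", "AH", "L", "OW"])), "acoustic tokens")
```

Because the pointer must traverse every phoneme exactly once and in order, decoding length is bounded and each output frame is attributable to a specific phoneme, which is what the phoneme-level controllability claim above refers to.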
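
The codec-merging idea can likewise be sketched under an assumed mechanism: average-pool the first-quantizer embeddings over adjacent frames and re-quantize the pooled vectors to their nearest codebook entries, halving the number of first-layer codes the autoregressive stage must generate. The codebook size, embedding dimension, and merge factor below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed first-quantizer codebook: 1024 entries of dimension 128.
codebook = rng.normal(size=(1024, 128))


def merge_codes(codes, factor=2):
    """Downsample first-layer codec codes by `factor`: embed, average
    each group of adjacent frames, then re-quantize the pooled vector
    to its nearest codebook entry."""
    T = len(codes) - len(codes) % factor       # drop any ragged tail frames
    emb = codebook[codes[:T]]                  # (T, D) frame embeddings
    pooled = emb.reshape(T // factor, factor, -1).mean(axis=1)
    # Nearest-neighbour re-quantization against the same codebook.
    dists = ((pooled[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)                # (T // factor,) merged codes


codes = rng.integers(0, 1024, size=300)
print(len(codes), "->", len(merge_codes(codes)))  # 300 -> 150
```

With a merge factor of 2, the AR stage emits half as many first-layer tokens while the non-autoregressive layers can still refine at the original frame rate, which is consistent with the reported inference-time reduction without a loss of speech quality.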