A Real-Time Robust Speech Synthesis Method Based on an Improved Attention Mechanism
Abstract: To address three problems in the existing speech synthesis system Tacotron 2, namely slow learning of the attention model, insufficient robustness of the synthesized speech, and slow synthesis speed, three improvements are proposed: (1) use phoneme embeddings as input to reduce mispronunciation; (2) introduce an attention loss to guide the learning of the attention model, enabling it to learn quickly and accurately; (3) use the WaveGlow model as the vocoder to accelerate waveform generation. Experiments on the LJSpeech dataset show that the improved network increases both the speed and the accuracy of attention learning, reducing the error rate of the synthesized speech by 33.4% relative to the baseline. The synthesis speed of the whole network improves by approximately 523 times, with a Real-Time Factor (RTF) of 0.96, which meets the real-time requirement. In terms of speech quality, the synthesized speech achieves a Mean Opinion Score (MOS) of 3.88.
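The attention loss described above resembles the widely used guided attention loss, which penalizes attention weights that stray far from a roughly diagonal encoder-decoder alignment. The sketch below illustrates that idea only; the function name, the Gaussian penalty form, and the width parameter `g` are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def guided_attention_loss(attn, g=0.2):
    """Penalize off-diagonal attention (illustrative sketch, not the paper's exact loss).

    attn : (T_dec, T_enc) soft alignment matrix, rows summing to 1.
    g    : width of the diagonal band; smaller g penalizes deviations harder.
    """
    t_dec, t_enc = attn.shape
    # Normalized positions along the decoder (rows) and encoder (columns) axes.
    n = np.arange(t_enc) / t_enc            # encoder position, shape (T_enc,)
    t = np.arange(t_dec) / t_dec            # decoder position, shape (T_dec,)
    # Penalty is 0 on the diagonal (n/N == t/T) and grows toward 1 off it.
    w = 1.0 - np.exp(-((n[None, :] - t[:, None]) ** 2) / (2.0 * g ** 2))
    # Loss is the expected penalty under the attention distribution.
    return float((attn * w).mean())

# A sharply diagonal alignment incurs less loss than a flat (unaligned) one,
# so minimizing this term pushes the model toward monotonic alignments early.
diag_attn = np.eye(8)
flat_attn = np.full((8, 8), 1.0 / 8)
print(guided_attention_loss(diag_attn), guided_attention_loss(flat_attn))
```

Added to the total training loss with a small weight, such a term gives the attention module an explicit alignment signal from the first steps, which is what speeds up attention learning and reduces skipped or repeated words at synthesis time.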