End-to-End Emotional Speech Synthesis Method Based on a Conditional Variational Autoencoder
Abstract: Emotional speech synthesis, an important branch of speech synthesis, has received extensive attention in the field of human-computer interaction. The main open problems are how to obtain better emotion embeddings and how to inject them effectively into text-to-speech acoustic models. Expressive speech synthesis typically derives style embeddings from reference audio, but such embeddings capture only an average representation of style and cannot express a distinct emotional state. This paper proposes CD-Tacotron (Conditional Duration-Tacotron), an end-to-end emotional speech synthesis method based on a conditional variational autoencoder. The method extends Tacotron2 by introducing a conditional variational autoencoder that disentangles emotional information from the speech signal and uses it as a conditioning factor: emotion labels are encoded into vectors and concatenated with the remaining style information, which is modeled in a latent space constrained to follow a standard normal distribution. Emotional speech is then synthesized by the spectrogram prediction network. Subjective and objective experiments on the ESD dataset show that, compared with the mainstream methods GST-Tacotron and VAE-Tacotron, the proposed method generates more expressive emotional speech.
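To make the conditioning scheme concrete, the following is a minimal PyTorch sketch, not the paper's implementation, of the mechanism the abstract describes: a reference mel-spectrogram is encoded into a latent style vector regularized toward a standard normal distribution, the emotion label is embedded into a vector as the conditional factor, and the two are concatenated into a style embedding for the spectrogram prediction network. All module names and sizes here (ConditionalStyleEncoder, latent_dim, the GRU reference encoder) are illustrative assumptions.

```python
# Sketch of CVAE-style emotion conditioning as described in the abstract.
# Layer choices and dimensions are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class ConditionalStyleEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=128, latent_dim=16,
                 n_emotions=5, emotion_dim=16):
        super().__init__()
        # Reference encoder: summarizes a mel-spectrogram into one vector.
        self.ref_rnn = nn.GRU(n_mels, hidden, batch_first=True)
        # Emotion label -> embedding vector (the conditional factor).
        self.emotion_emb = nn.Embedding(n_emotions, emotion_dim)
        # Condition the latent posterior on the emotion embedding so the
        # latent carries residual style, disentangled from the emotion.
        self.to_mu = nn.Linear(hidden + emotion_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden + emotion_dim, latent_dim)

    def forward(self, mel, emotion_id):
        # mel: (batch, frames, n_mels); emotion_id: (batch,)
        _, h = self.ref_rnn(mel)                # h: (1, batch, hidden)
        h = h.squeeze(0)
        e = self.emotion_emb(emotion_id)        # (batch, emotion_dim)
        he = torch.cat([h, e], dim=-1)
        mu, logvar = self.to_mu(he), self.to_logvar(he)
        # Reparameterization trick: sample z ~ N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL term pushes the residual style latent toward N(0, I).
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        # Style embedding = [residual style latent ; emotion vector],
        # to be concatenated with the text encoder outputs downstream.
        style = torch.cat([z, e], dim=-1)
        return style, kl

if __name__ == "__main__":
    enc = ConditionalStyleEncoder()
    mel = torch.randn(2, 200, 80)               # two reference utterances
    emo = torch.tensor([0, 3])                  # e.g. neutral, happy
    style, kl = enc(mel, emo)
    print(style.shape, kl.item())               # torch.Size([2, 32]) ...
```

At synthesis time, a setup like this lets the emotion embedding be chosen directly from the target label while the residual style latent is sampled from (or fixed at the mean of) the standard normal prior, which is what allows explicit emotion control rather than an averaged style.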