ZHANG Jianming, PENG Jintao, JIA Hongjie, MAO Qirong. End-to-End Emotional Speech Synthesis Method Based on Conditional Variational Autoencoder[J]. JOURNAL OF SIGNAL PROCESSING, 2023, 39(4): 678-687. DOI: 10.16798/j.issn.1003-0530.2023.04.009

End-to-End Emotional Speech Synthesis Method Based on Conditional Variational Autoencoder

Abstract: Emotional speech synthesis, an important branch of speech synthesis, has received extensive attention in the field of human-computer interaction. How to obtain better emotion embeddings and inject them effectively into text-to-speech acoustic models remains the central problem. Expressive speech synthesis methods typically derive style embeddings from reference audio, but they learn only an average representation of style and cannot express an explicit emotional state. This paper proposes CD-Tacotron, an effective emotion control method for end-to-end speech synthesis. Built on the Tacotron2 model, it introduces a conditional variational autoencoder to disentangle emotional information from speech signals and use it as a conditioning factor: emotion labels are encoded into vectors and concatenated with the remaining style information, which is encoded in a latent space constrained to follow a standard normal distribution. Emotional speech is then synthesized by the spectrogram prediction network. Subjective and objective experiments on the ESD dataset show that the proposed method generates more expressive emotional speech than GST-Tacotron and VAE-Tacotron.
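To make the conditioning scheme in the abstract concrete, below is a minimal sketch of how an explicit emotion-label embedding can be concatenated with a residual style latent that is regularized toward a standard normal prior. This is not the authors' implementation: the module, its name, and all dimensions are illustrative assumptions, and the Tacotron2-style decoder that would consume the conditioning vector is omitted.

```python
# Illustrative sketch (assumed design, not the paper's code): a reference
# mel-spectrogram is summarized into a latent style vector with a KL term
# toward N(0, I), then concatenated with a discrete emotion embedding to
# condition a Tacotron2-style spectrogram prediction network.
import torch
import torch.nn as nn

class ConditionalStyleEncoder(nn.Module):
    def __init__(self, n_mels=80, latent_dim=16, num_emotions=5, emo_dim=16):
        super().__init__()
        # Reference encoder: summarize a mel-spectrogram into a fixed vector.
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.to_mu = nn.Linear(128, latent_dim)      # posterior mean
        self.to_logvar = nn.Linear(128, latent_dim)  # posterior log-variance
        # Explicit emotion condition: one learned embedding per discrete label.
        self.emo_embed = nn.Embedding(num_emotions, emo_dim)

    def forward(self, ref_mel, emotion_id):
        # ref_mel: (batch, frames, n_mels); emotion_id: (batch,)
        _, h = self.rnn(ref_mel)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample the residual style latent.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL term pulls the latent toward the standard normal prior.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        # Concatenate the explicit emotion embedding with the residual style
        # latent; this joint vector conditions the spectrogram decoder.
        cond = torch.cat([self.emo_embed(emotion_id), z], dim=-1)
        return cond, kl
```

In such a design, the discrete emotion label carries the explicit emotional state while the KL-regularized latent absorbs the remaining, label-independent style variation, which is the disentanglement the abstract describes.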
