TANG Jun, ZHANG Lianhai, LI Jiaxin. A Real-time Robust Speech Synthesis Method Based on Improved Attention Mechanism[J]. JOURNAL OF SIGNAL PROCESSING, 2022, 38(3): 527-535. DOI: 10.16798/j.issn.1003-0530.2022.03.010

A Real-time Robust Speech Synthesis Method Based on Improved Attention Mechanism

To address shortcomings of the existing Tacotron 2 speech synthesis system, namely slow attention-model learning, insufficiently robust synthesized speech, and slow synthesis speed, three improvements are proposed: (1) use phoneme embeddings as input to reduce mispronunciation problems; (2) introduce an attention loss that guides the attention model so that it learns quickly and accurately; (3) use the WaveGlow model as the vocoder to speed up waveform generation. Experiments on the LJSpeech dataset show that the improved network learns attention faster and more accurately, and the error rate of its synthesized speech is 33.4% lower than the baseline. At the same time, the end-to-end synthesis speed of the whole network increases by approximately 523 times, with a Real-Time Factor (RTF) of 0.96, meeting real-time requirements. In terms of voice quality, the synthesized speech achieves a Mean Opinion Score (MOS) of 3.88.
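The abstract does not specify the exact form of the attention loss; a common choice for enforcing fast, monotonic alignment in Tacotron-style models is a guided attention loss, which penalizes attention mass far from the diagonal of the text/mel alignment matrix. A minimal sketch, assuming that formulation (function names and the sharpness parameter `g` are illustrative, not taken from the paper):

```python
import math

def guided_attention_weights(text_len, mel_len, g=0.2):
    # Penalty matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)):
    # near zero on the diagonal, approaching 1 far from it.
    return [[1.0 - math.exp(-((n / text_len - t / mel_len) ** 2) / (2 * g * g))
             for t in range(mel_len)]
            for n in range(text_len)]

def guided_attention_loss(alignment, g=0.2):
    # alignment: text_len x mel_len attention matrix produced by the decoder.
    # The loss is the mean of the element-wise product with the penalty matrix,
    # so off-diagonal attention is penalized and diagonal attention is free.
    N = len(alignment)
    T = len(alignment[0])
    W = guided_attention_weights(N, T, g)
    total = sum(alignment[n][t] * W[n][t]
                for n in range(N) for t in range(T))
    return total / (N * T)
```

Under this scheme a perfectly diagonal alignment incurs zero loss, while a scattered or reversed alignment is penalized, which is what drives the attention model toward a stable monotonic read of the input phonemes.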
