TANG Jun, ZHANG Lianhai, LI Jiaxin. A Real-time Robust Speech Synthesis Method Based on Improved Attention Mechanism[J]. JOURNAL OF SIGNAL PROCESSING, 2022, 38(3): 527-535. DOI: 10.16798/j.issn.1003-0530.2022.03.010

A Real-time Robust Speech Synthesis Method Based on Improved Attention Mechanism

To address shortcomings of the existing Tacotron 2 speech synthesis system, namely slow attention-model learning, insufficiently robust synthesized speech, and slow synthesis speed, three improvements are proposed: (1) use phoneme embeddings as input to reduce mispronunciation problems; (2) introduce an attention loss that guides the attention model so that it learns quickly and accurately; (3) use the WaveGlow model as the vocoder to speed up waveform generation. Experiments on the LJSpeech dataset show that the improved network learns attention faster and more accurately, and the error rate of its synthesized speech is 33.4% lower than the baseline. At the same time, the end-to-end synthesis speed of the whole network increases by approximately 523 times, with a Real-Time Factor (RTF) of 0.96, meeting real-time requirements. In terms of voice quality, the synthesized speech achieves a Mean Opinion Score (MOS) of 3.88.
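The abstract does not specify the exact form of the attention loss; a common choice for enforcing fast, monotonic alignment in Tacotron-style models is a guided attention loss, which penalizes attention mass far from the diagonal of the text/mel alignment matrix. A minimal sketch, assuming that formulation (function names and the sharpness parameter `g` are illustrative, not taken from the paper):

```python
import math

def guided_attention_weights(text_len, mel_len, g=0.2):
    # Penalty matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)):
    # near zero on the diagonal, approaching 1 far from it.
    return [[1.0 - math.exp(-((n / text_len - t / mel_len) ** 2) / (2 * g * g))
             for t in range(mel_len)]
            for n in range(text_len)]

def guided_attention_loss(alignment, g=0.2):
    # alignment: text_len x mel_len attention matrix produced by the decoder.
    # The loss is the mean of the element-wise product with the penalty matrix,
    # so off-diagonal attention is penalized and diagonal attention is free.
    N = len(alignment)
    T = len(alignment[0])
    W = guided_attention_weights(N, T, g)
    total = sum(alignment[n][t] * W[n][t]
                for n in range(N) for t in range(T))
    return total / (N * T)
```

Under this scheme a perfectly diagonal alignment incurs zero loss, while a scattered or reversed alignment is penalized, which is what drives the attention model toward a stable monotonic read of the input phonemes.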
