TANG Jun, ZHANG Lianhai, LI Jiaxin, LI Yiting. Improved High-efficiency Vocoder Based on HiFi-GAN[J]. JOURNAL OF SIGNAL PROCESSING, 2022, 38(9): 1988-1998. DOI: 10.16798/j.issn.1003-0530.2022.09.021

Improved High-efficiency Vocoder Based on HiFi-GAN

The HiFi-GAN vocoder can cut its parameter count and speed up inference by reducing the number of channels or layers in the network, but doing so seriously degrades the quality of the generated speech. To address this problem, two improvements are proposed: (1) a multi-scale convolution strategy processes the input Mel spectrum so that its feature information is characterized effectively; (2) one-dimensional depthwise separable convolutions replace the standard one-dimensional convolutions in the generator network. Experimental results show that the multi-scale convolution strategy improves model performance and the quality of the generated speech, while the one-dimensional depthwise separable convolution significantly reduces the number of model parameters and speeds up inference. Combining the two improves the overall performance of the HiFi-GAN model: the number of model parameters is reduced by approximately 67.72%, and inference speed on the GPU and CPU increases by 11.72% and 28.98%, respectively. Speech quality also improves slightly: the Mean Opinion Score (MOS) increases by 0.07 and the Perceptual Evaluation of Speech Quality (PESQ) score increases by 0.05.
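To illustrate why replacing a standard one-dimensional convolution with a depthwise separable one shrinks a model, the following minimal sketch counts the weights of a single layer in both forms. The channel count (512) and kernel size (7) are hypothetical values chosen for illustration and are not taken from the paper; the function names are likewise made up here. The per-layer saving it computes is not the 67.72% figure reported above, which refers to the whole model.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameter count of a standard 1-D convolution layer."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def separable_conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameter count of a depthwise separable 1-D convolution.

    Depthwise step: one kernel per input channel (c_in * kernel weights).
    Pointwise step: a 1x1 convolution mixing channels (c_in * c_out weights).
    """
    depthwise = c_in * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, for illustration only.
    c_in, c_out, kernel = 512, 512, 7
    standard = conv1d_params(c_in, c_out, kernel)
    separable = separable_conv1d_params(c_in, c_out, kernel)
    print(f"standard:  {standard} parameters")
    print(f"separable: {separable} parameters")
    print(f"reduction: {1 - separable / standard:.2%}")
```

The saving grows with the kernel size, because the standard layer multiplies the kernel length by both channel counts while the separable layer multiplies it by the input channels only.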
