Improved High-efficiency Vocoder Based on HiFi-GAN
Abstract: The HiFi-GAN vocoder can be made smaller and faster at inference by reducing the number of channels or layers in its network, but doing so seriously degrades the quality of the generated speech. To address this problem, two improvements are proposed: (1) a multi-scale convolution strategy processes the input Mel spectrogram to represent its feature information more effectively; (2) one-dimensional depthwise separable convolutions replace the standard one-dimensional convolutions in the generator network. Experimental results show that the multi-scale convolution strategy improves model performance and the quality of the generated speech, while the one-dimensional depthwise separable convolution significantly reduces the number of model parameters and accelerates inference. Combining the two improves the HiFi-GAN model overall: the number of model parameters is reduced by approximately 67.72%, and inference speed on GPU and CPU increases by 11.72% and 28.98%, respectively. Speech quality also improves slightly: the Mean Opinion Score (MOS) rises by 0.07 and the Perceptual Evaluation of Speech Quality (PESQ) score rises by 0.05.
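To illustrate why replacing a standard one-dimensional convolution with a depthwise separable one shrinks a layer, the following minimal sketch counts the weights of both variants. It assumes the usual decomposition into a depthwise convolution (one filter per input channel) followed by a 1x1 pointwise convolution; the function names and the channel and kernel sizes are hypothetical choices for illustration, not values taken from the paper, so the ratio it prints is for a single layer and is not the model-level 67.72% figure.

```python
def conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a standard 1-D convolution layer."""
    return c_in * c_out * kernel_size + (c_out if bias else 0)


def depthwise_separable_conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a depthwise conv followed by a 1x1 pointwise conv."""
    # Depthwise step: one kernel of length kernel_size per input channel.
    depthwise = c_in * kernel_size + (c_in if bias else 0)
    # Pointwise step: a 1x1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def relative_reduction(c_in, c_out, kernel_size):
    """Fraction of parameters saved by the separable variant for one layer."""
    standard = conv1d_params(c_in, c_out, kernel_size)
    separable = depthwise_separable_conv1d_params(c_in, c_out, kernel_size)
    return 1.0 - separable / standard


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    c_in, c_out, k = 256, 256, 7
    print("standard :", conv1d_params(c_in, c_out, k))
    print("separable:", depthwise_separable_conv1d_params(c_in, c_out, k))
    print("saved    : {:.2%}".format(relative_reduction(c_in, c_out, k)))
```

The saving grows with the kernel size and the number of output channels, since the standard layer scales with their product while the separable layer scales roughly with their sum.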