DVUGAN: DDSP-Integrated Variational U-Net for Speech Enhancement Based on STDCT
Abstract: This paper proposes DVUGAN, a speech enhancement model built on a generative adversarial network (GAN). The model operates in the transform domain and takes short-time discrete cosine transform (STDCT) features as input. Because STDCT coefficients are real-valued and encode phase implicitly, they can be learned by a real-valued network, which avoids complex-valued networks or complex-domain processing and reduces model complexity while still exploiting phase information. The generator is a variational U-Net encoder-decoder. It integrates differentiable digital signal processing (DDSP) components, whose strong inductive bias significantly improves the performance of the autoencoder, and a variational probabilistic bottleneck, which improves the suppression of impulsive noise sources and increases robustness to unseen data distributions. The Multi-Scale Spectral Loss from DDSP is introduced to exploit the perceptual bias of oscillators and guide the generator toward better perceptual quality. An SI-SNR Loss is used to optimize the discriminator, which balances the generative adversarial structure and stabilizes training. On the DNS development dataset and the Voice Bank+DEMAND dataset, the model outperforms the baseline models and several recent studies, which supports the effectiveness of DVUGAN for transform-domain speech enhancement.
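To make two of the quantities named in the abstract concrete, the following is a minimal pure-Python sketch of an STDCT front end and of the standard SI-SNR measure. It is not the authors' implementation: the function names, the frame length, the hop size, the sine window, and the orthonormal DCT-II scaling are all assumptions chosen for illustration, and the SI-SNR shown is the common textbook definition rather than the exact loss configuration used in DVUGAN.

```python
import math


def dct_ii(frame):
    """Orthonormal DCT-II of one real-valued frame (direct O(N^2) formula)."""
    n = len(frame)
    coeffs = []
    for k in range(n):
        acc = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                  for i, x in enumerate(frame))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * acc)
    return coeffs


def stdct(signal, frame_len=8, hop=4):
    """Short-time DCT: frame the signal, apply a window, DCT each frame.

    frame_len, hop, and the sine window are illustrative assumptions,
    not values taken from the paper. Every output coefficient is real,
    so no separate magnitude/phase representation is needed.
    """
    window = [math.sin(math.pi * (i + 0.5) / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(dct_ii(chunk))
    return frames


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (standard definition)."""
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]   # zero-mean estimate
    t = [x - mean_t for x in target]     # zero-mean target
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    proj = [dot / t_energy * b for b in t]        # estimate projected on target
    noise = [a - p for a, p in zip(e, proj)]      # residual orthogonal part
    num = sum(p * p for p in proj)
    den = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(num / den + eps)
```

With these assumed settings, `stdct` maps a waveform to a list of real-valued coefficient frames that a real-valued network could consume directly, and negating `si_snr` gives a quantity that can be minimized as a loss. Rescaling the estimate leaves `si_snr` essentially unchanged, which is the property the name refers to.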