Learnable Graph Ratio Mask-based Speech Enhancement in the Graph Frequency Domain

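  • Summary: In deep neural network (DNN)-based time-frequency domain speech enhancement, the complex-valued time-frequency spectrum of noisy speech obtained by the short-time Fourier transform (STFT) is typically used as the DNN input to estimate the magnitude and phase of clean speech. Because such methods operate on complex numbers, their computational complexity and parameter counts are large. To address this, this paper draws on graph signal processing (GSP) theory and proposes a DNN-based speech enhancement method in the graph frequency domain. First, a real symmetric adjacency matrix is defined from the positional relationships among speech samples, so that the speech signal is represented as a graph signal on an undirected graph; the corresponding graph Fourier transform (GFT) is then used to extract real-valued graph frequency domain features. Because the GFT basis is closely tied to the adjacency matrix, these features implicitly exploit the relationships among samples and can be processed by a real-valued network. Next, a network based on the convolution-augmented transformer (conformer), the conformer-based network with graph Fourier transform (GFT-conformer), is constructed to capture local and global dependencies of the graph frequency domain features along the time and graph frequency dimensions, and is trained with a mask-based target to perform speech enhancement. Finally, considering that speech and noise differ in their characteristics across graph frequency components, a learnable graph ratio mask (LGRM) is proposed that controls the mask range of each graph frequency component separately, enabling fine-grained denoising per component and further improving the enhancement performance of the GFT-conformer. Experiments on the Voice Bank+DEMAND and Deep Xi datasets show that the proposed method outperforms DNN-based time-domain and time-frequency domain baselines on five commonly used evaluation metrics.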

     

    Abstract: Deep neural network (DNN)-based speech enhancement methods in the time-frequency domain generally operate on the complex-valued time-frequency representations obtained through the short-time Fourier transform (STFT). Taking the complex-valued spectrogram of noisy speech, with its fine structure, as the input, DNNs can estimate both the magnitude and the phase of clean speech. Such methods primarily employ complex-valued neural networks to handle complex-valued inputs directly, or two-path networks to handle the real and imaginary parts separately, resulting in high computational complexity and a large number of model parameters. To address this problem, this study proposes a DNN-based speech enhancement method in the graph frequency domain that uses graph signal processing (GSP) theory to obtain real-valued rather than complex-valued inputs. Specifically, a novel real symmetric adjacency matrix is defined based on the positional relationships among the samples of a speech signal, so that the speech signal is represented as a graph signal on an undirected graph. Eigenvalue decomposition of this real symmetric adjacency matrix yields the graph Fourier transform (GFT) basis, which is then used to extract real-valued features of the speech graph signal in the graph frequency domain. Because the GFT basis is closely related to the adjacency matrix, these real-valued features implicitly exploit the relationships among speech samples.
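The GFT step described above can be illustrated with a minimal sketch. The abstract does not specify the proposed adjacency matrix, so the code below substitutes a simple path-graph adjacency (each sample linked to its two neighbors) purely as a stand-in; the function names `path_adjacency`, `gft_basis`, `gft`, and `igft` are hypothetical and not taken from the paper. It only shows that a real symmetric adjacency matrix gives a real orthonormal basis, and therefore real-valued, invertible features.

```python
import numpy as np


def path_adjacency(n):
    """Stand-in adjacency (assumption): sample i is linked to samples i-1 and i+1."""
    a = np.zeros((n, n))
    idx = np.arange(n - 1)
    a[idx, idx + 1] = 1.0
    a[idx + 1, idx] = 1.0
    return a


def gft_basis(adjacency):
    """Eigendecomposition of a real symmetric matrix: real eigenvalues, orthonormal eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(adjacency)
    return eigvals, eigvecs


def gft(frame, basis):
    """Forward GFT: project one frame of samples onto the eigenvector basis."""
    return basis.T @ frame


def igft(coeffs, basis):
    """Inverse GFT: the basis is orthonormal, so its transpose is its inverse."""
    return basis @ coeffs


rng = np.random.default_rng(0)
frame = rng.standard_normal(8)      # one short frame of (synthetic) speech samples
A = path_adjacency(8)
graph_freqs, U = gft_basis(A)
features = gft(frame, U)            # real-valued graph frequency domain features
restored = igft(features, U)        # matches `frame` up to floating-point error
```

Unlike an STFT frame, `features` has no imaginary part, which is what allows a purely real-valued network to process it.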
Furthermore, by combining the convolution-augmented transformer (conformer) and the convolutional recurrent network (CRN), this study constructs the conformer-based network with graph Fourier transform (GFT-conformer). It is essentially a convolutional encoder-decoder (CED) with four two-stage conformer blocks (TS-conformers) that capture both local and global dependencies of the features along the time and graph frequency dimensions, and it estimates mask-based targets to perform speech enhancement. Moreover, considering that the characteristics of speech and noise differ across graph frequency components, this study introduces the learnable graph ratio mask (LGRM), which controls the mask range of each graph frequency component separately and thus enables fine-grained denoising per component, further improving the speech enhancement performance of the GFT-conformer. The proposed GFT-conformer with LGRM is evaluated on the Voice Bank+DEMAND and Deep Xi datasets in terms of five commonly used metrics. Experimental results show that it outperforms several state-of-the-art DNN-based time-domain and time-frequency domain methods while having the smallest model size, at 1.4M parameters.
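The idea of controlling the mask range per graph frequency component can be sketched as follows. The abstract does not give the LGRM formula, so this is an assumed parameterization rather than the paper's definition: a sigmoid output scaled by one positive upper bound per component. In a trained model the bounds would be learnable parameters; here they are fixed numbers, and the names `ranged_mask` and `upper_bounds` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def ranged_mask(logits, upper_bounds):
    """Assumed form: mask[t, k] lies in (0, upper_bounds[k]).

    logits:       (frames, components) raw network outputs
    upper_bounds: (components,) positive range limit per graph frequency component
    """
    return sigmoid(logits) * upper_bounds  # bounds broadcast across frames


rng = np.random.default_rng(0)
num_frames, num_components = 4, 6
noisy_features = rng.standard_normal((num_frames, num_components))  # synthetic GFT features
logits = rng.standard_normal((num_frames, num_components))          # stand-in network output
upper_bounds = np.linspace(1.0, 2.0, num_components)                # would be learned in practice

mask = ranged_mask(logits, upper_bounds)
enhanced = mask * noisy_features  # element-wise masking in the graph frequency domain
```

With a single shared bound this reduces to an ordinary bounded ratio mask; giving each component its own bound is what lets the range differ between components.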

     
