WANG Jingrun, GUO Haiyan, WANG Tingting, et al. Learnable graph ratio mask-based speech enhancement in the graph frequency domain[J]. Journal of Signal Processing, 2024, 40(12): 2249-2260. DOI: 10.12466/xhcl.2024.12.013.

Learnable Graph Ratio Mask-based Speech Enhancement in the Graph Frequency Domain

Deep neural network (DNN)-based speech enhancement methods in the time-frequency domain generally operate on the complex-valued time-frequency representations obtained through the short-time Fourier transform (STFT). Taking the complex-valued spectrogram of noisy speech as input preserves its fine structure and allows a DNN to estimate both the magnitude and the phase of clean speech. Such methods typically either employ complex-valued neural networks to process the complex inputs directly or use two-path networks to process the real and imaginary parts separately; both approaches result in high computational complexity and a large number of model parameters. To address this problem, this study proposes a DNN-based speech enhancement method in the graph frequency domain that uses graph signal processing (GSP) theory to obtain real-valued rather than complex-valued inputs. Specifically, a novel real symmetric adjacency matrix is defined based on the positional relationships among the samples of a speech signal, so that the speech signal is represented as an undirected graph signal. Eigenvalue decomposition of this real symmetric adjacency matrix yields the graph Fourier transform (GFT) basis, which is then used to extract real-valued features of the speech graph signal in the graph frequency domain. Because the GFT basis is derived from the adjacency matrix, these real-valued features implicitly exploit the relationships among speech samples.
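The GFT step described above can be illustrated with a minimal sketch. The abstract does not give the paper's adjacency matrix, so `build_adjacency` below is a hypothetical stand-in (each sample is linked to its `k` nearest neighbours in position, with weight 1/distance); only the general recipe (real symmetric matrix, eigendecomposition, projection onto the eigenvectors) follows the text.

```python
import numpy as np


def build_adjacency(n: int, k: int = 2) -> np.ndarray:
    """Hypothetical real symmetric adjacency matrix for an n-sample frame.

    Assumption (not from the paper): sample i is connected to samples
    within k positions, with edge weight 1/distance.
    """
    A = np.zeros((n, n))
    for d in range(1, k + 1):
        idx = np.arange(n - d)
        A[idx, idx + d] = 1.0 / d  # upper off-diagonal
        A[idx + d, idx] = 1.0 / d  # mirrored entry keeps A symmetric
    return A


def gft_basis(A: np.ndarray):
    """Eigendecomposition of a real symmetric matrix.

    np.linalg.eigh returns real eigenvalues and orthonormal real eigenvectors.
    """
    eigvals, U = np.linalg.eigh(A)
    return eigvals, U


def gft(x: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Forward GFT: project the frame onto the eigenvector basis."""
    return U.T @ x


def igft(x_hat: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Inverse GFT: U is orthogonal, so its inverse is its transpose."""
    return U @ x_hat


n = 8
A = build_adjacency(n)
eigvals, U = gft_basis(A)

rng = np.random.default_rng(0)
frame = rng.standard_normal(n)  # stand-in for one frame of speech samples
coeffs = gft(frame, U)          # real-valued graph-frequency features
recon = igft(coeffs, U)         # recovers the frame up to rounding error
```

Because the adjacency matrix is real and symmetric, the coefficients are real and the transform is exactly invertible, which is the property that lets a real-valued network replace a complex-valued one.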
Furthermore, by combining the convolution-augmented transformer (conformer) with the convolutional recurrent network (CRN), this study constructs a conformer-based network with graph Fourier transform (GFT-conformer). The GFT-conformer is essentially a convolutional encoder-decoder (CED) with four two-stage conformer blocks (TS-conformers) that capture both local and global dependencies of the features along the time and graph-frequency dimensions, and it estimates masking-based targets to achieve better speech enhancement. Moreover, because the characteristics of speech and noise differ across graph frequency components, this study introduces the learnable graph ratio mask (LGRM), which allows the mask range to be controlled separately for each graph frequency component. This enables fine-grained denoising of the individual components and further improves the speech enhancement performance of the GFT-conformer. The proposed GFT-conformer with LGRM is evaluated on the Voice Bank+DEMAND and Deep Xi datasets using five commonly used metrics. Experimental results show that it outperforms several state-of-the-art DNN-based time-domain and time-frequency-domain methods while having the smallest model size (1.4M parameters).
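The idea of controlling the mask range separately for each graph frequency component can be sketched as follows. This is not the paper's LGRM definition, which the abstract does not state; it assumes one plausible form in which each component `k` has its own positive upper bound, so that the mask lies in (0, upper_k). The names `range_limited_mask` and `beta_raw` are hypothetical, and only the forward pass is shown; in a training framework `beta_raw` would be a trainable parameter.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def softplus(z: np.ndarray) -> np.ndarray:
    # Maps any real value to a strictly positive one.
    return np.log1p(np.exp(z))


def range_limited_mask(logits: np.ndarray, beta_raw: np.ndarray) -> np.ndarray:
    """Ratio mask with a separate range per graph frequency component.

    logits:   (frames, n_freq) network outputs before squashing
    beta_raw: (n_freq,) unconstrained per-component parameters (assumed form)
    """
    upper = softplus(beta_raw)      # positive upper bound for each component
    return upper * sigmoid(logits)  # broadcasts over the frame axis


rng = np.random.default_rng(0)
frames, n_freq = 4, 8
noisy_gft = rng.standard_normal((frames, n_freq))  # stand-in GFT features
logits = rng.standard_normal((frames, n_freq))     # stand-in network output
beta_raw = np.linspace(-1.0, 1.0, n_freq)          # would be learned in practice

mask = range_limited_mask(logits, beta_raw)
enhanced_gft = mask * noisy_gft                    # element-wise masking
```

A fixed sigmoid mask would confine every component to (0, 1); making the bound a per-component parameter lets the model attenuate some components more aggressively than others.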