ZHANG Tianqi, LUO Qingyu, ZHANG Huizhi, FANG Rong. Speech Enhancement Method Based on Complex Spectrum Mapping with Efficient Transformer[J]. JOURNAL OF SIGNAL PROCESSING, 2024, 40(2): 406-416. DOI: 10.16798/j.issn.1003-0530.2024.02.018

Speech Enhancement Method Based on Complex Spectrum Mapping with Efficient Transformer

Abstract: Convolutional neural networks (CNNs) perform well in speech enhancement but struggle to capture global features. In recent years, the Transformer has shown an advantage in modeling long-range dependencies, yet it suffers from the loss of local detail features and a large number of parameters. To exploit the complementary strengths of CNNs and Transformers, this study investigates a single-channel speech enhancement network that fuses novel convolutional modules with an efficient Transformer under complex spectrum mapping. The network consists of an encoding layer, a transmission layer, and a dual-branch decoding layer. In the encoder-decoder part, a collaborative learning block is designed to supervise the interactive information, which reduces the number of parameters and improves the backbone network's ability to extract complex-spectrum features. In the transmission layer, a time-frequency spatial attention Transformer module is proposed to model sub-band and full-band speech information, making full use of acoustic characteristics to model local spectral patterns and capture harmonic dependencies. By combining this module with a channel attention branch, a learnable dual-branch attention fusion (DAF) mechanism is designed to extract contextual features from the spatial and channel perspectives and to enhance multi-dimensional information transmission. Finally, a Gaussian-weighted asymptotic network is built as the intermediate transmission layer, in which stacked DAF modules produce a weighted-sum output that makes full use of deep features and makes the decoding process more robust. Ablation and comprehensive comparison experiments were conducted on the English VoiceBank-DEMAND dataset, the Chinese THCHS30 corpus, and 115 kinds of ambient noise. The results show that, compared with most recent related network models, the proposed method achieves better subjective and objective scores with the fewest parameters (0.68 × 10⁶), demonstrating outstanding enhancement performance and generalization ability.
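To make the dual-branch attention fusion (DAF) idea described above more concrete, the following is a minimal sketch, assuming a PyTorch implementation: a spatial branch that applies self-attention along the frequency axis (sub-band) and the time axis (full-band), a squeeze-and-excitation style channel attention branch, and a learnable scalar that fuses the two. All module names, dimensions, and the fusion rule are illustrative assumptions based only on the abstract, not the authors' implementation.

# Hypothetical sketch of a dual-branch attention fusion (DAF) block; details are assumptions.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention over a (B, C, T, F) feature map."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w


class TimeFreqAttention(nn.Module):
    """Self-attention along the frequency axis (sub-band) then the time axis (full-band)."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.freq_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm_f = nn.LayerNorm(channels)
        self.norm_t = nn.LayerNorm(channels)

    def forward(self, x):  # x: (B, C, T, F)
        b, c, t, f = x.shape
        # Sub-band modelling: attend across frequency bins within each frame.
        xf = x.permute(0, 2, 3, 1).reshape(b * t, f, c)
        xf = self.norm_f(xf + self.freq_attn(xf, xf, xf)[0])
        # Full-band modelling: attend across time frames for each frequency bin.
        xt = xf.view(b, t, f, c).permute(0, 2, 1, 3).reshape(b * f, t, c)
        xt = self.norm_t(xt + self.time_attn(xt, xt, xt)[0])
        return xt.view(b, f, t, c).permute(0, 3, 2, 1)  # back to (B, C, T, F)


class DAF(nn.Module):
    """Dual-branch fusion: time-frequency spatial attention branch + channel attention branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = TimeFreqAttention(channels)
        self.channel = ChannelAttention(channels)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight (assumed form)

    def forward(self, x):
        return self.alpha * self.spatial(x) + (1 - self.alpha) * self.channel(x)


if __name__ == "__main__":
    x = torch.randn(2, 16, 100, 64)  # (batch, channels, time frames, frequency bins)
    print(DAF(16)(x).shape)          # torch.Size([2, 16, 100, 64])

In the paper's transmission layer, several such DAF blocks would be stacked and their outputs combined by a weighted sum (described as Gaussian-weighted in the abstract); the single learnable weight used here is only a placeholder for that fusion scheme.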