YANG Jibin, HUANG Xiang, ZHANG Xiongwei, ZHANG Qiang, MEI Pengcheng. Sound Event Localization and Detection Model Based on Multi-View Attention[J]. JOURNAL OF SIGNAL PROCESSING, 2024, 40(2): 385-395. DOI: 10.16798/j.issn.1003-0530.2024.02.016

Sound Event Localization and Detection Model Based on Multi-View Attention

In recent years, the performance of sound event localization and detection (SELD) methods based on deep learning has improved rapidly. In practical applications, however, the presence of multiple sound sources makes it difficult for existing SELD models to accurately extract the spatiotemporal information of deep features, which seriously degrades performance. To exploit the key information contained in learned multi-channel deep representations, this study investigated a SELD model fused with multi-view attention, called the multi-view attention network (MVANet). First, the model adopted a soft-parameter-sharing network as the basic architecture to realize interactive learning between different tasks and to compute a multi-channel deep representation. Based on a comparison of different channel attention mechanisms, we chose multi-head self-attention, which attends to intra-channel features, together with a lightweight implementation of channel attention called efficient channel attention (ECA), which attends to inter-channel features. The multi-view attention mechanism helps the model pay more attention to the key features of the deep representation from the channel, time, and frequency perspectives, enriching the high-dimensional feature information. Second, based on a comparison of the performance of the ECA module and the soft-parameter-sharing architecture at different positions, we chose the best scheme for extracting multi-view attention, improving the model's feature representations to the maximum extent. Experimental results showed that the MVANet model improved all localization and detection metrics compared with the baseline methods on the TAU-NIGENS Spatial Sound Events 2020 dataset, which contains overlapping acoustic events of the same category. In particular, in scenarios where multiple sound sources coexist, the detection error rate was reduced by 0.03 and the localization error by 1.5°.
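The abstract describes two complementary attention views over the deep representation: multi-head self-attention within channels (over the time/frequency content) and ECA across channels. A minimal PyTorch sketch of how such a block could be wired is given below; the class names (`ECA`, `MultiViewAttentionBlock`), the (batch, channel, time, frequency) tensor layout, the head count, and the residual connection are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient channel attention (ECA): inter-channel attention computed by a
    1-D convolution over globally pooled channel descriptors."""

    def __init__(self, channels: int, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                   # x: (B, C, T, F)
        y = self.avg_pool(x)                                # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2))      # (B, 1, C)
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1)) # (B, C, 1, 1)
        return x * y                                        # re-weight channels


class MultiViewAttentionBlock(nn.Module):
    """Hypothetical multi-view attention block: multi-head self-attention attends
    to intra-channel (time/frequency) features, then ECA attends to inter-channel
    features. Layout and hyperparameters are assumptions for illustration."""

    def __init__(self, channels: int, freq_bins: int, num_heads: int = 4):
        super().__init__()
        # Each time frame is treated as one token of dimension channels * freq_bins.
        self.mhsa = nn.MultiheadAttention(channels * freq_bins, num_heads, batch_first=True)
        self.eca = ECA(channels)

    def forward(self, x):                                   # x: (B, C, T, F)
        b, c, t, f = x.shape
        seq = x.permute(0, 2, 1, 3).reshape(b, t, c * f)    # frames as tokens
        attn, _ = self.mhsa(seq, seq, seq)                  # attention over time
        x = attn.reshape(b, t, c, f).permute(0, 2, 1, 3) + x  # residual connection
        return self.eca(x)                                  # inter-channel attention


# Example: a 64-channel deep representation with 100 time frames and 64 frequency bins
x = torch.randn(2, 64, 100, 64)
block = MultiViewAttentionBlock(channels=64, freq_bins=64, num_heads=4)
print(block(x).shape)  # torch.Size([2, 64, 100, 64])
```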
