Sound Event Localization and Detection Model Based on Multi-View Attention
Abstract: In recent years, deep-learning-based methods have substantially improved the performance of sound event localization and detection (SELD). However, when multiple overlapping sound sources are present in a scene, accurately estimating the spatiotemporal information of the sources remains difficult, leaving considerable room for improvement. To fully exploit the key information contained in multi-channel deep representations, this paper proposes the Multi-View Attention Network (MVANet). First, a soft-parameter-sharing architecture is adopted to enable interactive learning between the detection and localization tasks and to compute a multi-channel deep representation. Based on a comparison of different channel attention structures, a lightweight Efficient Channel Attention (ECA) module, which attends to inter-channel features, is combined with a Multi-Head Self-Attention (MHSA) module, which attends to intra-channel features, so that the model focuses on the key features of the deep representation from three views: channel, time, and frequency, enriching the high-dimensional feature information. Second, the performance of the ECA module and the soft-parameter-sharing architecture at different positions within MVANet is compared to determine their optimal placement and to maximize the model's ability to mine key features. Experimental results on the TAU-NIGENS Spatial Sound Events 2020 dataset, which contains overlapping sound events of the same category, show that MVANet improves both detection and localization performance over the baseline methods: in multi-source scenarios, the detection error rate is reduced by 0.03 and the localization error by 1.5°.
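For concreteness, the sketch below illustrates in PyTorch one plausible form of the multi-view attention block described above: an ECA module for the channel view, followed by multi-head self-attention applied along the time and frequency axes. The ECA kernel-size rule follows the original ECA-Net formulation; everything else (the order of the three views, the residual connections, the head count, and the class names ECA and MultiViewAttention) is an illustrative assumption, since the abstract does not specify the implementation.

```python
import math
import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient Channel Attention (Wang et al., 2020): global average
    pooling, then a 1-D convolution across the channel axis, with the
    kernel size chosen adaptively from the channel count."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1  # kernel size must be odd
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        w = x.mean(dim=(2, 3))                    # (B, C) channel descriptor
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # 1-D conv over channels
        w = torch.sigmoid(w)
        return x * w[:, :, None, None]            # rescale each channel


class MultiViewAttention(nn.Module):
    """Hypothetical multi-view block: ECA for the channel view, plus
    multi-head self-attention along the time and frequency axes.
    The serial arrangement and residual connections are assumptions."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.eca = ECA(channels)
        self.time_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, F)
        x = self.eca(x)  # channel view
        b, c, t, f = x.shape
        # Time view: attend over the T frames of each frequency bin.
        xt = x.permute(0, 3, 2, 1).reshape(b * f, t, c)
        xt, _ = self.time_attn(xt, xt, xt)
        x = x + xt.reshape(b, f, t, c).permute(0, 3, 2, 1)
        # Frequency view: attend over the F bins of each time frame.
        xf = x.permute(0, 2, 3, 1).reshape(b * t, f, c)
        xf, _ = self.freq_attn(xf, xf, xf)
        x = x + xf.reshape(b, t, f, c).permute(0, 3, 1, 2)
        return x


# Example: a 64-channel representation with 100 frames and 64 frequency bins.
x = torch.randn(2, 64, 100, 64)
y = MultiViewAttention(channels=64, num_heads=4)(x)  # shape preserved: (2, 64, 100, 64)
```

One design note on this sketch: attending over time and frequency separately (with the other axis folded into the batch) keeps the attention cost linear in the folded axis, rather than quadratic in the full time-frequency product, which matters for long spectrogram inputs.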