Spatial Information Enhancement Method for Sound Event Localization and Detection

Abstract: Sound event localization and detection (SELD) comprises two subtasks: direction-of-arrival estimation (DOAE) and sound event detection (SED). Among SELD models, the convolutional recurrent neural network (CRNN) is one of the most widely adopted, valued for its efficiency and scalability, yet it has two inherent limitations. First, the CRNN uses convolutional neural networks (CNNs) to extract features from individual audio channels; since multichannel time-frequency features are three-dimensional, this per-channel extraction discards the correlation information between channels. These inter-channel correlations carry spatial cues intricately linked to the location of the sound source, so the loss of spatial information inevitably limits the precision of the CRNN's direction-of-arrival estimates. Second, most CRNN models are trained with the cross-entropy loss function, whose competitive mechanism for separating inter-class features causes feature dispersion in the classification output of the model and thus degrades its classification performance. To solve these problems, this paper introduces a graph convolutional recurrent neural network (GCRNN) optimized with a new hybrid loss function. The GCRNN uses graph convolutional neural networks (GCNs) as feature extractors, aggregating information across the audio feature channels; this not only extracts spatial information but also preserves it as the model is optimized. Because the choice of loss function strongly affects training stability, the proposed hybrid loss does not abandon the cross-entropy loss entirely; instead, it combines the cross-entropy loss with an angular margin softmax function in a weighted manner, incrementally improving classification performance without compromising stability. The angular margin softmax narrows the intra-class feature distance and widens the inter-class feature distance by increasing the discriminative difficulty for intra-class features and introducing angular margins between classes. To validate the proposed method, we conducted experiments on publicly available datasets, examining the acquisition method and configuration of the adjacency matrix in the GCN as well as the angular margin and scaling factor in the hybrid loss function, and comparing against state-of-the-art methods. The experimental results show that the GCRNN with the hybrid loss function outperforms other SELD models in terms of the location-dependent error rate, F1 score, localization recall, and overall SELD score.
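To make the channel-aggregation idea concrete, below is a minimal PyTorch sketch of a graph convolution over audio feature channels, assuming each channel of the multichannel time-frequency input is one graph node. It illustrates the general technique rather than the authors' exact architecture: the ChannelGraphConv module, the learnable row-normalized adjacency matrix, and all tensor sizes are assumptions made for illustration (the paper itself studies several ways of acquiring and configuring the adjacency matrix).

```python
# Minimal sketch (not the authors' exact architecture): one GCN layer that
# treats each audio feature channel as a graph node and mixes information
# across channels through a learnable, row-normalized adjacency matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelGraphConv(nn.Module):
    def __init__(self, num_channels: int, in_dim: int, out_dim: int):
        super().__init__()
        # Learnable inter-channel adjacency, initialized to the identity
        # (each channel initially attends only to itself).
        self.adj = nn.Parameter(torch.eye(num_channels))
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, features), features = flattened time-frequency bins.
        a = F.softmax(self.adj, dim=-1)        # row-normalize neighbor weights
        x = torch.einsum('ij,bjf->bif', a, x)  # aggregate features across channels
        return F.relu(self.weight(x))          # node-wise linear transform + nonlinearity

# Usage with assumed shapes: 4-channel input (e.g. a tetrahedral microphone
# array), 64 mel bins x 100 frames flattened per channel.
feats = torch.randn(8, 4, 64 * 100)
gcn = ChannelGraphConv(num_channels=4, in_dim=64 * 100, out_dim=256)
out = gcn(feats)   # -> (8, 4, 256)
```

A learnable adjacency lets training discover which channel pairs share the most useful spatial cues; fixed adjacencies (e.g. fully connected or correlation-based) are equally plausible configurations of the kind the paper compares.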

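The hybrid loss can be sketched in the same spirit. The version below uses the additive cosine margin (AM-Softmax) form for simplicity; the paper's angular margin may instead be applied to the angle itself, i.e. cos(θ + m), and the margin m, scaling factor s, and mixing weight lam shown here are illustrative assumptions, not the paper's tuned values.

```python
# Minimal sketch of a weighted hybrid of cross-entropy and an additive-margin
# softmax term; m (margin), s (scale), and lam (mixing weight) are illustrative.
import torch
import torch.nn.functional as F

def hybrid_loss(features, weight, labels, logits, m=0.2, s=30.0, lam=0.5):
    # Standard cross-entropy on the model's unnormalized class logits:
    # retained so training stability is not compromised.
    ce = F.cross_entropy(logits, labels)

    # Margin term: cosine similarity between L2-normalized embeddings and
    # class weight vectors, with margin m subtracted from the target class,
    # which tightens intra-class features and separates inter-class ones.
    cos = F.normalize(features, dim=1) @ F.normalize(weight, dim=0)  # (batch, classes)
    margin = F.one_hot(labels, cos.size(1)).float() * m
    ams = F.cross_entropy(s * (cos - margin), labels)

    return lam * ce + (1.0 - lam) * ams

# Usage with assumed shapes: 128-dim embeddings, 13 sound event classes.
feats = torch.randn(8, 128, requires_grad=True)
W = torch.randn(128, 13, requires_grad=True)
labels = torch.randint(0, 13, (8,))
loss = hybrid_loss(feats, W, labels, logits=feats @ W)
loss.backward()
```

Keeping the plain cross-entropy term in the weighted sum is the design choice the abstract emphasizes: the margin term alone can destabilize early training, so the mixture trades a small amount of discriminative pressure for stability.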
     
