XIAO Jian, GUO Haiyan, WANG Tingting, et al. Spatial information enhancement method for sound event localization and detection[J]. Journal of Signal Processing, 2024, 40(12): 2206-2218. DOI: 10.12466/xhcl.2024.12.009.

Spatial Information Enhancement Method for Sound Event Localization and Detection

Sound event localization and detection (SELD) is an emerging field encompassing two subtasks: direction-of-arrival estimation (DOAE) and sound event detection (SED). Among existing SELD models, the convolutional recurrent neural network (CRNN) has recently gained interest and is widely adopted for its efficiency and scalability. However, inherent limitations of the CRNN framework call for new solutions to improve its spatial feature extraction and classification performance. The CRNN model uses convolutional neural networks (CNNs) to extract features from individual audio channels; because multichannel time-frequency features are three-dimensional, using a CNN as the feature extractor loses the correlation information between channels. These inter-channel correlations carry spatial cues intricately linked to the location of the sound source, and their loss inevitably limits the precision of the model's direction-of-arrival estimation. In addition, most CRNN models are trained with the cross-entropy loss function. Owing to its competitive mechanism for distinguishing inter-class features, cross-entropy causes feature dispersion in the classification output of the model, consequently degrading its classification performance. To address these problems, this paper introduces a graph convolutional recurrent neural network (GCRNN) and a new hybrid loss function to optimize the model during training. The GCRNN uses graph convolutional networks (GCNs) as feature extractors, aggregating information across audio feature channels. The GCN not only extracts spatial information but also preserves it as the model undergoes optimization.
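The channel aggregation described above can be sketched as a single graph-convolution step in which each audio channel is a graph node and the adjacency matrix encodes inter-channel correlation. This is a minimal illustrative sketch, not the paper's implementation; the correlation-based adjacency, dimensions, and function names are assumptions.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step over audio channels (illustrative sketch).

    H: (C, F) node features, one node per audio channel.
    A: (C, C) adjacency matrix encoding inter-channel correlation (assumption).
    W: (F, F_out) learnable weight matrix.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric degree normalization
    H_agg = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H  # aggregate neighbor channels
    return np.maximum(H_agg @ W, 0.0)       # ReLU activation

# Example: 4 microphone channels, each with an 8-dim spectro-temporal feature.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
A = np.abs(np.corrcoef(H))                  # correlation-based adjacency (assumption)
W = rng.normal(size=(8, 16))
out = gcn_layer(H, A, W)
print(out.shape)  # (4, 16)
```

Because the aggregation mixes features across channel nodes, inter-channel spatial cues are retained in the output, unlike a per-channel CNN.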
Furthermore, the choice of loss function significantly influences the stability of the model. Our proposed hybrid loss function therefore does not discard the cross-entropy loss entirely but combines it with the angular margin softmax in a weighted manner, aiming to incrementally improve classification performance without compromising stability. Whereas cross-entropy only separates classes, the angular margin softmax narrows intra-class feature distances and widens inter-class distances by increasing the discriminative difficulty for intra-class features and introducing angular margins between classes. To validate the effectiveness of the proposed method, we conducted experiments on several publicly available datasets, examining the acquisition method and configuration of the adjacency matrix in the GCN, as well as the angular margin and scaling factor in the hybrid loss function, and comparing against state-of-the-art methods. The experimental results indicate that the GCRNN model with the hybrid loss function outperforms other sound event localization and detection models in terms of location-dependent error rate, F1 score, localization recall, and overall SELD score.
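The weighted combination above can be sketched as follows. The angular-margin term follows the common ArcFace-style formulation (add a margin to the target-class angle, then scale); the weight `alpha`, margin `m`, and scale `s` are illustrative placeholders, not the paper's values.

```python
import numpy as np

def cross_entropy(logits, y):
    """Mean cross-entropy over a batch, numerically stabilized."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

def angular_margin_loss(emb, W, y, m=0.2, s=16.0):
    """ArcFace-style angular margin softmax (sketch): add margin m to the
    target-class angle, scale cosines by s, then apply cross-entropy."""
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    W_n = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = np.clip(emb_n @ W_n, -1.0, 1.0)   # cosine similarity to class weights
    theta = np.arccos(cos)
    logits = cos.copy()
    logits[np.arange(len(y)), y] = np.cos(theta[np.arange(len(y)), y] + m)
    return cross_entropy(s * logits, y)

def hybrid_loss(logits, emb, W, y, alpha=0.5):
    """Weighted mix of cross-entropy (stability) and angular-margin terms
    (intra-class compactness). alpha is a hypothetical weighting."""
    return alpha * cross_entropy(logits, y) + (1 - alpha) * angular_margin_loss(emb, W, y)

# Example: batch of 8 embeddings, 5 sound-event classes.
rng = np.random.default_rng(1)
emb = rng.normal(size=(8, 16))
W = rng.normal(size=(16, 5))
logits = emb @ W
y = rng.integers(0, 5, size=8)
loss = hybrid_loss(logits, emb, W, y)
```

Setting `alpha=1.0` recovers plain cross-entropy, which reflects the design intent: the margin term is blended in gradually rather than replacing the stable cross-entropy objective outright.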