Deep Feature Fusion of Multi-Dimensional Neural Networks for Bird Call Recognition
Abstract: To further improve the accuracy of bird call monitoring during nocturnal migration, this paper proposes a bird call recognition algorithm based on deep feature fusion of multi-dimensional neural networks. First, log-scaled Mel spectrograms are extracted as the training features of a VGG Style model to enhance the energy distribution of the time-frequency spectrogram, and mixup is applied to generate virtual samples and reduce over-fitting. The pre-trained VGG Style network is then used as a feature extractor to produce deep features for each bird call. Given the complementarity of models operating on inputs of different dimensions, a 1D CNN-LSTM, a 2D VGG Style network, and a 3D DenseNet121 are employed as feature extractors to generate high-level features. For the 1D CNN-LSTM, wavelet decomposition is used as the pooling method: a 9-level wavelet decomposition is applied to each bird call in both the time and frequency domains, and multi-level LBP features are generated to capture richer time-frequency information. The fully connected layers of the CNN-LSTM and DenseNet121 are also optimized to reduce the number of model parameters and improve real-time performance. Finally, the deep features of the three models are fused and fed to a K-nearest-neighbor classifier. On the public CLO-43SD dataset of 5428 flight calls spanning 43 bird species, the proposed method achieves a balanced accuracy of 93.89%, exceeding the latest fusion of Mel-VGG and Subnet-CNN by 7.58%.
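The front end described in the abstract (log-scaled Mel spectrograms plus mixup augmentation) can be illustrated with a short Python sketch. This is not the authors' code: the sample rate, FFT size, Mel-band count, and mixup alpha below are assumed values chosen for illustration, using librosa and NumPy.

```python
# Minimal sketch of the preprocessing stage; all parameter values are assumptions.
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=22050, n_fft=1024, hop_length=512, n_mels=128):
    """Load a bird call clip and compute its log-scaled Mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    # Log scaling enhances the energy distribution of the time-frequency spectrogram.
    return librosa.power_to_db(mel, ref=np.max)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two spectrogram/label pairs into one virtual training sample."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2   # labels assumed to be one-hot vectors
    return x, y
```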
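The fusion stage can likewise be sketched: deep features exported from the three networks are concatenated per clip and passed to a shallow K-nearest-neighbor classifier. The file names, feature shapes, neighbor count, and train/test split below are hypothetical placeholders, not the paper's actual setup.

```python
# Minimal sketch of deep feature fusion with a shallow KNN classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# Per-clip deep features from the 1D CNN-LSTM, 2D VGG Style, and 3D DenseNet121
# extractors, assumed to have been exported beforehand as NumPy arrays.
feats_1d = np.load("cnn_lstm_features.npy")     # shape: (n_samples, d1)
feats_2d = np.load("vgg_style_features.npy")    # shape: (n_samples, d2)
feats_3d = np.load("densenet121_features.npy")  # shape: (n_samples, d3)
labels   = np.load("labels.npy")                # shape: (n_samples,)

# Fuse by concatenating the three deep feature vectors of each sample.
fused = np.concatenate([feats_1d, feats_2d, feats_3d], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    fused, labels, test_size=0.2, stratify=labels, random_state=0)

# Shallow classifier on the fused representation, evaluated by balanced accuracy.
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)
print("balanced accuracy:", balanced_accuracy_score(y_test, clf.predict(X_test)))
```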