Abstract:
For the large-scale weakly labeled data set provided by Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge, we built a multi-class sound event detection system based on mel filter bank (Fbank) features, convolutional neural networks (CNN), and recurrent neural networks (RNN). In this paper, we analyze the partial derivatives that arise during back-propagation through two common existing pooling layers, attention and linear softmax. Building on the linear softmax pooling layer, we propose the "exponential learnable power function softmax" pooling layer. Our experimental results show that, compared to the first-place model in the DCASE competition, the sound event detection system using the proposed "exponential learnable power function softmax" pooling function increases the clip-level F1 score of sound event prediction from 0.556 to 0.652 and the frame-level F1 score from 0.555 to 0.583, while the frame-level error rate (ER) changes from 0.660 to 0.667.
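
As a rough illustration of the two existing pooling functions named above, the following minimal sketch aggregates frame-level probabilities for one event class into a single clip-level probability. The formulas follow the forms commonly used in the multiple-instance-learning literature and are an assumption here, since the abstract does not state them. The function names, the `eps` stabilizer, and the example values are hypothetical. The proposed "exponential learnable power function softmax" is not sketched because its definition is not given in this abstract.

```python
def linear_softmax_pool(frame_probs, eps=1e-7):
    """Linear softmax pooling (assumed form): sum(p_i^2) / sum(p_i).

    Each frame probability acts as its own weight, so confident
    frames dominate the clip-level prediction.
    """
    numerator = sum(p * p for p in frame_probs)
    denominator = sum(frame_probs) + eps  # eps guards against division by zero
    return numerator / denominator


def attention_pool(frame_probs, weights, eps=1e-7):
    """Attention pooling (assumed form): sum(w_i * p_i) / sum(w_i).

    In a real network the weights w_i come from a learned branch;
    here they are passed in directly for illustration.
    """
    numerator = sum(w * p for w, p in zip(weights, frame_probs))
    denominator = sum(weights) + eps
    return numerator / denominator


# Hypothetical frame-level probabilities for one event class.
frame_probs = [0.1, 0.2, 0.9, 0.8]

clip_linear = linear_softmax_pool(frame_probs)
# Uniform attention weights reduce attention pooling to a plain average.
clip_attention = attention_pool(frame_probs, [1.0, 1.0, 1.0, 1.0])
```

With these example values, linear softmax pooling yields a clip-level probability close to 0.75, while uniformly weighted attention pooling yields about 0.5. This shows how linear softmax leans toward the high-probability frames.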