Abstract:
Conventional singing voice detection algorithms depend heavily on elaborate feature engineering, whereas algorithms based on deep neural networks can largely dispense with it because they learn features directly through their strong learning capability. However, the learned features are treated equally within the network, even though they differ in importance to the result. To address this problem, a convolutional neural network with embedded scaled dot-product attention is proposed. The network learns an attention distribution over the feature maps and then adjusts the weights of the feature maps, so that the convolutional neurons can “observe” the features according to their importance, which improves overall performance. Experiments on two public datasets, with comparison against the baseline model, demonstrate the effectiveness of the algorithm.
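As a rough illustration of the re-weighting idea, the sketch below applies scaled dot-product self-attention across a small set of flattened feature maps. It is a minimal sketch under stated assumptions, not the network described here: the abstract does not say how queries, keys, and values are formed, so the maps themselves are used for all three, without learned projections. The function name `attend_feature_maps` and the toy input sizes are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def attend_feature_maps(feature_maps):
    """Re-weight feature maps with scaled dot-product self-attention.

    feature_maps: list of C flattened maps, each a list of d floats.
    Assumption: queries, keys, and values are the maps themselves
    (no learned projections), which may differ from the paper's design.
    Returns (reweighted_maps, attention); attention is C x C, rows sum to 1.
    """
    d = len(feature_maps[0])
    scale = math.sqrt(d)  # the "scaled" part: divide scores by sqrt(d)
    attention = [
        softmax([dot(q, k) / scale for k in feature_maps])
        for q in feature_maps
    ]
    # Each output map is an attention-weighted sum of all input maps.
    reweighted = [
        [sum(w * v[i] for w, v in zip(weights, feature_maps)) for i in range(d)]
        for weights in attention
    ]
    return reweighted, attention


# Toy input (hypothetical sizes): 4 feature maps, each flattened to 6 values.
random.seed(0)
maps = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
reweighted, attention = attend_feature_maps(maps)
```

Each row of `attention` is a probability distribution over the input maps, so a map that correlates strongly with the others contributes more to the re-weighted output. In a trained network the scores would come from learned parameters rather than from the raw maps.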