Audio Scene Classification Based on the Audio Spectrogram Transformer

  • Abstract: Audio scene classification is an important part of scene understanding. Learning the features of audio scenes and classifying them accurately strengthens the interaction between machines and their environment, an ability whose importance is self-evident in the era of big data. Because classification performance depends on dataset size, while real tasks often face a severe shortage of data, this paper proposes a data augmentation and network pre-training strategy that combines the Audio Spectrogram Transformer model with the audio scene classification task. First, the log-Mel energy spectrogram of the audio signal is extracted and fed into the model; then the model's dynamic interaction capability strengthens the spatial relationships within the audio sequence; finally, classification is completed from the class token vector. The proposed method is evaluated on the public DCASE2019task1 and DCASE2020task1 datasets, reaching classification accuracies of 96.489% and 93.227% respectively, a clear improvement over existing algorithms. This shows that the method is suitable for high-precision audio scene classification and lays a foundation for intelligent devices to perceive environmental content and detect environmental dynamics with high precision.
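The front end of the pipeline described above converts each audio clip into a log-Mel energy spectrogram, which the transformer then treats as a 2-D input. The following is a minimal NumPy-only sketch of that feature-extraction step; the sample rate, FFT size, hop length, and number of Mel bands are illustrative assumptions, not the paper's exact configuration (which likely uses a library such as librosa or torchaudio with its own settings).

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=64):
    """Illustrative log-Mel extraction; parameter values are assumptions."""
    # Frame the signal with a Hann window and compute the power spectrum.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # (n_frames, n_fft//2 + 1)

    # Build a triangular Mel filterbank spanning 0 Hz .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # Apply the filterbank and take the log of the Mel-band energies.
    mel_energies = power @ fbank.T
    return np.log(mel_energies + 1e-10)  # shape: (n_frames, n_mels)

# One second of noise at 16 kHz -> a (98, 64) log-Mel "image" that an
# AST-style model would split into patches and attend over.
x = np.random.randn(16000)
feat = log_mel_spectrogram(x)
print(feat.shape)
```

The resulting time-by-frequency matrix is what the Audio Spectrogram Transformer consumes: it is cut into patches, linearly embedded, and processed by self-attention, with the prepended class token providing the final classification vector.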

     
