Video Sound Source Localization with Attention Mechanism

  • Abstract: Visual events in daily life are usually accompanied by sounds, which suggests a latent connection between video frames and audio; this paper calls it the joint representation of audio-visual synchronization. To learn this joint representation, the video frames and the audio of a video signal are fused, and the designed neural network is trained to predict whether they are synchronized in time. Unlike traditional audio-visual information fusion methods, this paper introduces an attention mechanism that uses the Pearson correlation coefficient between the video features and the audio feature to weight the video stream jointly in the temporal and spatial dimensions, tying the video frames more closely to the audio. Based on the learned joint representation of audio-visual synchronization, sound source localization is then carried out with the class activation map (CAM) method. Experimental results show that the proposed attention-based audio-visual synchronization detection model determines more accurately whether the audio and video frames of a given clip are synchronized, i.e., it learns the joint representation better, and can therefore localize the sound source effectively.
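
    To make the two core ideas in the abstract concrete, below is a minimal sketch (not the authors' code) of Pearson-correlation attention and CAM-style localization. The Pearson correlation between the pooled audio feature and the video feature vector at each spatio-temporal location serves as an attention weight over the video stream, normalized jointly across time and space; a CAM is then a class-weighted sum of channel maps. All tensor shapes, layer sizes, and function names here are illustrative assumptions.

    ```python
    # Sketch of Pearson-correlation attention over video features, plus a
    # CAM-style localization map. Shapes and names are assumptions, not the
    # paper's actual implementation.
    import torch

    def pearson_attention(video_feats, audio_feat, eps=1e-8):
        """Weight video features by Pearson correlation with the audio feature.

        video_feats: (T, C, H, W) per-frame feature maps from a visual CNN
        audio_feat:  (C,)         pooled feature vector from an audio network
        Returns the weighted video features and the (T, H, W) attention map.
        """
        T, C, H, W = video_feats.shape
        v = video_feats.permute(0, 2, 3, 1).reshape(-1, C)   # (T*H*W, C)
        # Pearson correlation = cosine similarity of mean-centered vectors.
        v_c = v - v.mean(dim=1, keepdim=True)
        a_c = audio_feat - audio_feat.mean()
        corr = (v_c * a_c).sum(dim=1) / (v_c.norm(dim=1) * a_c.norm() + eps)
        # Normalize jointly over time and space, matching the abstract's
        # "simultaneously in the time dimension and the spatial dimension".
        weights = torch.softmax(corr, dim=0).view(T, H, W)
        weighted = video_feats * weights.unsqueeze(1)        # broadcast over C
        return weighted, weights

    def cam(feature_maps, class_weights):
        """Class activation map (Zhou et al., 2016): class-weighted channel sum.

        feature_maps:  (C, H, W) final conv features for one frame
        class_weights: (C,)      weights of the target class in the last layer
        """
        return torch.einsum('c,chw->hw', class_weights, feature_maps)

    # Toy usage with random tensors standing in for real network outputs.
    video_feats = torch.randn(8, 512, 14, 14)   # 8 frames of 512-d feature maps
    audio_feat = torch.randn(512)
    weighted, attn = pearson_attention(video_feats, audio_feat)
    heatmap = cam(weighted[0], torch.randn(512))
    print(weighted.shape, attn.shape, heatmap.shape)
    # torch.Size([8, 512, 14, 14]) torch.Size([8, 14, 14]) torch.Size([14, 14])
    ```

    In this reading, the attention map answers "which locations, at which times, co-vary with the audio", and the CAM projects the synchronization classifier's evidence back onto the frame to localize the sound source; upsampling the heatmap to the input resolution would give the final localization.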
