Abstract:
Visual events in daily life are usually accompanied by sound, which suggests a latent connection between video frames and the corresponding audio; in this paper, this connection is called the joint representation of audio-visual synchronization. To learn this joint representation, the video frames and the audio of a video signal are fused, and a neural network is trained to predict whether the frames and the audio are synchronized in time. Unlike traditional audio-visual fusion methods, this paper introduces an attention mechanism that uses the Pearson correlation coefficient between the video features and the audio features to weight the video stream along both the temporal and spatial dimensions, so that the video frames are related to the audio more closely. Based on the learned joint representation of audio-visual synchronization, sound source localization is then carried out with the class activation map method. Experimental results show that the proposed attention-based audio-visual synchronization detection model determines more accurately whether the audio and the video frames are synchronized, indicating that the joint representation of audio-visual synchronization is learned better; the sound source can then be localized effectively.
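As an illustration of the correlation-based attention described above, the following sketch shows one possible way to weight per-frame video features by their Pearson correlation with audio features in both the temporal and spatial dimensions. The tensor shapes, function names, and softmax normalization are assumptions made for illustration only, not the paper's exact formulation.

import torch

def pearson_corr(x, y, dim=-1, eps=1e-8):
    """Pearson correlation coefficient between x and y along `dim`."""
    xc = x - x.mean(dim=dim, keepdim=True)
    yc = y - y.mean(dim=dim, keepdim=True)
    num = (xc * yc).sum(dim=dim)
    den = xc.norm(dim=dim) * yc.norm(dim=dim) + eps
    return num / den

def correlation_attention(video_feat, audio_feat):
    """
    video_feat: (B, T, H, W, C) per-frame spatial feature maps (assumed shape)
    audio_feat: (B, T, C)       per-frame audio embeddings (assumed shape)
    Returns video features re-weighted in time and space by their
    correlation with the audio embedding.
    """
    B, T, H, W, C = video_feat.shape

    # Temporal weights: correlate the spatially pooled video feature with
    # the audio feature at each time step, then normalize over time.
    pooled = video_feat.mean(dim=(2, 3))                      # (B, T, C)
    t_weights = pearson_corr(pooled, audio_feat, dim=-1)      # (B, T)
    t_weights = torch.softmax(t_weights, dim=1)

    # Spatial weights: correlate each spatial location with the audio
    # feature, then normalize over the H*W locations of each frame.
    a = audio_feat[:, :, None, None, :].expand(B, T, H, W, C)
    s_weights = pearson_corr(video_feat, a, dim=-1)           # (B, T, H, W)
    s_weights = torch.softmax(s_weights.view(B, T, -1), dim=-1).view(B, T, H, W)

    # Apply both weightings to the video stream.
    return video_feat * t_weights[:, :, None, None, None] * s_weights[..., None]

The weighted video features would then be fused with the audio features for the synchronization prediction task; the exact fusion and network architecture are described in the body of the paper, not in this sketch.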