Abstract:
Cross-modal retrieval aims to retrieve data in one modality using a query from another modality, and it has been an active research topic in multimedia and information retrieval. However, most existing work focuses on text-to-image, text-to-video, and lyrics-to-audio tasks; limited research has been conducted on retrieving suitable music for a specified video, or vice versa. Moreover, much of the existing research relies on metadata such as keywords, tags, or descriptions. This paper introduces a content-based method, implemented with a novel two-branch neural network, that learns joint embeddings in a shared subspace for computing the similarity between audio and video data. The contributions of the proposed method are threefold: 1) a feature selection module that uses an attention-based LSTM to choose the top-k audio feature representations and an attention-based Inception model to extract visual feature representations; 2) a sample mining mechanism that restructures the triplets used for training, making training more effective; 3) a novel training loss that combines inter-modal similarity and intra-modal invariance. Promising experimental results, evaluated with Recall@K, MAP, and precision-recall curves, show that the proposed model can be applied to the task of video-music cross-modal retrieval.
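To make the joint-embedding idea concrete, the following is a minimal sketch (not the paper's actual implementation) of a two-branch network with a loss that combines an inter-modal triplet term and an intra-modal invariance term. It assumes PyTorch, precomputed audio/video feature vectors, matched pairs aligned by batch index, and hypothetical group labels for the intra-modal term; all names, dimensions, margins, and the weight `lam` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchEmbedding(nn.Module):
    """Projects audio and video feature vectors into a shared subspace."""

    def __init__(self, audio_dim=128, video_dim=2048, embed_dim=256):
        super().__init__()
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        self.video_branch = nn.Sequential(
            nn.Linear(video_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, audio_feat, video_feat):
        # L2-normalise so that cosine similarity is a plain dot product.
        a = F.normalize(self.audio_branch(audio_feat), dim=-1)
        v = F.normalize(self.video_branch(video_feat), dim=-1)
        return a, v


def inter_modal_triplet(a, v, margin=0.2):
    """Bi-directional triplet (ranking) loss across modalities.

    Matched pairs are assumed aligned by batch index (a[i] <-> v[i]);
    every other item in the batch serves as a negative.
    """
    sim = a @ v.t()                      # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)        # similarity of the matched pairs
    mask = ~torch.eye(a.size(0), dtype=torch.bool, device=a.device)
    cost_av = F.relu(margin - pos + sim)[mask].mean()      # audio -> video
    cost_va = F.relu(margin - pos.t() + sim)[mask].mean()  # video -> audio
    return cost_av + cost_va


def intra_modal_invariance(x, groups, margin=0.2):
    """One plausible reading of the intra-modal term: within a modality,
    items sharing a (hypothetical) group label should be more similar to
    each other than to items from other groups."""
    sim = x @ x.t()
    same = groups.unsqueeze(0) == groups.unsqueeze(1)
    eye = torch.eye(len(x), dtype=torch.bool, device=x.device)
    pos, neg = sim[same & ~eye], sim[~same]
    if pos.numel() == 0 or neg.numel() == 0:
        return x.new_zeros(())
    return F.relu(margin - pos.mean() + neg.mean())


def combined_loss(a, v, groups, lam=0.5):
    """Inter-modal similarity plus weighted intra-modal invariance."""
    return (inter_modal_triplet(a, v)
            + lam * (intra_modal_invariance(a, groups)
                     + intra_modal_invariance(v, groups)))


# Usage example with a random batch of 8 paired audio/video feature vectors
# and made-up group ids standing in for shared categories (e.g. genres).
model = TwoBranchEmbedding()
a, v = model(torch.randn(8, 128), torch.randn(8, 2048))
loss = combined_loss(a, v, groups=torch.randint(0, 3, (8,)))
```

The triplet structure above is what the abstract's sample mining mechanism would operate on: mining would decide which negatives enter the hinge terms, whereas this sketch simply uses every other in-batch item as a negative.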