Research on a Content Matching Method for Audio Databases in Cross-Modal Retrieval

  • Abstract: Cross-modal retrieval aims to retrieve data in one modality using a query from another modality, and it has become an active research problem in multimedia and information retrieval. However, most existing work focuses on tasks such as text-to-image, text-to-video, and lyrics-to-audio retrieval; limited research has addressed retrieving suitable music for a given video (or vice versa) across modalities. Moreover, much of the existing work on video-audio retrieval relies on metadata such as keywords, tags, or descriptions. This paper introduces a retrieval method based purely on the content of the audio and video modalities. The method uses a novel two-branch neural network to learn joint embeddings in a shared subspace, in which the similarity between audio and video data can be computed directly. The contributions are threefold: 1) an attention mechanism is introduced into the feature extractors of each modality, yielding feature selection models — an attention-based LSTM that selects the top-k audio feature representations and an attention-based Inception model that selects the visual feature representations; 2) a sample mining mechanism discards uninformative samples and reconstructs the triplets used for training, making training more efficient; 3) the model is trained with a loss function that combines inter-modal similarity with preservation of intra-modal structure. Experimental results evaluated with Recall@K, mean average precision (MAP), and precision-recall curves show that the proposed model achieves high accuracy on both the VEGAS dataset and a self-built dataset, and can be applied to the task of video-music cross-modal retrieval.
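
To make the two-branch architecture described in the abstract concrete, the following is a minimal PyTorch sketch of one plausible realization. It assumes precomputed audio segment features and precomputed per-frame visual features (e.g., from an Inception network); the layer sizes, the single-layer LSTM, and the simple soft-attention pooling are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Soft attention over a sequence of segment/frame features."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (batch, steps, dim)
        w = torch.softmax(self.score(x), dim=1)    # attention weights over steps
        return (w * x).sum(dim=1)                  # weighted pooling -> (batch, dim)

class TwoBranchNet(nn.Module):
    """Maps audio and video features into a shared embedding subspace."""
    def __init__(self, audio_dim=128, video_dim=1024, embed_dim=512):
        super().__init__()
        # audio branch: LSTM over segments, then attention-based selection
        self.audio_rnn = nn.LSTM(audio_dim, 256, batch_first=True)
        self.audio_att = AttentionPool(256)
        self.audio_fc = nn.Linear(256, embed_dim)
        # video branch: attention over precomputed per-frame CNN features
        self.video_att = AttentionPool(video_dim)
        self.video_fc = nn.Linear(video_dim, embed_dim)

    def embed_audio(self, a):                      # a: (batch, T, audio_dim)
        h, _ = self.audio_rnn(a)
        return F.normalize(self.audio_fc(self.audio_att(h)), dim=-1)

    def embed_video(self, v):                      # v: (batch, T, video_dim)
        return F.normalize(self.video_fc(self.video_att(v)), dim=-1)
```

With L2-normalized embeddings, the cross-modal similarity between a video and an audio clip reduces to a dot product, which is what the shared subspace is learned for.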
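
Contributions 2 and 3 (sample mining and the combined loss) can likewise be sketched as a bidirectional triplet loss with in-batch hard-negative mining plus an intra-modal structure term. The margin value, the weighting `struct_w`, and the use of an MSE penalty between intra-modal similarity matrices are assumptions chosen for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mined_triplet_loss(va, au, margin=0.2, struct_w=0.1):
    """va, au: L2-normalized video/audio embeddings of matched pairs, shape (batch, d).
    Row i of va and row i of au form a positive pair; all other rows are negatives."""
    sim = va @ au.t()                              # (batch, batch) cross-modal similarities
    pos = sim.diag()                               # similarity of each matched pair
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))     # exclude positives from mining
    hard_v2a = neg.max(dim=1).values               # hardest audio negative per video
    hard_a2v = neg.max(dim=0).values               # hardest video negative per audio
    # bidirectional margin ranking on the mined (hardest) triplets only
    loss = F.relu(margin - pos + hard_v2a).mean() \
         + F.relu(margin - pos + hard_a2v).mean()
    # intra-modal structure term: keep within-modality neighborhood
    # similarities consistent across the two modalities
    loss = loss + struct_w * F.mse_loss(va @ va.t(), au @ au.t())
    return loss
```

In use, this loss would be computed per batch on the outputs of `embed_video` and `embed_audio` from the network sketch above; mining only the hardest in-batch negatives plays the role of discarding uninformative triplets during training.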

     
