Image-text Cross-modal Matching Method Based on Stacked Cross Attention

Abstract: Image-text cross-modal matching is an important task at the intersection of computer vision and natural language processing. However, traditional image-text matching methods either consider only the matching of global images with global text or only the matching of local image regions with local text, and thus cannot fully and effectively exploit both local and global information, so the extracted features are incomplete. Alternatively, they extract global image and text features in a simplistic way, so local details are not highlighted and the global features cannot fully express the global semantics. To address this problem, this paper proposes an image-text cross-modal matching method based on stacked cross attention. While still matching local images with local text, the method introduces stacked cross attention into the matching of global images with global text; attention is used to further mine global feature information and refine the global image and text features, thereby improving image-text cross-modal retrieval. Experiments on the Flickr30K and MS-COCO public datasets show that the overall performance R@sum (Recall@sum) of the model improves on the baseline SCAN by 3.9% and 3.7%, respectively. These results demonstrate the effectiveness of the proposed method for image-text cross-modal retrieval and its advantages over existing methods.
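
To make the mechanism concrete, the following is a minimal NumPy sketch of the text-to-image form of stacked cross attention in the style of the SCAN baseline: each word attends over the image regions, the attended image vectors are compared with the words, and the word-level relevances are pooled into one image-sentence similarity. The function names, the smoothing temperature, the similarity normalization, and the average pooling are illustrative assumptions, not the exact configuration of the proposed model.

# Minimal sketch of SCAN-style text-to-image stacked cross attention.
# All names and hyperparameters below are illustrative assumptions.
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    # L2-normalize vectors along the given axis.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def stacked_cross_attention(regions, words, smooth=9.0):
    # regions: (k, d) image region features, e.g. from an object detector
    # words:   (n, d) word features, e.g. from a recurrent text encoder
    v = l2norm(regions)                                  # (k, d)
    e = l2norm(words)                                    # (n, d)

    # Region-word cosine similarities, thresholded at zero.
    s = np.clip(e @ v.T, 0.0, None)                      # (n, k)
    s = l2norm(s, axis=0)                                # normalize each region's column over words

    # For every word, attend over all image regions (softmax over regions).
    alpha = np.exp(smooth * s)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)     # (n, k)
    attended = alpha @ v                                 # (n, d) attended image vector per word

    # Cosine relevance of each word to its attended image vector,
    # pooled by averaging into a single image-sentence score.
    relevance = np.sum(l2norm(attended) * e, axis=1)     # (n,)
    return float(relevance.mean())

# Toy usage with random features: 36 regions and 12 words of dimension 512.
rng = np.random.default_rng(0)
print(stacked_cross_attention(rng.normal(size=(36, 512)),
                              rng.normal(size=(12, 512))))

In SCAN the word-region relevances may also be pooled with LogSumExp instead of averaging; the extension described in the abstract applies the same attention mechanism when matching global image and global text features.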

     
