WANG Hongbin, ZHANG Zhiliang, LI Huafeng. Image-text Cross-modal Matching Method Based on Stacked Cross Attention[J]. JOURNAL OF SIGNAL PROCESSING, 2022, 38(2): 285-299. DOI: 10.16798/j.issn.1003-0530.2022.02.008

Image-text Cross-modal Matching Method Based on Stacked Cross Attention

  • Image-text cross-modal matching is an important task at the intersection of computer vision and natural language processing. However, traditional image-text cross-modal matching methods consider either only global image and global text matching or only local image and local text matching. They cannot fully and effectively combine local and global information, so the extracted feature information is incomplete. Alternatively, they simply extract global image and global text features without highlighting local details, so the global features cannot fully express their global semantic information. To solve this problem, this paper proposes an image-text cross-modal matching method based on stacked cross attention. While considering the matching of local image and local text, the method introduces stacked cross attention into global image and global text matching and further mines global feature information through attention, so that the global image and global text features are optimized, thereby improving the performance of image-text cross-modal retrieval. Experiments were carried out on two public datasets, Flickr30K and MS-COCO. Compared with the baseline (SCAN), the overall performance of the model, measured by R@sum (Recall@sum), improved by 3.9% and 3.7%, respectively. This shows that the proposed method is effective for image-text cross-modal retrieval and has certain advantages over existing methods.
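The abstract reports R@sum without defining it. As a minimal sketch, assuming the convention common in image-text retrieval work (R@sum is the sum of Recall@1, Recall@5 and Recall@10 over both image-to-text and text-to-image retrieval), the metric could be computed as below. The function names `recall_at_k` and `r_sum` are hypothetical, and the sketch assumes exactly one matching caption per image, stored at the same index, which is simpler than the actual Flickr30K and MS-COCO protocols with several captions per image.

```python
def recall_at_k(sim, k):
    """Percentage of queries (rows) whose ground-truth item ranks in the top k.

    Assumption: sim[i][j] is the similarity of query i and candidate j,
    and the ground-truth candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the ground truth: how many candidates score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def r_sum(sim, ks=(1, 5, 10)):
    """Sum of Recall@K over both retrieval directions (assumed definition of R@sum)."""
    # Transposing image-to-text scores gives text-to-image scores.
    sim_t = [list(col) for col in zip(*sim)]
    image_to_text = sum(recall_at_k(sim, k) for k in ks)
    text_to_image = sum(recall_at_k(sim_t, k) for k in ks)
    return image_to_text + text_to_image


# Toy image-by-text similarity matrix (made-up values for illustration only).
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
toy_score = r_sum(toy_sim)
```

Under this definition the maximum possible score is 600, since each of the six recall terms is at most 100.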
