Abstract:
With the explosive growth of multi-modal data, cross-modal retrieval, the most commonly used way to search such data, has received extensive attention. However, most existing deep learning methods use only the output of the final fully connected layer as the modality-specific high-level semantic representation, ignoring the semantic correlations among features of different scales extracted at multiple levels, and are therefore limited. In this paper, we propose a cross-modal hash retrieval method based on a feature pyramid fusion representation network. By extracting and fusing features at multiple levels and scales, the network mines the semantic correlations among modality features of different scales and makes full use of modality-specific features, so that the semantic representation it outputs is more representative. Finally, a loss function composed of three terms, an inter-modal loss, an intra-modal loss, and a Hamming-space loss, is designed to train the model. Experimental results on the MIRFLICKR-25K and NUS-WIDE datasets show that the proposed method achieves good cross-modal retrieval performance.
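To make the three-term training objective concrete, the following is a minimal sketch of how such a composite loss could be assembled. The abstract does not specify the exact formulations, so each term here is a common placeholder choice (a pairwise similarity-preserving likelihood term and a quantization penalty for the Hamming-space term); the function names, weights alpha/beta/gamma, and the toy data are hypothetical, not the authors' definitions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a three-term loss (inter-modal, intra-modal,
# Hamming-space) for cross-modal hashing; formulations are assumptions,
# not taken from the paper.

def pairwise_similarity_loss(a, b, sim):
    """Negative log-likelihood of pairwise similarity labels.
    a, b: (N, K) real-valued codes; sim: (N, N) 0/1 similarity matrix."""
    inner = 0.5 * a @ b.t()                      # scaled inner products
    return (F.softplus(inner) - sim * inner).mean()

def triple_loss(img_codes, txt_codes, sim, alpha=1.0, beta=1.0, gamma=0.5):
    # Inter-modal term: preserve similarity across image and text codes.
    inter = pairwise_similarity_loss(img_codes, txt_codes, sim)
    # Intra-modal terms: preserve similarity within each modality.
    intra = (pairwise_similarity_loss(img_codes, img_codes, sim)
             + pairwise_similarity_loss(txt_codes, txt_codes, sim))
    # Hamming-space (quantization) term: push codes toward {-1, +1}.
    hamming = ((img_codes - img_codes.sign()).pow(2).mean()
               + (txt_codes - txt_codes.sign()).pow(2).mean())
    return alpha * inter + beta * intra + gamma * hamming

if __name__ == "__main__":
    N, K = 8, 16                                  # batch size, code length
    img = torch.randn(N, K, requires_grad=True)   # toy image-branch codes
    txt = torch.randn(N, K, requires_grad=True)   # toy text-branch codes
    labels = torch.randint(0, 2, (N, 3)).float()  # toy multi-label vectors
    sim = (labels @ labels.t() > 0).float()       # 1 if any label is shared
    loss = triple_loss(img, txt, sim)
    loss.backward()
    print(float(loss))
```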