A Survey of Referring Image Segmentation
Abstract: Referring image segmentation, an active problem at the intersection of computer vision and natural language processing, aims to segment the region of an image that a natural language expression refers to. As the underlying deep learning techniques have matured and large-scale datasets have become available, the task has attracted wide attention from researchers. This paper reviews and analyzes the development of referring image segmentation algorithms. First, according to how multimodal information is encoded and decoded, we divide existing algorithms into two categories, those based on multimodal information fusion and those based on multi-scale information fusion, and describe them systematically, with emphasis on methods built on the CNN-LSTM framework, structurally complex modular methods, and graph-based methods. Next, we summarize the typical datasets and mainstream evaluation metrics used for the task and report their statistics. We then compare the performance of existing referring image segmentation models through experiments, which further verify the strengths and weaknesses of each model. Finally, we discuss the problems in existing methods and outline future directions, arguing that complex referring expressions call for multi-step, explicit reasoning.