Attention-Driven RGB-D Salient Object Detection Based on Diffusion Model


Abstract: Salient object detection is an important research direction in computer vision that aims to extract, from complex backgrounds, the regions that most attract human visual attention. Traditional RGB salient object detection methods rely solely on color information and struggle with the diversity and interference found in complex scenes. RGB-D salient object detection therefore augments RGB images with depth information, enabling better perception of the spatial structure of a scene and improving detection performance. However, most existing RGB-D salient object detection methods are based on convolutional neural networks or vision Transformers and rely mainly on discriminative learning, i.e., prediction via hard classification of pixel-level saliency probabilities. This often leads to model overconfidence, which limits detection performance in complex scenes. To address this problem, this paper proposes an attention-driven RGB-D salient object detection method based on a diffusion model. By exploiting the diffusion model's progressive noise-addition and stepwise denoising processes, the prediction is refined in a generative manner, reducing the risk of erroneous estimates caused by model overconfidence and improving detection performance in complex scenes. First, a Pyramid Vision Transformer backbone extracts features at four levels from the RGB image and the depth map separately. Then, the proposed dual-stream attention fusion module fully fuses the two cross-modal features at each corresponding level, after which a progressive fusion module merges the fused features from the four levels. Finally, the result is injected into the denoising network as conditional information to constrain the output of the diffusion model and generate the predicted saliency map. Experimental results show that the proposed method outperforms existing mainstream methods on multiple metrics across seven public benchmark datasets (DUT, LFSD, NJU2K, NLPR, SIP, SSD, and STERE), demonstrating its effectiveness.
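The "progressive noise addition" the abstract refers to is, in standard diffusion models, the DDPM forward process. The following NumPy sketch illustrates how a ground-truth saliency map would be noised during training; the schedule length and beta range are illustrative defaults, not the paper's actual configuration, and the conditioning on fused RGB-D features is omitted here.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; alpha_bars[t] = prod_{s<=t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    abar = alpha_bars[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
_, alpha_bars = make_schedule()
x0 = rng.random((64, 64))              # toy "saliency map" with values in [0, 1]
x_T, eps = q_sample(x0, 999, alpha_bars, rng)
# At t = T-1 the cumulative alpha_bar is near zero, so x_T is essentially
# pure Gaussian noise; the denoising network learns to reverse this process,
# conditioned (in the paper's method) on the fused RGB-D features.
```

At inference time the reverse holds: starting from Gaussian noise, the conditional denoising network iteratively removes noise step by step to generate the predicted saliency map.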

     
