Attention-Driven RGB-D Salient Object Detection Based on Diffusion Model
Abstract
Salient object detection is an important research direction in computer vision, aiming to extract the regions of an image that most attract human visual attention from complex backgrounds. Traditional RGB salient object detection methods rely only on the image's color information and struggle with the diversity and interference of complex scenes. RGB-D salient object detection therefore augments the RGB image with depth information, enabling better perception of the spatial structure of the scene and further improving detection performance. However, most existing RGB-D salient object detection methods are built on convolutional neural networks or vision Transformers and rely mainly on discriminative learning, i.e., they predict saliency by hard classification of pixel-level saliency probabilities. This often leads to model overconfidence, which limits detection performance in complex scenes. To address these problems, this paper proposes an attention-driven RGB-D salient object detection method based on the diffusion model. By exploiting the progressive noising and stepwise denoising processes of the diffusion model, the prediction is refined in a generative manner, which reduces the risk of erroneous estimates caused by model overconfidence and improves detection performance in complex scenarios. First, the Pyramid Vision Transformer is adopted to extract four levels of features from the RGB image and the depth map. Then, the proposed dual-stream attention fusion module fully fuses the features of the two modalities at each corresponding level. The fused features from the four levels are subsequently aggregated by a progressive fusion module.
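The cross-modal fusion step can be illustrated with a minimal sketch. The abstract names a "dual-stream attention fusion module" but does not specify its internals, so the gating scheme below (each modality re-weighted by a channel-attention vector derived from the other) is purely a hypothetical illustration, not the paper's actual design; the function name and shapes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_stream_attention_fusion(f_rgb, f_depth):
    """Hypothetical cross-modal attention fusion for one feature level.

    f_rgb, f_depth: (C, H, W) feature maps from the same PVT stage.
    Each stream is gated by a channel-attention vector computed from
    the *other* stream (global average pooling + sigmoid), and the two
    gated streams are summed into a single fused map.
    """
    # channel descriptors via global average pooling over H, W
    w_rgb = sigmoid(f_rgb.mean(axis=(1, 2)))      # (C,)
    w_depth = sigmoid(f_depth.mean(axis=(1, 2)))  # (C,)
    # cross-gating: depth attention modulates RGB and vice versa
    f_rgb_att = f_rgb * w_depth[:, None, None]
    f_depth_att = f_depth * w_rgb[:, None, None]
    return f_rgb_att + f_depth_att
```

In a real network the pooling/sigmoid gate would be replaced by learned attention layers, and one such fusion would run at each of the four feature levels before progressive aggregation.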
Finally, the aggregated features are injected into the denoising network as conditional information, constraining the output of the diffusion model so that it generates the predicted saliency map. Experimental results show that the proposed method outperforms existing mainstream methods on multiple metrics across seven public benchmark datasets (DUT, LFSD, NJU2K, NLPR, SIP, SSD, and STERE), demonstrating its effectiveness.
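The generative refinement described above follows the standard diffusion recipe: a forward process progressively noises the ground-truth saliency map, and a learned reverse process denoises it step by step, conditioned on the fused image features. The sketch below shows only the closed-form forward step and one DDPM-style reverse step; `eps_pred` stands in for the output of the conditional denoising network (which in the paper would take the fused features as conditioning input). The schedule values and function names are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule; alphas_bar[t] = prod_{s<=t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas_bar = np.cumprod(1.0 - betas)
    return betas, alphas_bar

def q_sample(x0, t, alphas_bar, noise):
    """Forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

def p_step(x_t, t, betas, alphas_bar, eps_pred, rng):
    """One reverse (denoising) step given the network's noise estimate.

    eps_pred is what the conditional denoiser would output; here it is
    simply passed in, since the network itself is not part of the sketch.
    """
    alpha_t = 1.0 - betas[t]
    mean = (x_t - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_pred) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # final step: no noise added
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```

Sampling a saliency map would iterate `p_step` from pure noise at t = T-1 down to t = 0, with the fused RGB-D features steering `eps_pred` at every step.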