Self-Supervised Visual Odometry with RGB-D Bimodal Mutual Guidance
Abstract
Visual odometry, which estimates camera poses from image sequences, plays a vital role in robotic navigation, autonomous driving, and augmented reality. Self-supervised visual odometry has become a research focus because it does not require ground-truth pose data: it jointly optimizes pose and depth estimation by constructing a self-supervised loss based on geometric consistency across views. A key challenge in this framework is designing network architectures that fully exploit the complementary pose-related cues in both RGB images and depth maps. Existing methods often overlook the heterogeneous characteristics and complementary value of the two modalities, leading to insufficient cue utilization and limited pose estimation accuracy. To address this issue, this paper proposes a self-supervised visual odometry method with RGB-D bimodal mutual guidance, named BMG-VO. Specifically, an RGB-guided depth detail enhancement module is designed to incorporate texture and color priors from RGB images into the shallow layers of the depth encoding branch. This enhances the ability of depth features to capture fine details, such as edges and textures, thereby improving the robustness of feature matching. Meanwhile, a depth-guided RGB semantic enhancement module reinforces the high-level features of the RGB encoding branch with geometric structure and intra-class consistency cues derived from depth maps. This increases robustness against illumination variations and provides more reliable matching features for pose regression. Additionally, a unimodal filtering module highlights the most essential pose-related cues within each individual modality. Extensive experiments on the KITTI dataset demonstrate that BMG-VO achieves higher pose estimation accuracy than state-of-the-art self-supervised methods while also attaining strong depth estimation performance.
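The mutual-guidance idea described above can be illustrated with a minimal sketch: each guidance module derives a per-pixel gate from the guiding modality and uses it to modulate the guided modality's features, with a residual connection preserving the original cues. The function `guide` and its 1x1-convolution-style weighting below are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal NumPy sketch of bimodal mutual guidance (assumed design, not the
# paper's exact modules): a gate computed from one modality modulates the
# features of the other, plus a residual connection.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def guide(guided, guiding, w):
    """Modulate `guided` features with a gate derived from `guiding`.

    guided, guiding: (C, H, W) feature maps; w: (C, C) channel-mixing
    weights acting like a 1x1 convolution. Returns a (C, H, W) map.
    """
    c, h, w_ = guiding.shape
    # 1x1 convolution expressed as a matrix product over channels
    proj = (w @ guiding.reshape(c, -1)).reshape(c, h, w_)
    gate = sigmoid(proj)            # per-pixel weights in (0, 1)
    return guided * gate + guided   # residual keeps the original cues

rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((8, 4, 4))
depth_feat = rng.standard_normal((8, 4, 4))
w = rng.standard_normal((8, 8)) * 0.1

# RGB-guided depth detail enhancement (shallow layers)
depth_enh = guide(depth_feat, rgb_feat, w)
# Depth-guided RGB semantic enhancement (deep layers)
rgb_enh = guide(rgb_feat, depth_feat, w)
```

Because the gate lies in (0, 1) and a residual is added, each enhanced feature stays between one and two times the magnitude of the original, so guidance rescales rather than replaces the unimodal features.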