Self-Supervised Visual Odometry with RGB-D Bimodal Mutual Guidance
Abstract: Visual odometry, which estimates camera poses from image sequences, plays a vital role in robotic navigation, autonomous driving, and augmented reality. Self-supervised visual odometry has become a research focus because it requires no ground-truth pose data: it optimizes pose and depth estimation jointly by synthesizing cross-view images under geometric consistency and constructing a self-supervised loss. Within this framework, a key challenge is designing network architectures that fully exploit the complementary pose-related cues carried by RGB images and depth maps. Existing methods often overlook the heterogeneous characteristics and complementary value of the two modalities, leading to insufficient cue mining and limited pose estimation accuracy. To address this problem, this paper proposes a self-supervised visual odometry method with RGB-D bimodal mutual guidance, named BMG-VO. Specifically, an RGB-guided depth detail enhancement module is designed to inject texture and color priors from RGB images into the shallow layers of the depth encoding branch. This strengthens the ability of depth features to capture fine details such as edges and textures, thereby improving the robustness of feature matching.
Meanwhile, a depth-guided RGB semantic enhancement module is introduced to reinforce the high-level features of the RGB encoding branch with geometric structure and intra-class consistency cues derived from depth maps. This increases robustness against illumination variations and provides more reliable matching features for pose regression. Additionally, a unimodal filtering module is employed to highlight the most essential pose-related cues within each individual modality. Extensive experiments on the KITTI dataset demonstrate that BMG-VO achieves higher accuracy in pose estimation compared to state-of-the-art self-supervised methods while also attaining excellent depth estimation performance.
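The mutual-guidance idea described above can be illustrated with a minimal sketch: features from one modality produce a sigmoid gate that re-weights and residually injects cues into the other modality's branch. This is a hypothetical simplification in NumPy for illustration only; the paper's actual module design (convolutions, layer placement, learned parameters) is not specified in the abstract, and the function `guided_enhance` and the toy feature shapes are assumptions.

```python
import numpy as np

def sigmoid(x):
    # Element-wise logistic function, maps values into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def guided_enhance(target, guide):
    """Hypothetical cross-modal guidance: a sigmoid gate computed
    from the guide modality selects which guide cues to inject,
    and the gated cues are added residually to the target features."""
    gate = sigmoid(guide)          # per-element attention weights in (0, 1)
    return target + gate * guide   # residual injection of guided cues

rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((4, 8, 8))    # toy RGB feature map (C, H, W)
depth_feat = rng.standard_normal((4, 8, 8))  # toy depth feature map (C, H, W)

# RGB-guided depth detail enhancement (shallow layers of the depth branch)
depth_enh = guided_enhance(depth_feat, rgb_feat)

# Depth-guided RGB semantic enhancement (deep layers of the RGB branch)
rgb_enh = guided_enhance(rgb_feat, depth_feat)
```

In this sketch both directions of guidance share one gating form; in the described method the two modules differ, targeting fine details in the depth branch and semantic/intra-class consistency in the RGB branch.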