Abstract:
Monocular depth estimation is an important but ill-posed problem in understanding scene geometry. Although recent supervised learning methods have achieved promising results, they require vast amounts of ground-truth depth data, which is costly to acquire. In addition, previous methods are affected by well-known difficulties such as moving objects, occlusions, and lighting variation, which lead to unsatisfactory performance, particularly at object edges and in low-texture regions. To tackle these problems, we propose a self-attention-based multi-stage network for unsupervised monocular depth estimation. Our method has the following features: 1) a multi-stage network provides stronger constraints and supervision for depth estimation during training; 2) the network is optimized with a mask-weighted reconstruction loss and a left-right disparity consistency loss; 3) a self-attention module is adopted to capture more contextual information. Experimental results on the KITTI dataset show that the method achieves state-of-the-art performance, indicating that the proposed design effectively improves monocular depth estimation.
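
To make the left-right disparity consistency term concrete, the following is a minimal sketch of one common formulation, not necessarily the exact loss used in this work. It assumes rectified stereo pairs in which a left-image pixel at column x matches the right-image pixel at column x - d, where d is the left disparity; the function names `sample_row` and `left_right_consistency` are hypothetical, and pixels whose match falls outside the image are simply skipped.

```python
import math


def sample_row(row, x):
    """Linearly interpolate `row` at fractional column x.

    Returns None when x lies outside the row (assumed handling:
    such pixels are ignored by the loss).
    """
    if x < 0 or x > len(row) - 1:
        return None
    x0 = int(math.floor(x))
    x1 = min(x0 + 1, len(row) - 1)
    t = x - x0
    return (1.0 - t) * row[x0] + t * row[x1]


def left_right_consistency(disp_left, disp_right):
    """Mean absolute difference between the left disparity map and the
    right disparity map sampled at the matching column x - d_left.

    Both maps are lists of rows (H x W) of disparities in pixels.
    """
    total, count = 0.0, 0
    for row_l, row_r in zip(disp_left, disp_right):
        for x, d in enumerate(row_l):
            projected = sample_row(row_r, x - d)
            if projected is None:
                continue  # match falls outside the right image
            total += abs(d - projected)
            count += 1
    return total / count if count else 0.0
```

When the two disparity maps agree at matching pixels the term is zero, and it grows with their disagreement, which is what lets it act as a training signal without ground-truth depth.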