Improved ENet Urban-Scene Semantic Segmentation for Embedded Terminals
Abstract: Existing real-time semantic segmentation algorithms for embedded terminals handle fine image detail poorly and lose a large share of spatial features. To address this problem, a method that fuses spatial features from different levels is proposed. Building on a modified ENet, an inverted residual structure is used in the down-sampling layers to capture more image information during network computation and to reduce the loss of spatial features caused by down-sampling. A spatial attention mechanism weights the down-sampled spatial feature information, enhancing relevant features and weakening irrelevant ones. The method carries the high-resolution spatial features of the shallow layers to the deep layers of the network and fuses them with the semantically rich deep features, which improves the network's ability to process image details. Experiments on the NVIDIA Jetson TX2, NVIDIA Jetson Xavier NX, and NVIDIA Jetson Xavier AGX embedded terminals show that the proposed network runs at the same speed as ENet, while the mean Intersection over Union (mIoU) improves by 2.9% on the Cityscapes dataset and by 3.2% on the CamVid dataset.
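For readers unfamiliar with the reported metric, the following is a minimal sketch of how mIoU can be computed from predicted and ground-truth class labels. It is an illustration only, not the evaluation code used in the experiments: the function name mean_iou, the flattened-list input format, and the ignore_label parameter are assumptions made for this example, and benchmark toolkits typically accumulate the counts over an entire dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes, ignore_label=None):
    """Mean Intersection over Union for flattened label sequences.

    pred, target: equal-length sequences of integer class ids (assumed format).
    ignore_label: hypothetical id in `target` marking pixels to skip.
    Classes that appear in neither pred nor target are left out of the mean.
    """
    inter = [0] * num_classes  # per class: pixels where pred == target == c
    union = [0] * num_classes  # per class: pixels where pred == c or target == c
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for class p
            union[t] += 1  # false negative for class t
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    # Per-class IoU here is 1/3, 2/3 and 1/2.
    print(mean_iou(pred, target, num_classes=3))
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so their mean is one half. The metric weights every class equally regardless of how many pixels it covers.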