Three-Dimensional Two-Hand Mesh Reconstruction Method with Feature Interaction Adaptation
Abstract: Reconstructing interacting 3D meshes of two hands from a single RGB image is extremely challenging. Mutual occlusion between the hands and their high local-appearance similarity often make part of the extracted features inaccurate, which loses the interaction information between the hands and leaves the reconstructed hand meshes misaligned with the input image. To address these problems, this study first proposes a feature interaction and adaptation module composed of two parts. The first part, feature interaction, preserves the separate left- and right-hand features while generating two new feature representations, and captures the interactive features of the two hands through an interaction attention module. The second part, feature adaptation, uses the same interaction attention module to adapt these interaction features to each hand, injecting global contextual information into the left- and right-hand features. Second, a three-layer graph-convolution refinement network is introduced to regress the vertices of the two hand meshes precisely in a coarse-to-fine manner, progressively refining the hand shape; a feature-alignment module based on an attention mechanism further strengthens the alignment between vertex features and image features, and thereby the alignment between the reconstructed hand meshes and the input image. In addition, a novel multilayer perceptron structure is proposed that learns multiscale feature information through downsampling and upsampling operations, allowing the model to capture both fine-grained and global information across scales and to handle variations in hand appearance and interaction more effectively. Finally, a relative offset loss function is designed to constrain the spatial relationship between the two hands so that the reconstructed meshes respect the relative positioning of the hands in the input image. Extensive quantitative and qualitative experiments on the InterHand2.6M dataset show that the proposed method significantly outperforms existing state-of-the-art methods, reducing the mean per joint position error (MPJPE) and the mean per vertex position error (MPVPE) to 7.19 mm and 7.33 mm, respectively.
Additionally, generalization experiments on the RGB2Hands and EgoHands datasets demonstrate the strong generalization capability of the proposed method: the qualitative results indicate that it adapts well to hand-mesh reconstruction under different environmental backgrounds and achieves high-quality reconstructions in diverse scenarios.
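The abstract describes the interaction attention module only at a high level. The following sketch (in PyTorch; the framework choice is ours, not the paper's) illustrates one plausible form of such a block: cross-attention in which one hand's feature tokens attend to the other hand's tokens, first to capture interaction features and then to adapt them back to each hand. The class name, tensor shapes, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an "interaction attention" block: cross-attention
# between left-hand and right-hand feature tokens. Shapes and dimensions
# are assumptions for illustration only.
import torch
import torch.nn as nn


class InteractionAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Queries come from one hand, keys/values from the other hand,
        # so each hand can absorb context from its interacting counterpart.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feat: torch.Tensor, context_feat: torch.Tensor) -> torch.Tensor:
        # query_feat:   (B, N, C) tokens of the hand being updated
        # context_feat: (B, M, C) tokens of the other hand (interaction context)
        attended, _ = self.cross_attn(query_feat, context_feat, context_feat)
        return self.norm(query_feat + attended)  # residual connection


if __name__ == "__main__":
    B, N, C = 2, 196, 256  # batch size, tokens per hand, channels (assumed)
    left, right = torch.randn(B, N, C), torch.randn(B, N, C)
    attn = InteractionAttention(dim=C)
    left_ctx = attn(left, right)    # left hand attends to right-hand tokens
    right_ctx = attn(right, left)   # right hand attends to left-hand tokens
    print(left_ctx.shape, right_ctx.shape)
```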
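The three-layer graph-convolution refinement network is likewise not specified in detail in the abstract. A minimal sketch of a single graph-convolution layer over mesh-vertex features is shown below; the symmetrically normalized adjacency, the 778-vertex template (MANO-style), and the feature width are assumptions, and a random adjacency stands in for the real mesh topology.

```python
# Minimal sketch of one graph-convolution layer for refining per-vertex
# features of a hand mesh. The adjacency would come from the mesh topology;
# here a random symmetric matrix is used purely as a placeholder.
import torch
import torch.nn as nn


class GraphConvLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, adj: torch.Tensor):
        super().__init__()
        # Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        self.register_buffer("norm_adj", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, V, C) per-vertex features; aggregate neighbours, then transform.
        return torch.relu(self.linear(self.norm_adj @ x))


if __name__ == "__main__":
    V = 778  # assumed vertex count of the hand-mesh template
    adj = (torch.rand(V, V) < 0.01).float()
    adj = ((adj + adj.T) > 0).float()        # placeholder symmetric adjacency
    layer = GraphConvLayer(64, 64, adj)
    print(layer(torch.randn(2, V, 64)).shape)  # torch.Size([2, 778, 64])
```

A coarse-to-fine refinement would stack three such layers, each regressing progressively finer vertex positions.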
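For the proposed multilayer perceptron that learns multiscale features through downsampling and upsampling, one hedged reading, assumed here, is that the resampling acts along the token (vertex) axis with simple skip connections; the actual scales and layout in the paper may differ.

```python
# Illustrative sketch of an MLP that mixes information at several scales by
# downsampling and upsampling the token axis. Scale factors and dimensions
# are assumptions, not the paper's design.
import torch
import torch.nn as nn


class MultiScaleMLP(nn.Module):
    def __init__(self, num_tokens: int = 778, dim: int = 64):
        super().__init__()
        self.down1 = nn.Linear(num_tokens, num_tokens // 2)
        self.down2 = nn.Linear(num_tokens // 2, num_tokens // 4)
        self.mix = nn.Sequential(nn.Linear(num_tokens // 4, num_tokens // 4), nn.GELU())
        self.up1 = nn.Linear(num_tokens // 4, num_tokens // 2)
        self.up2 = nn.Linear(num_tokens // 2, num_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C); the mixing acts along the token axis, so transpose first.
        t = x.transpose(1, 2)                   # (B, C, N)
        d1 = torch.relu(self.down1(t))          # (B, C, N/2)  coarse scale
        d2 = torch.relu(self.down2(d1))         # (B, C, N/4)  coarser scale
        u = torch.relu(self.up1(self.mix(d2)))  # back to (B, C, N/2)
        u = self.up2(u + d1)                    # skip connection, (B, C, N)
        return x + u.transpose(1, 2)            # residual back to (B, N, C)
```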
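The relative offset loss is described only as a constraint on the spatial relationship between the two hands. One simple interpretation, assumed here purely for illustration, is an L1 penalty on the offset of the right hand relative to the left hand compared with the ground-truth offset.

```python
# Hedged sketch of a "relative offset" loss: penalize the error in the
# right-hand-relative-to-left-hand offset. The paper's exact formulation
# may differ; this only illustrates constraining the two hands' relation.
import torch


def relative_offset_loss(pred_left: torch.Tensor, pred_right: torch.Tensor,
                         gt_left: torch.Tensor, gt_right: torch.Tensor) -> torch.Tensor:
    # Each tensor: (B, J, 3) predicted / ground-truth 3D joints (or vertices).
    pred_offset = pred_right - pred_left   # predicted right-to-left offsets
    gt_offset = gt_right - gt_left         # ground-truth offsets
    return torch.abs(pred_offset - gt_offset).mean()
```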
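Finally, the reported metrics follow their standard definitions: MPJPE and MPVPE are the mean Euclidean distance (in millimetres) between predicted and ground-truth 3D joints and mesh vertices, typically computed after root alignment. A reference sketch of this standard computation is given below; the root-joint index is an assumption.

```python
# Standard MPJPE computation: mean Euclidean distance between predicted and
# ground-truth 3D joints after aligning both to a root joint. MPVPE is the
# same computation applied to the (B, V, 3) mesh vertices.
import torch


def mpjpe(pred: torch.Tensor, gt: torch.Tensor, root_idx: int = 0) -> torch.Tensor:
    # pred, gt: (B, J, 3) in millimetres; root-align both sets of joints first.
    pred = pred - pred[:, root_idx:root_idx + 1]
    gt = gt - gt[:, root_idx:root_idx + 1]
    return torch.linalg.norm(pred - gt, dim=-1).mean()
```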