Three-Dimensional Two-Hand Mesh-Reconstruction Method with Feature Interaction Adaptation
Abstract
Reconstructing three-dimensional meshes of two interacting hands from a single RGB image is extremely challenging. Occlusion between the hands and the high local-appearance similarity of the left and right hands typically lead to inaccurate feature extraction, which loses the interaction information between the hands and causes misalignment between the reconstructed mesh and the input image. To address this, we propose a two-part feature-interaction and adaptation module. The first part, feature interaction, preserves the separate features of the left and right hands while generating two new feature representations, and captures the interactive features between the hands through an interaction attention module. The second part, feature adaptation, employs the same interaction attention module to adapt these interaction features to each hand, injecting global contextual information into the features of both hands. To further improve reconstruction accuracy, we introduce a three-layer graph-convolution refinement network that regresses the hand-mesh vertices in a coarse-to-fine manner, progressively refining the details of the hand shapes. An attention-based feature-alignment module strengthens the correspondence between vertex features and image features, ensuring that the reconstructed hand mesh aligns well with the input image and improving the visual accuracy of the reconstruction. We also propose a novel multilayer perceptron structure that learns multiscale feature information: through downsampling and upsampling operations, it captures both fine-grained and global information across scales, allowing the model to handle variations in hand appearance and interaction more effectively. Finally, we introduce a relative offset loss that constrains the spatial relationship between the two hands during reconstruction, so that the predicted meshes preserve the relative positioning of the hands in the input image. Extensive quantitative and qualitative experiments on the InterHand2.6M dataset show that the proposed method significantly outperforms existing state-of-the-art methods, reducing the mean per-joint position error and mean per-vertex position error to 7.19 mm and 7.33 mm, respectively. Generalization experiments on the RGB2Hands and EgoHands datasets further demonstrate the strong generalization capability of the proposed method, and the qualitative results indicate that it adapts well to various environments and achieves high-quality hand-mesh reconstruction in diverse scenarios.
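The feature-interaction and adaptation idea described above can be pictured as cross-attention between per-hand feature tokens: a shared two-hand context is built, and each hand then attends to it to absorb global information. The sketch below is a minimal illustration of that idea only; the layer sizes, module name, and tensor shapes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class InteractionAttention(nn.Module):
    """Minimal sketch: cross-attention that lets one hand's features
    attend to a shared two-hand context (hypothetical layer sizes)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feat, context_feat):
        # query_feat:   (B, N, C) tokens of one hand
        # context_feat: (B, M, C) tokens of the shared two-hand context
        out, _ = self.attn(query_feat, context_feat, context_feat)
        return self.norm(query_feat + out)

# Feature interaction: concatenate both hands into a joint context,
# then adapt that context back into each hand's own feature stream.
B, N, C = 2, 64, 256
left, right = torch.randn(B, N, C), torch.randn(B, N, C)
interaction = torch.cat([left, right], dim=1)   # shared two-hand context
adapt = InteractionAttention(C)
left_out = adapt(left, interaction)             # inject global context into left hand
right_out = adapt(right, interaction)           # and into right hand
```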
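The multiscale multilayer perceptron is described only at a high level. One common way to realize "hierarchical features via downsampling and upsampling" is a U-shaped stack of linear layers over the token dimension, mixing features at a coarse resolution before restoring the fine one. The following is a speculative sketch with made-up dimensions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiScaleMLP(nn.Module):
    """Speculative sketch: mix features at a coarse token scale,
    then upsample back and fuse with the fine-scale features."""
    def __init__(self, num_tokens=64, dim=256, reduce=4):
        super().__init__()
        coarse_tokens = num_tokens // reduce
        self.down = nn.Linear(num_tokens, coarse_tokens)   # downsample over tokens
        self.mix = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.up = nn.Linear(coarse_tokens, num_tokens)     # upsample back

    def forward(self, x):                            # x: (B, N, C)
        coarse = self.down(x.transpose(1, 2))        # (B, C, N // reduce)
        coarse = self.mix(coarse.transpose(1, 2))    # channel mixing at coarse scale
        fine = self.up(coarse.transpose(1, 2)).transpose(1, 2)
        return x + fine                              # fuse fine- and coarse-scale information
```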
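The relative offset loss is stated only as a constraint on the spatial relationship between the two hands. Under the assumption that the offset is the vector between corresponding right- and left-hand vertices and that an L1 penalty is used, a minimal reading could look like the sketch below; the symbols and the penalty choice are assumptions, not the paper's definition.

```python
import torch

def relative_offset_loss(pred_left, pred_right, gt_left, gt_right):
    """Sketch of a relative offset loss: penalize the difference between
    predicted and ground-truth right-to-left vertex offsets (L1, assumed)."""
    pred_offset = pred_right - pred_left   # (B, V, 3) per-vertex offsets
    gt_offset = gt_right - gt_left
    return torch.abs(pred_offset - gt_offset).mean()
```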