A Lightweight Method for Human Body and Hand Mesh Reconstruction
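Summary: Three-dimensional human mesh reconstruction is widely used in downstream tasks such as film production and virtual reality. Existing reconstruction methods, however, focus on higher reconstruction accuracy and better texture representation, and therefore depend on high-performance computing or capture equipment; low-cost, lightweight reconstruction has received little attention. To reduce the cost and hardware requirements of human reconstruction, this paper proposes a lightweight method for human body and hand mesh reconstruction. It decouples the body and hand reconstruction tasks on the basis of a parametric model and designs a separate branch network for each according to its characteristics. Both the body and hand reconstruction branches use an encoder-decoder structure. The body branch has a two-stage encoder: the first stage uses Lite-HRNet and the Canny operator to obtain heatmaps and edge maps, which serve as a proxy representation of the image, and the second stage uses ShuffleNet to extract global features. Its decoder regresses the body parameters as probability distributions through cascaded low-dimensional multilayer perceptrons (MLPs). The hand branch's encoder uses Lite-HRNet as the backbone to obtain multi-resolution feature branches and fuses them through pose pooling into a global feature; its decoder obtains the hand vertices through a depthwise separable convolutional network, estimates shape with an MLP, and solves for the joint rotation parameters from the vertex coordinates using inverse topology mathematics. Compared with existing methods, the parameter count and computational cost are significantly reduced, with 6.12M parameters and a computational cost of 433M overall, while reconstruction quality remains good: the mean per-joint position error (MPJPE) on the Human3.6M dataset is 86.7 mm, and the hand branch achieves a Procrustes-aligned MPJPE (PA-MPJPE) of 10.8 mm on the FreiHAND dataset. The method has also been deployed for inference on mobile devices, running at 79.7 ms per frame (12.5 fps) on a Snapdragon 8 Gen 3 processor, which achieves real-time inference.
Abstract: 3D human body reconstruction shows substantial potential across domains such as film and television production and virtual reality. Prevailing reconstruction methods, however, emphasize reconstruction accuracy and texture detail, and often require high-performance computing or sophisticated acquisition equipment; cost-effective, lightweight reconstruction techniques remain underexplored. To reduce the usage cost and hardware requirements of human body reconstruction, this study decouples body and hand reconstruction on the basis of a parametric human body model. Separate reconstruction networks are tailored to the distinct movement characteristics of the body and the hands, balancing computational cost against performance. Both the body and hand reconstruction modules adopt an encoder-decoder architecture. The encoder of the body reconstruction module has a two-stage design.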
In the first stage, Lite-HRNet and the Canny edge detector produce heatmaps and edge maps. Because lightweight backbone networks struggle to extract adequate features directly from RGB images, these maps serve as a proxy representation of the RGB image, and preliminary features are obtained from them through downsampling and concatenation. In the second stage, ShuffleNet extracts global features. To improve performance, the activation function is modified. To reduce the parameter count while preserving reconstruction quality, the decoder uses low-dimensional MLPs to estimate the parameters as probability distributions: shape parameters are predicted by a single MLP under a Gaussian distribution, and pose parameters are estimated joint by joint with cascaded low-dimensional MLPs under the matrix Fisher distribution. The hand reconstruction branch is based on vertex regression, and its parameters are recovered from the hand vertices. Its encoder employs Lite-HRNet to yield multi-resolution feature branches. Because high-resolution, shallow features capture finer detail while low-resolution features offer better global perception, we use interpolation for pose pooling and fuse the high- and low-resolution features to combine these strengths. The decoder then applies a network of depthwise separable convolutions (DSConv) and upsampling to obtain the hand vertices. Shape parameters are estimated by an MLP from the hand vertices, and joint rotation parameters are solved from the vertex coordinates using inverse topology mathematics. Compared with existing methods, the proposed method substantially reduces parameter count and computation, with an overall parameter count of 6.12M and a computational load of 433M.
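The parameter savings from depthwise separable convolution mentioned above can be illustrated with a simple weight count. The sketch below is only an illustration under assumed layer sizes (a 3 x 3 kernel, 64 input channels, 128 output channels, no bias); these sizes and the function names are hypothetical and are not taken from the paper's network.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def dsconv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise separable convolution (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 128, 3
standard = conv_params(c_in, c_out, k)
separable = dsconv_params(c_in, c_out, k)
ratio = separable / standard  # algebraically equal to 1 / c_out + 1 / k**2
print(standard, separable, round(ratio, 3))
```

For these assumed sizes the separable layer needs roughly one ninth of the weights of the standard layer, since the ratio is dominated by the 1 / k**2 term when the output channel count is large.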
On the Human3.6M dataset, the body reconstruction branch achieves an MPJPE of 86.7 mm, outperforming the classical method HMR (88.0 mm) with only 11.6% of HMR's parameters. The mesh reconstructed by the hand branch achieves a PA-MPJPE of 10.8 mm on the FreiHAND dataset, surpassing regression-based full-body reconstruction methods such as ExPose and PIXIE with only 4.7% and 3.1% of their parameters, respectively. Finally, the method is deployed on mobile devices using Android Studio and PyTorch Android, reaching an inference time of 79.7 ms per frame (12.5 fps) on a Snapdragon 8 Gen 3, which meets the requirements of real-time inference applications.
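For readers unfamiliar with the two metrics quoted above, the following sketch shows how MPJPE and PA-MPJPE are commonly computed. It is a minimal illustration assuming NumPy and joints given as (J, 3) arrays in millimetres; the function names and the synthetic test data are hypothetical, and the paper's own evaluation code may differ in details such as root alignment.

```python
import numpy as np


def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between matching joints; inputs are (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Fit the scale, rotation and translation that best map pred onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the 3 x 3 cross-covariance gives the optimal rotation.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    fix = np.array([1.0, 1.0, d])
    rot = (u * fix) @ vt
    scale = (s * fix).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g


def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE measured after Procrustes alignment of pred to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)


# Synthetic example: a hypothetical 21-joint pose, then a scaled, rotated,
# shifted copy of it standing in for a prediction.
rng = np.random.default_rng(0)
gt = rng.normal(size=(21, 3)) * 100.0
theta = np.pi / 6
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred = 1.5 * gt @ rz + np.array([10.0, -5.0, 20.0])
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```

PA-MPJPE removes global scale, rotation and translation before measuring error, so in this synthetic case it drops to numerical zero while the unaligned MPJPE stays large.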