3D Human Pose Reconstruction from Millimeter-Wave Radar Based on a Spatial-Temporal Transformer
Abstract: Deep learning makes it possible to accurately extract human motion features from the scattered signals captured by millimeter-wave (mmWave) radar and to reconstruct 3D human poses. However, current mmWave radar pose reconstruction methods often adopt a single-stage strategy that maps radar images directly to 3D joint coordinates. This cross-domain, cross-level mapping poses challenges for the network in reconstruction accuracy, depth-information representation, and pose coherence. To address this problem, this paper proposes a multi-stage mmWave radar 3D human pose reconstruction model based on a spatial-temporal Transformer, termed the Spatial-Temporal Pose Reconstruction Transformer (STPRT), which improves reconstruction accuracy through a two-stage strategy. In the first stage, parallel multi-resolution subnetworks extract multi-scale 2D joint information and spatial position features from the horizontal and vertical radar images and fuse them, after which fully connected layers generate the 2D human pose coordinates. In the second stage, the spatial-temporal Transformer uses a spatial attention module to encode the 2D joint coordinates of each frame into high-dimensional spatial features, while a temporal attention module captures the temporal evolution of the pose features across the frame sequence; together they enhance depth perception and spatial accuracy across poses and lift the 2D poses to 3D poses. In addition, an exponential moving average (EMA) strategy is introduced during training to adjust the gradient descent, improving the accuracy and coherence of the overall mapping. Experiments on the public mmWave radar dataset RFSkeleton3D show that, compared with the existing mm-Pose and RF-based pose machine (RPM) models, the proposed model reduces the mean joint position error to 7.3 cm while using fewer parameters.
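To make the reported error metric concrete, the following is a minimal sketch of how an average joint position error could be computed. It assumes the metric is the standard mean per-joint position error (the Euclidean distance between predicted and ground-truth 3D joints, averaged over all joints and frames); the function name `mpjpe`, the data layout, and the sample coordinates are illustrative assumptions and are not taken from the paper.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error.

    pred and target are sequences of frames; each frame is a sequence of
    (x, y, z) joint coordinates. The result is in the same unit as the input.
    Assumed definition: the average Euclidean distance over every joint of
    every frame.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of frames")
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("each frame must contain the same number of joints")
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)  # Euclidean distance between two 3D points
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


# Hypothetical data: two frames with two joints each, coordinates in metres.
target = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
pred = [
    [(0.03, 0.04, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.1)],
]

error_cm = 100.0 * mpjpe(pred, target)  # convert metres to centimetres
print(f"mean joint position error: {error_cm:.2f} cm")
```

In this toy example the four per-joint distances are 5 cm, 0 cm, 0 cm, and 10 cm, so the average is 3.75 cm. A figure such as the 7.3 cm reported above would be obtained in the same way over a full test set, provided the paper uses this definition.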