Space-Time-Correlated Transformer for Skeleton-Based Action Recognition

Abstract: Most mainstream skeleton-based action recognition methods adopt a joint stream, a bone stream, and their corresponding motion streams as a multi-stream network trained separately, which results in high training costs. In addition, the modeling of complex spatio-temporal dependencies is neglected during feature extraction, and large-kernel convolutions are adopted for temporal information exchange, leading to the aggregation of a large amount of redundant information. A space-time-correlated Transformer method for skeleton-based action recognition was proposed to address these problems. First, a motion fusion module was constructed that takes the joint and bone streams as a two-stream input and fuses their respective motion information at the feature level, removing the cost of training motion streams separately. Second, a shift Transformer module was proposed, which exploits the ability of the temporal shift operation to mix spatio-temporal information so that the Transformer can capture short-term spatio-temporal dependencies at low cost. Then, a multi-scale temporal convolution was designed for long-term information exchange in the temporal domain. Finally, the final classification prediction was obtained by fusing the scores of the two streams. Experiments on the large-scale NTU RGB+D and NTU RGB+D 120 datasets showed that the model achieved recognition accuracies of 91.5% and 96.3% under the X-Sub and X-View evaluation protocols of NTU RGB+D, respectively, and 87.2% and 89.3% under the X-Sub and X-Set protocols of NTU RGB+D 120, respectively. The recognition accuracy of the proposed method was a clear improvement over mainstream skeleton-based action recognition methods, verifying the effectiveness and generality of the model.
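The abstract does not reproduce the motion fusion module's internals; the sketch below only illustrates feature-level motion fusion in PyTorch, under the assumption that motion features are obtained by frame-to-frame differencing of a stream's features and merged back with a 1x1 convolution. The `MotionFusion` name and all parameters are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionFusion(nn.Module):
    """Illustrative feature-level motion fusion: derive motion features by
    temporal differencing and merge them back, so that separate motion
    streams do not have to be trained (assumed design, not the paper's)."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution to fuse the concatenated static + motion features
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):  # x: (N, C, T, V) joint- or bone-stream features
        # Frame-to-frame difference approximates motion; pad to keep length T
        motion = F.pad(x[:, :, 1:] - x[:, :, :-1], (0, 0, 0, 1))
        return self.fuse(torch.cat([x, motion], dim=1))
```

Applied once per stream, a module of this kind keeps the network a two-stream model while still exposing motion cues to later layers.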

     

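The shift Transformer module relies on a temporal shift operation to mix information across frames before attention is applied. Below is a minimal sketch of such a shift, in the spirit of TSM, assuming features of shape (N, C, T, V) and an assumed `fold_div` hyperparameter; the shifted features would then pass through an otherwise standard self-attention block over the V joints:

```python
import torch

def temporal_shift(x, fold_div=8):
    """Illustrative TSM-style shift: move one channel fold one frame toward
    the past and one toward the future, so per-frame spatial attention on
    the result also sees neighboring frames (fold_div is an assumption)."""
    n, c, t, v = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :fold, :-1] = x[:, :fold, 1:]                   # future -> current
    out[:, fold:2 * fold, 1:] = x[:, fold:2 * fold, :-1]   # past -> current
    out[:, 2 * fold:] = x[:, 2 * fold:]                    # rest unchanged
    return out
```

Because the shift itself is parameter-free, the short-term spatio-temporal mixing comes at essentially no extra cost, which matches the low-cost claim in the abstract.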

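For the multi-scale temporal convolution, a common realization replaces one large-kernel temporal convolution with parallel dilated branches whose outputs are concatenated; the sketch below follows that pattern, with kernel size and dilations chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Illustrative multi-scale temporal convolution: parallel dilated
    temporal branches cover long-range context without one large kernel
    aggregating redundant information (hyperparameters are assumptions)."""
    def __init__(self, channels, kernel_size=5, dilations=(1, 2, 3, 4)):
        super().__init__()
        branch_c = channels // len(dilations)
        self.branches = nn.ModuleList()
        for d in dilations:
            pad = (kernel_size - 1) // 2 * d  # keep temporal length fixed
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, branch_c, kernel_size=1),
                nn.Conv2d(branch_c, branch_c, kernel_size=(kernel_size, 1),
                          padding=(pad, 0), dilation=(d, 1)),
            ))

    def forward(self, x):  # x: (N, C, T, V)
        return torch.cat([b(x) for b in self.branches], dim=1)
```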
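Late fusion of the two streams is typically a weighted average of per-stream class probabilities; a minimal sketch, with an `alpha` weight that the abstract does not specify:

```python
import torch

def fuse_scores(joint_logits, bone_logits, alpha=0.5):
    """Illustrative two-stream late fusion: weight and sum the softmax
    scores of the joint and bone streams (alpha is an assumption)."""
    scores = (alpha * joint_logits.softmax(dim=-1)
              + (1 - alpha) * bone_logits.softmax(dim=-1))
    return scores.argmax(dim=-1)  # predicted class index per sample
```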