Skeleton Fine-Grained Action Recognition Model with Spatio-Temporal Feature Calibration and Correlation Decoupling
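Abstract (translated from the Chinese original): Skeleton data discard some important information, which makes it hard for skeleton-based action recognition models to capture subtle changes in action sequences and leads to low recognition rates on fine-grained action categories. To address this problem, this paper proposes a skeleton fine-grained action recognition model with spatio-temporal feature calibration and correlation decoupling. A spatio-temporal feature calibration module and a correlation decoupling module are introduced at the intermediate layers and the output layer of the backbone network, respectively, to improve the model’s ability to recognize fine-grained action categories. First, to dynamically calibrate easily misclassified samples in the feature space while improving the model’s ability to represent spatio-temporal information, the spatio-temporal feature calibration module decouples features in space and time to enrich the spatio-temporal information in the feature space, and uses contrastive learning to dynamically discover and correct misclassified fine-grained ambiguous samples. Next, to reduce the effect of feature similarity on the final classification, the correlation decoupling module forces all feature samples away from one another in its first stage to achieve de-correlation, and in its second stage aggregates the de-correlated features with their corresponding class prototypes, which makes the final classification easier. The contrastive learning modules take part in computation only during training and add no computational burden at test time. To verify the model’s effectiveness, experiments were conducted on large public skeleton action recognition datasets. The model reaches recognition accuracies of 92.6% and 96.8% on the X-Sub and X-View benchmarks of NTU RGB+D, and 89.2% and 90.7% on the X-Sub and X-Set benchmarks of NTU RGB+D 120, a clear improvement over mainstream skeleton action recognition models. The experimental results show that the method improves the model’s ability to recognize fine-grained action categories.
Abstract: Video action recognition based on human skeleton data has developed rapidly owing to the lightweight, high-level nature of skeleton data. However, skeleton data lose important object interaction information during extraction. In addition, traditional skeleton action recognition models have considerable difficulty capturing subtle changes in action sequences, which results in low recognition rates for fine-grained action categories. To address this challenge, we devised a skeleton fine-grained action recognition model with spatio-temporal feature calibration and correlation decoupling, which improves the ability of existing models to recognize fine-grained action categories. We combined a backbone network with contrastive learning to dynamically calibrate hidden-layer features, and applied correlation decoupling to similar features at the classification layer to simplify the final classification. Specifically, to dynamically calibrate easily misclassified samples in the feature space and improve the model’s capability to characterize spatio-temporal information, a spatio-temporal feature calibration module was devised. 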
This module spatio-temporally decouples the features to enrich the spatio-temporal information, and uses contrastive learning to dynamically discover and correct misclassified fine-grained ambiguous samples in the feature space. Subsequently, to reduce the impact of feature similarity on the final classification, a correlation decoupling module was incorporated. In its first stage, this module forces all feature samples away from one another to achieve de-correlation. In its second stage, the de-correlated features are aggregated with their corresponding class prototypes to simplify the final classification. The contrastive learning modules take part in computation only during model training; therefore, no additional computational burden is introduced in the testing phase. To verify the effectiveness of the model, experiments were conducted on two large publicly available skeleton action recognition datasets, NTU RGB+D and NTU RGB+D 120, which are widely used in the skeleton action recognition field. These datasets contain 60 and 120 fine-grained categories, respectively, such as “drink water” and “eat meal”, and are used here to evaluate the model’s capability to recognize fine-grained categories. First, we conducted extensive ablation experiments to verify the effectiveness of each part of the proposed method and to show that it adapts to a variety of GCN-based backbone networks and generalizes well. Second, the advantage of the proposed method was demonstrated more intuitively by visualizing and analyzing the experimental results. Finally, the proposed method achieved recognition accuracies of 92.6% and 96.8% on the NTU RGB+D X-Sub and X-View benchmarks, respectively, and 89.2% and 90.7% on the NTU RGB+D 120 X-Sub and X-Set benchmarks, respectively. This is a clear improvement over mainstream skeleton action recognition models. 
Therefore, the experimental results demonstrate the capability of the proposed method to improve the model’s performance in recognizing fine-grained action categories.
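The abstract describes the correlation decoupling module only at a high level: a first stage that pushes all feature samples apart, and a second stage that aggregates the de-correlated features with their class prototypes. The following is a minimal sketch of that two-stage idea, not the authors' formulation. It assumes squared cosine similarity as the de-correlation penalty and a temperature-scaled softmax over class prototypes for the aggregation step; the function names, the `temperature` parameter, and both loss forms are illustrative assumptions.

```python
import math

def normalize(v):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def decorrelation_loss(features):
    """Stage 1 (assumed form): mean squared cosine similarity over all
    distinct pairs of samples. Needs at least two samples. The value is
    zero when every pair of features is orthogonal."""
    feats = [normalize(f) for f in features]
    pairs = [(i, j) for i in range(len(feats)) for j in range(i + 1, len(feats))]
    return sum(dot(feats[i], feats[j]) ** 2 for i, j in pairs) / len(pairs)

def prototype_loss(features, labels, prototypes, temperature=0.1):
    """Stage 2 (assumed form): cross-entropy over cosine similarities
    between each feature and every class prototype, so that a feature is
    pulled toward the prototype of its own class."""
    protos = [normalize(p) for p in prototypes]
    total = 0.0
    for feature, label in zip(features, labels):
        f = normalize(feature)
        logits = [dot(f, p) / temperature for p in protos]
        # Numerically stable log-sum-exp over the class logits.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[label]
    return total / len(features)

# Toy example: two 2-D features, one per class, with hypothetical prototypes.
toy_features = [[1.0, 0.0], [0.0, 1.0]]
toy_prototypes = [[1.0, 0.0], [0.0, 1.0]]
stage1 = decorrelation_loss(toy_features)
stage2 = prototype_loss(toy_features, [0, 1], toy_prototypes)
```

Minimizing the first term drives pairwise similarities toward zero, and minimizing the second pulls each feature toward its own class prototype. How the two stages are weighted or scheduled is not stated in the abstract, so the sketch leaves them as separate functions.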