HU Zhengping, WANG Xinyu, CHEN Daiping, et al. Skeleton fine-grained action recognition model with spatio-temporal feature calibration and correlation decoupling[J]. Journal of Signal Processing, 2024, 40(12): 2178-2192. DOI: 10.12466/xhcl.2024.12.007.

Skeleton Fine-Grained Action Recognition Model with Spatio-Temporal Feature Calibration and Correlation Decoupling

Video action recognition based on human skeleton data has developed rapidly because skeleton data are lightweight and provide a high-level representation of motion. However, skeleton extraction discards important object-interaction information, and traditional skeleton action recognition models struggle to capture subtle changes in action sequences, so recognition rates for fine-grained action categories remain insufficient. To address this challenge, we devised a skeleton fine-grained action recognition model with spatio-temporal feature calibration and correlation decoupling, which improves the ability of existing models to recognize fine-grained action categories. We combined a backbone network with contrastive learning to dynamically calibrate hidden-layer features, and decoupled the correlation of similar features at the classification layer to simplify the final classification. Specifically, a spatio-temporal feature calibration module was devised to dynamically calibrate easily misclassified samples in the feature space and to improve the model’s capability to characterize spatio-temporal information. This module decouples the features spatio-temporally to enrich the spatio-temporal information, and uses contrastive learning to dynamically discover and correct misclassified, ambiguous fine-grained samples in the feature space. Subsequently, a correlation decoupling module was incorporated to reduce the impact of feature similarity on the final classification. In its first stage, all feature samples are forced sufficiently far apart to achieve de-correlation. In its second stage, the de-correlated features are aggregated with their corresponding prototypes to simplify the final classification.
In addition, the proposed contrastive learning module participates in computation only during training, so it adds no computational burden at test time. To verify the effectiveness of the model, experiments were conducted on two large public skeleton action recognition datasets that are widely used in the field: NTU RGB+D and NTU RGB+D 120. These datasets contain 60 and 120 action categories, respectively, including fine-grained ones such as “drink water” and “eat meal”, and were used to evaluate the model’s capability to recognize fine-grained categories. First, extensive ablation experiments demonstrated the effectiveness of each part of the proposed method and showed that it adapts to a variety of GCN-based backbone networks with good generalizability. Second, visualization and analysis of the experimental results demonstrated the superiority of the proposed method more intuitively. Finally, the proposed method achieved recognition accuracies of 92.6% and 96.8% on the NTU RGB+D X-Sub and X-View benchmarks, respectively, and 89.2% and 90.7% on the NTU RGB+D 120 X-Sub and X-Set benchmarks, respectively, a significant improvement over mainstream skeleton action recognition models. These results demonstrate that the proposed method improves the model’s performance in recognizing fine-grained action categories.
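The abstract does not give the loss formulas, so the following is only an illustrative sketch of how a two-stage correlation decoupling objective of this kind could look. It assumes cosine similarity as the correlation measure, a squared-similarity penalty for the first stage, and a cosine-distance pull toward class prototypes for the second stage. The function names, the loss forms, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def decorrelation_loss(features):
    """Stage 1 (assumed form): mean squared cosine similarity over all
    pairs of distinct samples. Lower values mean the samples are more
    spread out (less correlated) in feature space."""
    z = l2_normalize(features)
    sim = z @ z.T                              # pairwise cosine similarities
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]     # drop self-similarities
    return float(np.mean(off_diag ** 2))


def prototype_aggregation_loss(features, labels, prototypes):
    """Stage 2 (assumed form): mean cosine distance between each sample
    and the prototype of its own class."""
    z = l2_normalize(features)
    p = l2_normalize(prototypes)
    cos = np.sum(z * p[labels], axis=1)        # similarity to own prototype
    return float(np.mean(1.0 - cos))


# Toy data: 8 samples, 16-dimensional features, 4 classes (all hypothetical).
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
# Here each prototype is simply the class mean; a real model might learn them.
prototypes = np.stack([features[labels == c].mean(axis=0) for c in range(4)])

stage1 = decorrelation_loss(features)
stage2 = prototype_aggregation_loss(features, labels, prototypes)
print(f"stage 1 (de-correlation): {stage1:.4f}")
print(f"stage 2 (prototype aggregation): {stage2:.4f}")
```

In a training loop, terms like these would typically be added to the classification loss with weighting coefficients. Like the contrastive module described in the abstract, they would only be evaluated during training.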