HU Zhengping, WANG Yulu, ZHANG Qiming, et al. A deep metric learning skeleton-based one-shot human action recognition algorithm incorporating multidimensional perception and spatial disentanglement[J]. Journal of Signal Processing, 2025, 41(4): 683-693. DOI: 10.12466/xhcl.2025.04.009.

A Deep Metric Learning Skeleton-Based One-Shot Human Action Recognition Algorithm Incorporating Multidimensional Perception and Spatial Disentanglement

Skeleton-based human action recognition methods have gained significant attention because they simplify training by removing visual information unrelated to the action. However, collecting and annotating extensive skeleton data remains challenging. Skeleton-based One-Shot Action Recognition (SOAR) aims to identify human actions from only a single training sample, which enhances robots' capability to interact with humans by allowing them to respond effectively to new action categories. To tackle data scarcity in human activity classification with Convolutional Neural Network (CNN) encoders, we approached the SOAR problem through compact skeleton sequence representations and a Deep Metric Learning (DML) framework. We reexamined the modeling of skeleton dynamic images for transfer to novel activity categories using a self-attention transformer mechanism and spatial disentanglement constraints. This led to a deep metric learning skeleton-based algorithm for one-shot human action recognition that integrates multidimensional perception and spatial disentanglement. For data preprocessing, we transformed 3D skeleton sequence coordinates into compact image representations. For feature extraction, a backbone network projected the input into a low-dimensional feature space to capture essential action features. An Integrated Multilayer Perceptron and Transformer Embedding Encoder (MLP-TransEmbedder) was designed to learn the spatial-temporal dependencies of joint movements, enhancing the model's capacity to perceive spatiotemporal information and generate high-level multidimensional embedded features. To measure similarity between samples, we used a nearest neighbor search approach for distance metrics.
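The preprocessing step above — encoding a 3D skeleton sequence as a compact image — could be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the min-max normalization scheme and the joints-by-frames-by-coordinates layout are illustrative assumptions.

```python
import numpy as np

def skeleton_to_image(seq, eps=1e-8):
    """Encode a 3D skeleton sequence as a compact image.

    seq: array of shape (T, J, 3) -- T frames, J joints, (x, y, z).
    Returns a uint8 image of shape (J, T, 3): joints as rows, frames as
    columns, the three coordinate axes mapped to RGB channels.
    """
    # Min-max normalize each coordinate axis over the whole sequence,
    # then scale to [0, 255] so the sequence becomes a color image
    # that a CNN backbone can consume directly.
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    norm = (seq - lo) / (hi - lo + eps)
    return (norm * 255).astype(np.uint8).transpose(1, 0, 2)

# Example: a 64-frame sequence with 25 joints (the NTU RGB+D joint count).
rng = np.random.default_rng(0)
img = skeleton_to_image(rng.standard_normal((64, 25, 3)))
print(img.shape)  # (25, 64, 3)
```

Such an image representation lets a standard image backbone extract spatiotemporal structure: rows capture per-joint trajectories over time, columns capture per-frame poses.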
The model optimization process combined Multi-Similarity Loss (MSL), Triplet Margin Loss (TML), Cross-Entropy Loss (CEL), and Spatial Disentanglement Loss (SDL) to encourage sparser and more interpretable feature representations. MSL specifically addressed the challenge of distinguishing short-distance negative samples from long-distance positive samples in the embedding space, enabling better similarity mining between paired samples. TML separated samples by maximizing inter-class dispersion while enhancing intra-class compactness, thereby improving performance on metric learning tasks. CEL facilitated nearest neighbor search, leading to more accurate action classification, whereas SDL, through its full-rank space representation and rank-maximization constraints, promoted independent feature representations, thereby increasing the diversity of skeleton representations even in single-sample scenarios. CEL and SDL were integrated into the total loss, guiding the model to improve both classification accuracy and feature differentiation during training. Experiments on the publicly available NTU RGB+D 120 one-shot action recognition benchmark showed that our proposed algorithm outperformed Skeleton-DML by 3.8% when trained on the full category set and by 7.5% when trained on 40 categories. These results indicate that, despite data scarcity, our one-shot action classification algorithm successfully extracts spatiotemporally discriminative information from compact image representations of skeletal actions.
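The loss combination described above can be sketched with generic forms of three of the four terms (MSL is omitted for brevity). The SDL form shown here — a negative log-determinant of the batch Gram matrix as a rank-maximization surrogate — is an assumption for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def triplet_margin_loss(anchor, pos, neg, margin=0.2):
    # Pull the anchor toward the positive and push it past the
    # negative by at least `margin` (standard TML form).
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    return max(0.0, d_pos - d_neg + margin)

def cross_entropy_loss(logits, label):
    # Softmax cross-entropy, computed in log-space for stability.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def spatial_disentanglement_loss(embeddings, eps=1e-6):
    # Assumed rank-maximization surrogate: penalize a batch whose
    # Gram matrix is near-singular, encouraging linearly independent
    # (disentangled) feature directions.
    e = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    gram = e @ e.T
    sign, logdet = np.linalg.slogdet(gram + eps * np.eye(len(e)))
    return -logdet  # smaller when embeddings span a fuller-rank space

rng = np.random.default_rng(0)
a, p, n = rng.standard_normal((3, 16))
batch = rng.standard_normal((8, 16))
total = (triplet_margin_loss(a, p, n)
         + cross_entropy_loss(rng.standard_normal(5), 2)
         + spatial_disentanglement_loss(batch))
```

In practice the four terms would be weighted and summed into a single training objective, with the weights tuned on a validation split.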
By incorporating multidimensional perception and spatial disentanglement, our deep metric learning model for one-shot human action recognition effectively leverages the compact representations of skeletal sequences, optimizes both the embedding and classification processes, and ultimately enhances the accuracy and robustness of recognizing one-shot human actions.