A Deep Metric Learning Skeleton-Based One-Shot Human Action Recognition Algorithm Incorporating Multidimensional Perception and Spatial Disentanglement
Abstract: Skeleton-based human action recognition methods have gained significant attention because they remove action-irrelevant visual information and thereby reduce training complexity. However, collecting and annotating large-scale skeleton action data remains challenging. Skeleton-based One-Shot Action Recognition (SOAR) aims to identify human actions from only a single training sample, enabling robots to respond to novel action categories and thus improving human-robot interaction. To tackle data scarcity in human activity classification with Convolutional Neural Network (CNN) encoders, we formulated the SOAR problem in terms of compact skeleton-sequence representations and a Deep Metric Learning (DML) paradigm, and revisited the modeling of skeleton dynamic images for transfer to novel activity categories using a self-attention Transformer mechanism and spatial disentanglement constraints. This led to a deep metric learning skeleton-based algorithm for one-shot human action recognition that integrates multidimensional perception and spatial disentanglement. For data preprocessing, 3D skeleton-sequence coordinates were transformed into compact image representations. For feature extraction, a backbone network projected the input into a low-dimensional feature space to capture primary action features. An integrated Multilayer Perceptron and Transformer Embedding Encoder (MLP-TransEmbedder) was designed to learn the spatial-temporal dependencies of joint movements, enhancing the model's capacity to perceive spatiotemporal information and to generate high-level multidimensional embedded features. Similarity between samples was measured by nearest-neighbor search in the embedding space. Model optimization combined Multi-Similarity Loss (MSL), Triplet Margin Loss (TML), Cross-Entropy Loss (CEL), and Spatial Disentanglement Loss (SDL) to encourage sparser and more interpretable feature representations. MSL addressed the challenge of distinguishing nearby negative samples from distant positive samples in the embedding space, improving similarity mining between sample pairs. TML separated samples by minimizing intra-class dispersion while maximizing inter-class separation, improving performance on the metric learning task. CEL facilitated nearest-neighbor search and led to more accurate action classification, whereas SDL, through a full-rank space representation and rank-maximization constraints, promoted the learning of independent feature representations, thereby increasing the diversity of skeleton representations even in single-sample scenarios. CEL and SDL were integrated with MSL and TML into the total loss, guiding the model to improve both classification accuracy and feature discrimination during training.
Experiments on the publicly available NTU RGB+D 120 one-shot action recognition benchmark showed that the proposed algorithm outperformed Skeleton-DML by 3.8% when trained on the full category set and by 7.5% when trained on 40 categories. The results indicate that, despite data scarcity, the proposed one-shot action classification algorithm successfully extracts discriminative spatiotemporal information from compact image representations of skeleton actions. By incorporating multidimensional perception and spatial disentanglement, the deep metric learning model effectively leverages compact skeleton-sequence representations, optimizes both the embedding and classification processes, and ultimately enhances the accuracy and robustness of one-shot human action recognition.
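The pipeline's first step maps a 3D skeleton sequence to a compact image. The abstract does not specify the exact encoding; the following is a minimal sketch of one common scheme in dynamic-image style methods such as Skeleton-DML, in which each coordinate axis is min-max normalized and written to one RGB channel, with joints as rows and frames as columns. The shapes and the 25-joint NTU layout are illustrative assumptions.

```python
import numpy as np

def skeleton_to_image(seq):
    """Map a 3D skeleton sequence to a compact RGB image.

    seq: float array of shape (T, J, 3) -- T frames, J joints,
    (x, y, z) coordinates. Each coordinate axis is min-max
    normalized independently and written to one color channel,
    yielding a (J, T, 3) uint8 image: joints as rows, frames
    as columns.
    """
    lo = seq.min(axis=(0, 1), keepdims=True)   # per-axis minimum
    hi = seq.max(axis=(0, 1), keepdims=True)   # per-axis maximum
    norm = (seq - lo) / (hi - lo + 1e-8)       # scale to [0, 1]
    img = (norm * 255).astype(np.uint8)        # quantize to 8 bits
    return img.transpose(1, 0, 2)              # (J, T, 3)

# Example: a 64-frame sequence over the 25-joint NTU skeleton layout
seq = np.random.randn(64, 25, 3).astype(np.float32)
print(skeleton_to_image(seq).shape)  # (25, 64, 3)
```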
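The abstract names an MLP-plus-Transformer embedding encoder (MLP-TransEmbedder) but not its internals. Below is an assumed PyTorch sketch: backbone feature maps are flattened into a token sequence, self-attention models the joint/time dependencies laid out along the image axes, and mean pooling with a linear head produces the metric embedding. All dimensions, layer counts, and head counts are placeholders, not the paper's values.

```python
import torch
import torch.nn as nn

class MLPTransEmbedder(nn.Module):
    """Sketch of an MLP + Transformer embedding encoder."""

    def __init__(self, in_dim=512, model_dim=256, embed_dim=128,
                 heads=4, layers=2):
        super().__init__()
        self.mlp = nn.Sequential(               # token-wise MLP projection
            nn.Linear(in_dim, model_dim), nn.GELU(),
            nn.Linear(model_dim, model_dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, layers)
        self.head = nn.Linear(model_dim, embed_dim)  # embedding projection

    def forward(self, feat):
        # feat: (B, C, H, W) backbone output; the H*W spatial positions
        # become tokens over which self-attention is computed.
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        x = self.transformer(self.mlp(tokens))    # spatio-temporal attention
        return self.head(x.mean(dim=1))           # pooled (B, embed_dim)

emb = MLPTransEmbedder()(torch.randn(8, 512, 7, 7))
print(emb.shape)  # torch.Size([8, 128])
```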
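The hybrid objective combines pair mining (MSL), margin separation (TML), classification (CEL), and a rank-based diversity term (SDL). The sketch below uses the real MSL and TML implementations from the pytorch-metric-learning library, and an assumed nuclear-norm surrogate for the spatial disentanglement loss, since the abstract does not give SDL's exact formulation; the loss weights are illustrative placeholders.

```python
import torch
import torch.nn.functional as F
from pytorch_metric_learning import losses  # pip install pytorch-metric-learning

msl = losses.MultiSimilarityLoss()  # mines informative positive/negative pairs
tml = losses.TripletMarginLoss()    # enforces an anchor/positive/negative margin

def sdl(emb):
    # Assumed rank-maximization surrogate for the Spatial Disentanglement
    # Loss: maximizing the nuclear norm (sum of singular values) of the
    # L2-normalized batch embedding matrix pushes it toward full rank,
    # i.e. toward more independent, diverse feature directions.
    z = F.normalize(emb, dim=1)
    return -torch.linalg.svdvals(z).sum() / z.shape[0]

def total_loss(emb, logits, labels, w=(1.0, 1.0, 1.0, 0.1)):
    # Weighted hybrid DML objective; the weights w are placeholders.
    return (w[0] * msl(emb, labels) + w[1] * tml(emb, labels)
            + w[2] * F.cross_entropy(logits, labels) + w[3] * sdl(emb))

emb = torch.randn(16, 128, requires_grad=True)
logits, labels = torch.randn(16, 100), torch.randint(0, 100, (16,))
print(total_loss(emb, logits, labels))
```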
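At test time, recognizing a novel class reduces to the nearest-neighbor search described above: each novel class contributes exactly one reference embedding, and a query is assigned the label of its most similar reference. A minimal sketch follows; the use of cosine similarity is an assumption, and any embedding-space metric slots in the same way.

```python
import torch
import torch.nn.functional as F

def one_shot_classify(query_emb, support_emb, support_labels):
    """Nearest-neighbor matching in the embedding space.

    query_emb:   (Q, D) embeddings of test clips
    support_emb: (N, D) one embedding per novel class (one shot each)
    Returns the predicted label of each query via cosine similarity
    to its nearest support embedding.
    """
    q = F.normalize(query_emb, dim=1)
    s = F.normalize(support_emb, dim=1)
    sim = q @ s.t()                        # (Q, N) cosine similarities
    return support_labels[sim.argmax(dim=1)]

support_labels = torch.arange(20)          # e.g., 20 novel classes
preds = one_shot_classify(torch.randn(5, 128),
                          torch.randn(20, 128), support_labels)
print(preds.shape)  # torch.Size([5])
```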