Few-shot Classification via Spatial-Temporal Joint Alignment in the Video Action Skeleton Description Space
Abstract: Compared with human action recognition on large datasets, few-shot action recognition aims to learn discriminative features from only a few labeled samples so that novel action categories can be recognized and unlabeled actions classified quickly. Action recognition based on few-shot learning has therefore attracted increasing interest and is widely studied. However, in methods built on RGB video feature descriptions, the action information is easily confounded by task-irrelevant background, brightness, and color changes. To address the difficulty of labeling action samples and the poor environmental adaptability and high dimensionality of RGB video data, this study combines skeleton description data, which represents information efficiently and is highly interpretable, with few-shot learning, and proposes a few-shot classification algorithm based on spatial-temporal joint alignment in the video action skeleton description space. The model follows the idea of the prototypical network (ProtoNet): the raw input is mapped into an embedding space to compute prototype representations, and query samples are predicted with a distance metric. For feature extraction from the skeleton sequence, a Spatial-Temporal Joint Attention Graph Convolutional Network (STJA-GCN) is designed as the feature-encoding backbone. A spatial-temporal graph is first constructed for the input skeleton sequence, and multilevel spatial-temporal graph convolution with spatial-temporal joint attention activation is then applied to obtain the corresponding high-level embedded features. The Spatial-Temporal Joint Attention (ST-JointAtt) module weights the importance of joints and bones at different action stages along both the spatial and temporal dimensions and adaptively focuses on key action information, strengthening the model's ability to extract discriminative features. For distance measurement, the Euclidean distance between the query and support skeleton graphs is obtained through graph matching; the Dynamic Time Warping (DTW) algorithm then simulates stretching and shrinking of the time series, dynamically plans the optimal matching between the two action sequences, and accumulates the distances of the matched skeleton-graph pairs, thereby enhancing the alignment of spatial-temporal features. Classification is finally performed by searching for the nearest accumulated distance. Experiments on three skeleton benchmarks, NTU-T, NTU-S, and Kinetics, show that the proposed algorithm makes full use of human skeleton information and improves the matching accuracy of few-shot action recognition.
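To make the metric stage of the abstract concrete, the following is a minimal Python sketch of DTW-based distance accumulation over embedded skeleton sequences followed by nearest-distance prediction. It assumes that frame-level skeleton-graph embeddings have already been produced by the STJA-GCN backbone; the function and variable names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def frame_distance(q_frame, s_frame):
    # Euclidean distance between a query and a support skeleton-graph embedding
    return np.linalg.norm(q_frame - s_frame)

def dtw_distance(query_seq, support_seq):
    """Accumulated distance between two embedded skeleton sequences,
    aligned by dynamic time warping (dynamic programming)."""
    t_q, t_s = len(query_seq), len(support_seq)
    acc = np.full((t_q + 1, t_s + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t_q + 1):
        for j in range(1, t_s + 1):
            cost = frame_distance(query_seq[i - 1], support_seq[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j],      # stretch the query
                                   acc[i, j - 1],      # stretch the support
                                   acc[i - 1, j - 1])  # match frame to frame
    return acc[t_q, t_s]

def classify_query(query_seq, class_prototypes):
    """Nearest-prototype prediction: each prototype is itself an embedded
    sequence (here assumed to be a frame-wise aggregation of the support
    embeddings of one class, which is an assumption, not the paper's spec)."""
    dists = {label: dtw_distance(query_seq, proto)
             for label, proto in class_prototypes.items()}
    return min(dists, key=dists.get)
```

In a 5-way 1-shot episode, `class_prototypes` would simply hold the embedded support sequence of each class; with more shots, a frame-wise mean or another aggregation could serve as the prototype, although the exact aggregation is not specified in the abstract.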