Multimodal Human Action Recognition Based on Heterogeneous Multi-Stream Network

Abstract: Human action recognition has many applications in fields such as human-computer interaction and video content retrieval, and is an important research direction in multimedia information processing. Most existing two-stream methods for action recognition apply the same convolutional network to both the RGB and optical-flow streams; they make little use of multimodal information, which easily leads to network redundancy and to misclassification of similar actions. In recent years, depth video has also been used increasingly for action recognition, but most methods focus only on the spatial information of the action in the depth video and ignore its temporal information. To address these problems, this paper proposes a multimodal action recognition method based on a heterogeneous multi-stream network. The method first obtains a temporal feature representation of the action from the depth video, namely depth optical flow; it then selects suitable heterogeneous networks for spatiotemporal feature extraction and classification; finally, it performs multimodal fusion of the recognition results from the RGB data, the optical flow extracted from RGB, the depth video, and the depth optical flow. Experiments on the large-scale, widely used NTU RGB+D action recognition dataset show that the proposed method outperforms existing state-of-the-art methods.
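The final fusion step described in the abstract can be sketched as a weighted late fusion of per-stream class scores. This is a minimal illustration only: the four stream names, the uniform default weights, and the example score vectors are assumptions for demonstration, not values or an implementation from the paper.

```python
import numpy as np

def fuse_scores(scores, weights=None):
    """Weighted late fusion: combine per-stream class-score vectors
    into one fused score vector and return the predicted class."""
    scores = np.stack(scores)          # shape: (n_streams, n_classes)
    if weights is None:
        weights = np.ones(len(scores))  # assumption: equal stream weights
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalize so fused scores stay a distribution
    fused = (weights[:, None] * scores).sum(axis=0)
    return int(np.argmax(fused)), fused

# Hypothetical softmax outputs of the four streams over 3 action classes
rgb        = np.array([0.5, 0.3, 0.2])   # RGB stream
rgb_flow   = np.array([0.2, 0.6, 0.2])   # optical flow from RGB
depth      = np.array([0.3, 0.4, 0.3])   # depth stream
depth_flow = np.array([0.1, 0.7, 0.2])   # depth optical flow

pred, fused = fuse_scores([rgb, rgb_flow, depth, depth_flow])
```

With equal weights this reduces to averaging the four score vectors; in practice the weights could be tuned per stream on a validation split.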

     
