Abstract:
Human action recognition has many applications in human-computer interaction and video content retrieval, and it is an important research direction in multimedia information processing. Most existing two-stream action recognition methods use the same convolutional network to process both the RGB stream and the optical flow stream and make little use of multimodal information, which tends to cause network redundancy and misclassification of similar actions. In recent years, depth video has also been increasingly used for action recognition, but most methods focus only on the spatial information of the action in the depth video and ignore its temporal information. To address these problems, a multimodal action recognition method based on a heterogeneous multi-stream network is proposed. First, the temporal features of the action in the depth video, namely the depth optical flow, are extracted. Then, appropriate heterogeneous networks are selected for spatiotemporal feature extraction and action classification. Finally, multimodal fusion is performed over four modalities: RGB, the optical flow extracted from RGB, depth, and the optical flow extracted from depth. Experiments on the public NTU RGB+D dataset show that the proposed method outperforms existing state-of-the-art models in human action recognition.
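
As a rough illustration of the final fusion step, the sketch below combines per-class scores from the four streams by weighted averaging of softmax probabilities. The abstract does not state which fusion rule is used, so the score-averaging scheme, the function name `fuse_stream_scores`, the stream keys, and the toy logits are all assumptions made for illustration, not the proposed method's implementation.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_stream_scores(stream_logits, weights=None):
    """Weighted average of per-stream softmax scores (assumed fusion rule).

    stream_logits: dict mapping a stream name to its list of class logits.
    weights: optional dict of per-stream weights; defaults to equal weights.
    Returns the fused probabilities and the index of the top class.
    """
    names = list(stream_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}
    num_classes = len(stream_logits[names[0]])
    weight_sum = sum(weights[name] for name in names)

    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for c, p in enumerate(probs):
            fused[c] += weights[name] * p / weight_sum

    predicted = max(range(num_classes), key=lambda c: fused[c])
    return fused, predicted


# Hypothetical logits for three action classes from the four streams.
toy_logits = {
    "rgb": [2.0, 1.0, 0.1],
    "rgb_flow": [0.5, 2.5, 0.2],
    "depth": [1.0, 1.2, 0.3],
    "depth_flow": [0.2, 3.0, 0.1],
}
fused_scores, predicted_class = fuse_stream_scores(toy_logits)
```

In this toy input the RGB stream alone favors the first class, while the two optical flow streams and the depth stream favor the second, so the fused decision follows the majority of the evidence. Unequal `weights` could be passed to favor more reliable streams, which is again an assumption rather than something the abstract specifies.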