Multi-Modal Sign Language Recognition Fusing an Attention Mechanism and Connectionist Temporal Classification

Modelling Long-Term Temporal Relationship and Spatial Attention for Multi-Modal Sign Language Recognition

  • Abstract: Continuous sign language recognition is challenging because sign language data contain redundant information in the spatio-temporal dimension and must be aligned with a given label sequence. We therefore propose a continuous sign language recognition model that fuses an attention mechanism with connectionist temporal classification (CTC). The model extracts short-term spatio-temporal features from the color and depth video clips in the sign language data, together with hand motion trajectory features. The features of the three modalities are fused, weighted by spatial attention, and fed in temporal order into a bidirectional long short-term memory (BiLSTM) network for sequence modeling, yielding long-term spatio-temporal features. Finally, a decoder network that integrates the attention mechanism with the CTC model recognizes continuous sign language accurately in an end-to-end manner. Tested on a Chinese sign language dataset that we collected ourselves, the model achieves an accuracy of 0.935.

     

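To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: pre-extracted clip features from the three modalities (color, depth, trajectory) are fused and reweighted by spatial attention, passed through a BiLSTM for long-term temporal modeling, and trained with a CTC loss. Every name and dimension here (FEAT_DIM, NUM_CLASSES, the layer shapes) is an illustrative assumption rather than the authors' configuration, the short-term feature extractors are stubbed out with random tensors, and the attention-based decoder branch that the paper fuses with CTC is omitted for brevity.

```python
# Minimal sketch, assuming pre-extracted per-clip features for each modality.
# FEAT_DIM, NUM_CLASSES, and all layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 256     # assumed per-modality clip feature size
NUM_CLASSES = 100  # assumed gloss vocabulary size (index 0 is the CTC blank)

class SpatialAttentionFusion(nn.Module):
    """Fuse color, depth, and trajectory features, then reweight them."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(3 * dim, dim)                  # modality fusion
        self.attn = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, rgb, depth, traj):
        fused = self.proj(torch.cat([rgb, depth, traj], dim=-1))
        return fused * self.attn(fused)      # element-wise attention weights

class SignRecognizer(nn.Module):
    def __init__(self, dim=FEAT_DIM, num_classes=NUM_CLASSES):
        super().__init__()
        self.fusion = SpatialAttentionFusion(dim)
        self.bilstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * dim, num_classes + 1)      # +1 for blank

    def forward(self, rgb, depth, traj):
        x = self.fusion(rgb, depth, traj)          # (B, T, dim)
        x, _ = self.bilstm(x)                      # long-term temporal context
        return self.head(x).log_softmax(dim=-1)    # (B, T, num_classes + 1)

# One training step with CTC loss on random stand-in features and labels.
model = SignRecognizer()
B, T, S = 2, 20, 5                                 # batch, clips, gloss length
feats = [torch.randn(B, T, FEAT_DIM) for _ in range(3)]
log_probs = model(*feats).transpose(0, 1)          # CTCLoss wants (T, B, C)
targets = torch.randint(1, NUM_CLASSES + 1, (B, S))
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((B,), T), torch.full((B,), S))
loss.backward()
print(f"CTC loss: {loss.item():.3f}")
```

A full implementation in the spirit of the paper would pair this CTC branch with an attention-based decoder and combine the two during decoding; the sketch keeps only the CTC branch to stay short.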

     
