Abstract:
Two difficulties in continuous sign language recognition are the redundant information in the spatio-temporal dimensions of sign language data and the alignment of that data with a given label sequence. We therefore propose a sign language sentence recognition model that combines an attention mechanism with connectionist temporal classification (CTC). The model extracts short-term spatio-temporal features from color video segments, depth video segments, and hand motion trajectories. To obtain long-term spatio-temporal features, the features of the three modalities are fused and weighted using spatial attention, then fed in temporal order into a bidirectional long short-term memory (BiLSTM) network for time-series modeling. Finally, a decoder network that integrates the attention mechanism and the CTC model is used end to end to recognize continuous sign language. The model was tested on a Chinese sign language dataset that we collected ourselves and achieved an accuracy of 0.943.
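
As an illustration only, and not the authors' implementation, the following minimal Python sketch shows two of the ideas named in the abstract under simplifying assumptions. The function names (`fuse_modalities`, `ctc_greedy_collapse`), the use of one scalar attention score per modality in place of the spatial attention described above, and the choice of label 0 as the blank symbol are all hypothetical. The first function weights three modality feature vectors by a softmax over their scores and sums them; the second applies the standard collapsing rule used with connectionist temporal classification, which merges repeated frame labels and then drops blanks, so that many frames can align with a shorter label sequence.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, scores):
    """Weighted sum of per-modality feature vectors.

    features: list of equal-length vectors (e.g. color, depth, trajectory).
    scores:   one unnormalized attention score per modality (assumed here;
              a simplified stand-in for the spatial attention in the text).
    """
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse a per-frame label path into an output sequence.

    Consecutive duplicates are merged first, then blanks are removed.
    A blank between two identical labels keeps them as separate outputs.
    """
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


if __name__ == "__main__":
    # Three toy 2-dimensional modality features with equal attention scores.
    fused = fuse_modalities([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
    print(fused)
    # Eight frames collapse to three sign labels.
    print(ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0]))
```

In this toy run, eight frame-level predictions reduce to a three-label sequence, which is the kind of frame-to-label alignment that the CTC component is meant to handle without frame-level annotation.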