Spatiotemporal Enhancement of Video Captioning Integrating a State Space Model and Transformer

Abstract: Video captioning aims to describe the content of videos in natural language and has broad applications in areas such as human-computer interaction, assistance for visually impaired people, and sports commentary. However, the complex spatiotemporal variations within videos make it difficult to generate accurate captions. Previous methods improve caption quality by extracting spatiotemporal features or exploiting prior information, but they still fall short in joint spatiotemporal modeling, which can lead to insufficient visual information extraction and degrade the generated captions. To address this problem, we propose a novel SpatioTemporal-enhanced State space model and Transformer (ST2) model, which strengthens joint spatiotemporal modeling by incorporating Mamba, a recently popular state space model (SSM) with a global receptive field and linear computational complexity. First, by combining Mamba with the Transformer in parallel, we propose a Spatially enHanced SSM and Transformer module (SH-ST), which overcomes the limited receptive field of convolutions and reduces computational complexity while enhancing the model's ability to extract spatial information. Then, to strengthen temporal modeling, we exploit Mamba's temporal scanning characteristics together with the global modeling capability of the Transformer and propose a Temporally enHanced SSM and Transformer module (TH-ST). Specifically, the features produced by SH-ST are reordered so that Mamba can enhance the temporal relationships of the rearranged features through cross-scanning, after which a Transformer further strengthens temporal modeling. Experimental results validate the effectiveness of the SH-ST and TH-ST designs in our ST2 model and show competitive performance on the widely used video captioning datasets MSVD and MSR-VTT. Notably, our method surpasses state-of-the-art results by 6.9% and 2.6% in absolute CIDEr score on MSVD and MSR-VTT, respectively, and exceeds the baseline by 4.9% in absolute CIDEr score on MSVD.
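The abstract outlines the overall data flow of ST2: an SH-ST block fuses a Mamba (SSM) branch and a Transformer branch in parallel over the spatial tokens of each frame, and a TH-ST block reorders the resulting features and applies a cross-scan with the SSM followed by a Transformer along the temporal dimension. The PyTorch sketch below is only a minimal illustration of that flow under stated assumptions: SimpleSSM is a plain linear state-space recurrence standing in for the actual Mamba selective SSM, and the module names (SHST, THST, ST2Sketch), the additive branch fusion, the patch pooling, and all tensor shapes are hypothetical choices for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the ST2 data flow described in the abstract.
# Assumptions (not from the paper): SimpleSSM is a simple diagonal linear
# state-space recurrence used as a stand-in for Mamba; branch outputs are
# fused by addition; frame features are shaped (batch, frames, patches, dim).
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Minimal linear state-space scan: h_t = a * h_{t-1} + B u_t, y_t = C h_t.

    Stand-in for Mamba; the real model uses an input-dependent (selective) SSM.
    """
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.a = nn.Parameter(torch.full((state,), -0.5))  # diagonal state transition (pre-sigmoid)
        self.B = nn.Linear(dim, state, bias=False)
        self.C = nn.Linear(state, dim, bias=False)

    def forward(self, u: torch.Tensor) -> torch.Tensor:    # u: (batch, length, dim)
        h = u.new_zeros(u.size(0), self.a.numel())
        decay = torch.sigmoid(self.a)                       # keep the recurrence stable
        ys = []
        for t in range(u.size(1)):
            h = decay * h + self.B(u[:, t])
            ys.append(self.C(h))
        return torch.stack(ys, dim=1)


class SHST(nn.Module):
    """Spatially enhanced block: SSM branch and Transformer branch in parallel over patch tokens."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.ssm = SimpleSSM(dim)
        self.attn = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch*frames, patches, dim)
        return self.ssm(x) + self.attn(x)                    # parallel fusion by addition (assumed)


class THST(nn.Module):
    """Temporally enhanced block: cross-scan the frame sequence with the SSM, then a Transformer."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.ssm_fwd = SimpleSSM(dim)
        self.ssm_bwd = SimpleSSM(dim)                        # reverse direction of the cross-scan
        self.attn = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch, frames, dim)
        scanned = self.ssm_fwd(x) + self.ssm_bwd(x.flip(1)).flip(1)
        return self.attn(scanned)                            # Transformer further models global temporal relations


class ST2Sketch(nn.Module):
    """End-to-end flow: SH-ST over patches within each frame, reorder/pool, then TH-ST over frames."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.spatial = SHST(dim)
        self.temporal = THST(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # feats: (batch, frames, patches, dim)
        b, t, p, d = feats.shape
        x = self.spatial(feats.reshape(b * t, p, d))          # spatial enhancement within each frame
        x = x.reshape(b, t, p, d).mean(dim=2)                 # reorder patch tokens into a per-frame sequence
        return self.temporal(x)                                # temporal enhancement across frames


if __name__ == "__main__":
    video_feats = torch.randn(2, 8, 49, 256)                  # 2 clips, 8 frames, 7x7 patches, 256-d features
    print(ST2Sketch()(video_feats).shape)                      # torch.Size([2, 8, 256])
```

The sketch only mirrors the ordering of operations stated in the abstract (spatial enhancement, feature reordering, cross-scanning, then Transformer); the decoder that generates the caption text and all training details are omitted.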

     
