SUN Haoying, LI Shuyi, XI Zeyu, et al. Spatiotemporal enhancement of video captioning integrating a state space model and Transformer[J]. Journal of Signal Processing, 2025, 41(2): 279-289. DOI: 10.12466/xhcl.2025.02.007.

Spatiotemporal Enhancement of Video Captioning Integrating a State Space Model and Transformer

Video captioning aims to describe the content of videos in natural language, with broad applications in areas such as human-computer interaction, assistance for visually impaired individuals, and sports commentary. However, the complex spatiotemporal variations within videos make it challenging to generate accurate captions. Previous methods have attempted to improve caption quality by extracting spatiotemporal features and leveraging prior information, but they often struggle with joint spatiotemporal modeling, which can lead to inadequate visual information extraction and degrade the quality of generated captions. To address this challenge, we propose a novel model, ST2, which strengthens joint spatiotemporal modeling by incorporating Mamba, a recently popular state space model (SSM) known for its global receptive field and linear computational complexity. By combining Mamba with the Transformer framework, we introduce a Spatially enHanced SSM and Transformer (SH-ST) module that overcomes the receptive-field limitations of convolutional approaches while reducing computational complexity, thereby improving the model's ability to extract spatial information. To further strengthen temporal modeling, we combine Mamba's temporal scanning characteristics with the global modeling capabilities of the Transformer, yielding a Temporally enHanced SSM and Transformer (TH-ST) module. Specifically, the features generated by SH-ST are reordered so that Mamba can enhance the temporal relationships of these rearranged features through cross-scanning, after which the Transformer further strengthens temporal modeling. Experimental results validate the effectiveness of the SH-ST and TH-ST designs within our ST2 model, which achieves competitive results on the widely used video captioning datasets MSVD and MSR-VTT. Notably, our method surpasses state-of-the-art results by 6.9% and 2.6% in absolute CIDEr score on the MSVD and MSR-VTT datasets, respectively, and exceeds the baseline by 4.9% in absolute CIDEr score on MSVD.
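The cross-scanning step described above can be illustrated with a minimal sketch: a toy linear state-space recurrence is run over the frame-feature sequence in both the forward and reversed temporal orders, and the two passes are fused so every time step receives context from both directions. All function names, shapes, and the scalar state-transition parameters here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ssm_scan(x, a=0.9, b=0.1):
    """Toy linear SSM recurrence h_t = a*h_{t-1} + b*x_t, applied
    independently per feature channel (a, b are illustrative scalars;
    Mamba uses learned, input-dependent parameters)."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def cross_scan(x):
    """Scan the (T, D) frame-feature sequence forward and backward in
    time, then sum the two passes, so each step sees bidirectional
    temporal context before the Transformer stage."""
    fwd = ssm_scan(x)
    bwd = ssm_scan(x[::-1])[::-1]
    return fwd + bwd

# x: T frame features of dimension D (e.g., output of the spatial stage)
T, D = 8, 4
x = np.random.default_rng(0).standard_normal((T, D))
y = cross_scan(x)  # same shape as x, ready for a Transformer encoder
```

In the full model, the fused sequence would then be fed to a Transformer encoder for global temporal modeling; this sketch only shows the scanning and fusion pattern.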
