GUO Feiyang, ZHANG Tianqi, SHEN Xiwen, et al. Audio-visual dual-dominant speech enhancement method with cross-modal bidirectional attention[J]. Journal of Signal Processing, 2025, 41(9): 1513-1524. DOI: 10.12466/xhcl.2025.09.005.

Audio-Visual Dual-Dominant Speech Enhancement Method with Cross-Modal Bidirectional Attention

  • To address the problem of audio-modality dominance and underutilization of the auxiliary video modality in audio-visual multimodal speech enhancement, this paper proposes an encoder-decoder architecture with cooperative audio- and video-dominant branches. At the encoding stage, the video-dominant branch applies random-dimensional audio masking to simulate audio feature deficiencies under low signal-to-noise ratio (SNR) conditions, using video features to guide the prediction and reconstruction of the missing audio features and thereby strengthening the auxiliary role of the video modality. The intermediate layer adopts a cross-modal bidirectional cross-attention mechanism to model the dynamic complementary relationships between the audio and visual modalities. The decoding layer integrates the dual-branch features through learnable dynamic weighting factors to achieve efficient cross-modal fusion. Experiments on the GRID dataset demonstrate that the proposed method significantly improves speech enhancement performance in low-SNR scenarios, achieving gains of 0.123~0.156 in Perceptual Evaluation of Speech Quality (PESQ) and 1.78%~2.21% in Short-Time Objective Intelligibility (STOI), outperforming mainstream models in objective evaluations. Ablation studies further confirm the effectiveness of the bidirectional attention architecture and the video-guided masking mechanism, showing that the approach breaks away from the traditional single-modality-dominant interaction paradigm and enables collaborative cross-modal feature enhancement and robust representation learning.
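The cross-modal bidirectional cross-attention and learnable weighted fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes audio and video feature sequences of equal length and dimension, and the function names (`cross_attention`, `bidirectional_fusion`) and the sigmoid-gated scalar fusion weight are illustrative stand-ins for the paper's learnable dynamic weighting factors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d_k):
    # Scaled dot-product attention: each query frame attends to all
    # key_value frames of the other modality.
    scores = query @ key_value.T / np.sqrt(d_k)   # (T_q, T_kv)
    return softmax(scores, axis=-1) @ key_value   # (T_q, d)

def bidirectional_fusion(audio, video, alpha_logit=0.0):
    # Bidirectional cross-attention: audio attends to video, and
    # video attends to audio, modeling complementary relations both ways.
    d = audio.shape[-1]
    a2v = cross_attention(audio, video, d)  # audio-dominant direction
    v2a = cross_attention(video, audio, d)  # video-dominant direction
    # A learnable scalar (sigmoid-gated here for simplicity) weights the
    # two branches; in the paper this weighting is trained dynamically.
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))
    return alpha * a2v + (1.0 - alpha) * v2a

rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 64))   # 10 frames of 64-dim audio features
video = rng.standard_normal((10, 64))   # 10 frames of 64-dim video features
fused = bidirectional_fusion(audio, video)
print(fused.shape)  # (10, 64)
```

With `alpha_logit = 0`, the two attention directions are weighted equally; during training the logit would be learned so the network can shift emphasis toward the video-dominant branch under low-SNR conditions.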
