GUO Feiyang, ZHANG Tianqi, SHEN Xiwen, et al. Audio-visual dual-dominant speech enhancement method with cross-modal bidirectional attention[J]. Journal of Signal Processing, 2025, 41(9): 1513-1524. DOI: 10.12466/xhcl.2025.09.005.

Audio-Visual Dual-Dominant Speech Enhancement Method with Cross-Modal Bidirectional Attention

  • To address the problem of audio-modality dominance and underutilization of the auxiliary video modality in audio-visual multimodal speech enhancement, this paper proposes an encoder-decoder architecture with cooperative audio- and video-dominant branches. At the encoding stage, the video-dominant branch applies random-dimensional audio masking to simulate audio feature deficiencies under low signal-to-noise ratio (SNR) conditions, using video features to guide the prediction and reconstruction of the missing audio features and thereby strengthening the auxiliary role of the video modality. The intermediate layer adopts a cross-modal bidirectional cross-attention mechanism to model the dynamic complementary relationships between the audio and visual modalities. The decoding layer integrates the dual-branch features through learnable dynamic weighting factors to achieve efficient cross-modal fusion. Experiments on the GRID dataset demonstrate that the proposed method significantly improves speech enhancement performance in low-SNR scenarios, achieving gains of 0.123~0.156 in Perceptual Evaluation of Speech Quality (PESQ) and 1.78%~2.21% in Short-Time Objective Intelligibility (STOI), outperforming mainstream models in objective evaluations. Ablation studies further confirm the effectiveness of the bidirectional attention architecture and the video-guided masking mechanism, showing that the approach breaks away from the traditional single-modality-dominant interaction paradigm and enables collaborative cross-modal feature enhancement and robust representation learning.
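The cross-modal bidirectional cross-attention and learnable weighted fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes audio and video feature sequences of equal length and dimension, and the function names (`cross_attention`, `bidirectional_fusion`) and the sigmoid-gated scalar fusion weight are illustrative stand-ins for the paper's learnable dynamic weighting factors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d_k):
    # Scaled dot-product attention: each query frame attends to all
    # key_value frames of the other modality.
    scores = query @ key_value.T / np.sqrt(d_k)   # (T_q, T_kv)
    return softmax(scores, axis=-1) @ key_value   # (T_q, d)

def bidirectional_fusion(audio, video, alpha_logit=0.0):
    # Bidirectional cross-attention: audio attends to video, and
    # video attends to audio, modeling complementary relations both ways.
    d = audio.shape[-1]
    a2v = cross_attention(audio, video, d)  # audio-dominant direction
    v2a = cross_attention(video, audio, d)  # video-dominant direction
    # A learnable scalar (sigmoid-gated here for simplicity) weights the
    # two branches; in the paper this weighting is trained dynamically.
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))
    return alpha * a2v + (1.0 - alpha) * v2a

rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 64))   # 10 frames of 64-dim audio features
video = rng.standard_normal((10, 64))   # 10 frames of 64-dim video features
fused = bidirectional_fusion(audio, video)
print(fused.shape)  # (10, 64)
```

With `alpha_logit = 0`, the two attention directions are weighted equally; during training the logit would be learned so the network can shift emphasis toward the video-dominant branch under low-SNR conditions.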
