Audio-Visual Dual-Dominant Speech Enhancement Method with Cross-Modal Bidirectional Attention

Abstract: In audio-visual multimodal speech enhancement, the audio modality typically dominates while the video modality fails to fully play its auxiliary role. To address this, an encoder-decoder architecture with cooperating audio-dominant and video-dominant branches is proposed. At the encoding layer, the video-dominant branch applies random-dimensional audio masking to simulate the loss of audio features under low signal-to-noise ratio (SNR) conditions and uses video features to guide the prediction and reconstruction of the missing audio features, thereby strengthening the auxiliary effectiveness of the video modality. The intermediate layer employs a cross-modal bidirectional cross-attention mechanism to model the dynamic complementary relationship between the audio and visual modalities. The decoding layer integrates the dual-branch features through learnable dynamic weighting factors to achieve efficient cross-modal fusion. Experiments on the GRID dataset show that the proposed method effectively improves speech enhancement performance in low-SNR scenarios, yielding gains of 0.123 to 0.156 in Perceptual Evaluation of Speech Quality (PESQ) and 1.78% to 2.21% in Short-Time Objective Intelligibility (STOI), and outperforming mainstream models in objective evaluations. Ablation studies further confirm the effectiveness of the bidirectional attention structure and the video-guided masking mechanism, demonstrating that the method moves beyond the traditional single-modality-dominant interaction paradigm and achieves collaborative cross-modal feature enhancement and robust representation learning.
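The abstract names three concrete mechanisms: random-dimensional audio masking in the video-dominant branch, bidirectional cross-attention between the two modalities, and learnable weighted fusion of the dual-branch features at the decoder. The sketch below is a minimal PyTorch illustration of how such components could be wired up; the module names, tensor dimensions, single-layer structure, mask ratio, and the sigmoid-gated scalar fusion weight are all assumptions for illustration and not the authors' implementation.

```python
# Illustrative sketch only: hypothetical modules loosely matching the
# mechanisms described in the abstract, not the paper's actual code.
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    """Audio attends to video and video attends to audio in parallel."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio: (B, Ta, dim), video: (B, Tv, dim)
        # Audio queries attend over video keys/values, and vice versa.
        a_enh, _ = self.a2v(query=audio, key=video, value=video)
        v_enh, _ = self.v2a(query=video, key=audio, value=audio)
        # Residual connections keep each stream's own information.
        return self.norm_a(audio + a_enh), self.norm_v(video + v_enh)


class DualBranchFusion(nn.Module):
    """Blend the two branch outputs with a learnable weight (assumed here
    to be a single sigmoid-gated scalar; the paper's exact scheme may differ)."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # mapped to (0, 1) below

    def forward(self, audio_branch: torch.Tensor, video_branch: torch.Tensor):
        w = torch.sigmoid(self.alpha)
        return w * audio_branch + (1.0 - w) * video_branch


def random_dim_audio_mask(audio: torch.Tensor, mask_ratio: float = 0.3):
    """Zero out a random subset of audio feature dimensions, mimicking the
    feature loss the video-dominant branch is trained to recover."""
    num_dims = audio.shape[-1]
    num_masked = int(mask_ratio * num_dims)
    idx = torch.randperm(num_dims)[:num_masked]
    masked = audio.clone()
    masked[..., idx] = 0.0
    return masked, idx


if __name__ == "__main__":
    B, Ta, Tv, D = 2, 100, 25, 256
    audio = torch.randn(B, Ta, D)   # e.g. spectrogram-frame embeddings
    video = torch.randn(B, Tv, D)   # e.g. lip-region frame embeddings

    masked_audio, _ = random_dim_audio_mask(audio)
    xattn = BidirectionalCrossAttention(dim=D)
    a_out, v_out = xattn(masked_audio, video)

    # Stand-in for a time-aligned video-dominant branch output.
    video_branch_feat = torch.randn_like(a_out)
    fused = DualBranchFusion()(a_out, video_branch_feat)
    print(fused.shape)  # torch.Size([2, 100, 256])
```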

     
