Unified Speech Enhancement Under Varying Input Conditions
Abstract: Intelligent speech interaction systems are increasingly deployed across a range of real-world applications. However, their performance often degrades significantly in complex acoustic environments because of diverse and challenging conditions. These conditions include non-linear speech distortions, varying levels of reverberation in indoor spaces, and different types of background noise in public areas. Additionally, hardware differences—such as variations in sampling rates, microphone types, channel counts, and array geometries—further complicate the problem. In contrast, traditional deep learning-based speech enhancement (SE) techniques are typically designed with narrow specialization, focusing on specific scenarios or hardware configurations. For instance, many models are trained exclusively for either single-channel or fixed multi-channel setups, or for particular sampling rates. This specialization creates challenges in real-world deployment, where multiple device configurations may coexist, leading to increased system complexity and resource requirements. Recent advances in signal processing and deep learning offer new opportunities to address these limitations. One promising direction is the development of unified SE techniques capable of handling speech signals with varying input conditions in a single model. Such a unified approach can overcome the limited scope of conventional methods by enabling models to automatically adapt to different input characteristics without explicit reconfiguration or model switching. Despite its practical importance, this area remains underexplored: most existing SE research focuses on constrained, scenario-specific conditions. Motivated by this gap, we present a comprehensive study on unified SE techniques and propose the first unified model, Unconstrained Speech Enhancement and Separation (USES), designed to operate under diverse input conditions. USES can effectively process speech signals with varying sampling rates, microphone counts and array geometries, durations, and acoustic environments in a unified manner. Compared with prior work, this is the first SE model to support such a wide range of input formats, incorporating innovations in multi-domain data preparation, model architecture, and training framework. Extensive experiments on standard SE benchmarks (e.g., VoiceBank+DEMAND, DNS-2020, CHiME-4) and the URGENT 2025 Challenge dataset demonstrate that USES not only achieves state-of-the-art performance on simulated evaluation data but also significantly improves robustness in real-world conditions.
For example, USES outperforms leading models on both the multi-channel WSJ0-2mix speech separation task and the DNS-2020 denoising benchmark, while unifying support for varied sampling rates and microphone setups. Additionally, the unified model reduces computational cost by 52% and 51% when processing 16 kHz and 48 kHz inputs, respectively, compared with the high-performing TF-GridNet baseline, achieving similar or better performance with lower complexity.
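To make "varying input conditions" concrete, the minimal sketch below illustrates the kind of interface a unified SE model is expected to expose: a single entry point that accepts mono or multi-channel recordings at arbitrary sampling rates and durations. It is illustrative only; the names UnifiedEnhancer, enhance, and _separate_or_denoise are hypothetical, and the actual USES model absorbs this variability inside the network rather than through external resampling and channel averaging as shown here.

# Illustrative sketch only: a toy interface that a unified speech enhancement
# model could expose. All names here are hypothetical and are NOT the actual
# USES implementation described in this thesis.
import numpy as np
from scipy.signal import resample_poly


class UnifiedEnhancer:
    """Toy wrapper accepting (channels, samples) arrays at any sampling rate."""

    def __init__(self, internal_fs: int = 16000):
        self.internal_fs = internal_fs  # reference rate used internally

    def enhance(self, wav: np.ndarray, fs: int) -> np.ndarray:
        if wav.ndim == 1:                        # promote mono to (1, samples)
            wav = wav[np.newaxis, :]
        x = resample_poly(wav, self.internal_fs, fs, axis=-1)   # rate-normalize
        y = self._separate_or_denoise(x)         # placeholder for the real model
        return resample_poly(y, fs, self.internal_fs, axis=-1)  # restore rate

    def _separate_or_denoise(self, x: np.ndarray) -> np.ndarray:
        # Stand-in for the neural network: collapse an arbitrary number of
        # channels to one reference channel, which a real model would enhance.
        return x.mean(axis=0, keepdims=True)


if __name__ == "__main__":
    model = UnifiedEnhancer()
    mono_48k = np.random.randn(48000 * 2)       # 2 s, 48 kHz, single channel
    array_16k = np.random.randn(6, 16000 * 4)   # 4 s, 16 kHz, 6-mic array
    print(model.enhance(mono_48k, fs=48000).shape)
    print(model.enhance(array_16k, fs=16000).shape)

Both calls go through the same entry point even though the sampling rate, channel count, and duration differ, which is the behavior a unified SE model must provide without manual reconfiguration or model switching.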