ZHANG Wangyou, QIAN Yanmin. Unified speech enhancement under varying input conditions[J]. Journal of Signal Processing, 2025, 41(9): 1494-1512. DOI: 10.12466/xhcl.2025.09.004.

Unified Speech Enhancement Under Varying Input Conditions

Intelligent speech interaction systems are increasingly deployed across a wide range of real-world applications, yet their performance often degrades significantly in complex acoustic environments. The challenging conditions include non-linear speech distortions, varying levels of reverberation in indoor spaces, and diverse types of background noise in public areas. Hardware differences, such as variations in sampling rate, microphone type, channel count, and array geometry, further complicate the problem. In contrast, conventional deep learning-based speech enhancement (SE) techniques are typically designed with narrow specialization, targeting specific scenarios or hardware configurations; for instance, many models are trained exclusively for either single-channel or fixed multi-channel setups, or for a particular sampling rate. This specialization complicates real-world deployment, where multiple device configurations may coexist, and increases system complexity and resource requirements. Recent advances in signal processing and deep learning offer new opportunities to address these limitations. One promising direction is the development of unified SE techniques capable of handling speech signals with varying input conditions in a single model. Such a unified approach overcomes the limited scope of conventional methods by enabling the model to adapt automatically to different input characteristics without explicit reconfiguration or model switching. Despite its practical importance, this area remains underexplored, as most existing SE research focuses on constrained, scenario-specific conditions. Motivated by this gap, we present a comprehensive study of unified SE techniques and propose the first unified model, Unconstrained Speech Enhancement and Separation (USES), designed to operate under diverse input conditions. USES can process speech signals with varying sampling rates, microphone counts and array geometries, signal durations, and acoustic environments in a unified manner. Compared with prior work, this is the first SE model to support such a wide range of input formats, incorporating innovations in multi-domain data preparation, model architecture, and the training framework. Extensive experiments on standard SE benchmarks (e.g., VoiceBank+DEMAND, DNS-2020, CHiME-4) and the URGENT 2025 Challenge dataset demonstrate that USES not only achieves state-of-the-art performance on simulated evaluation data but also significantly improves robustness in real-world conditions. For example, USES outperforms leading models on both the WSJ0-2mix speech separation task and the DNS-2020 denoising benchmark while unifying support for varied sampling rates and microphone setups. Moreover, the unified model reduces computational cost by 52% and 51% when processing 16 kHz and 48 kHz inputs, respectively, compared with the strong TF-GridNet baseline, achieving similar or better performance at lower complexity.
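
The abstract does not spell out how one network can absorb arbitrary channel counts and sampling rates. The sketch below is a minimal PyTorch illustration of that general idea, assuming a resample-to-reference-rate front end and simple cross-channel feature averaging; the class and its parameters are hypothetical and do not reproduce the actual USES architecture, data preparation, or training framework described in the paper.

```python
# Minimal, illustrative sketch (not the actual USES model): a single network
# instance that accepts inputs with any number of microphones and any sampling
# rate, via resampling to a reference rate and averaging features over channels.
import torch
import torch.nn as nn
import torchaudio.functional as AF


class ChannelAgnosticEnhancer(nn.Module):
    """Toy enhancer: per-channel encoding, cross-channel pooling, spectral mask."""

    def __init__(self, ref_sr: int = 16000, n_fft: int = 512, hidden: int = 128):
        super().__init__()
        self.ref_sr, self.n_fft = ref_sr, n_fft
        self.register_buffer("window", torch.hann_window(n_fft))
        n_bins = n_fft // 2 + 1
        self.encoder = nn.GRU(n_bins, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_bins)

    def forward(self, wav: torch.Tensor, sr: int) -> torch.Tensor:
        # wav: (channels, samples), recorded at an arbitrary sampling rate `sr`.
        if sr != self.ref_sr:
            wav = AF.resample(wav, sr, self.ref_sr)          # unify the sampling rate
        spec = torch.stft(wav, self.n_fft, window=self.window,
                          return_complex=True)               # (C, F, T)
        feats, _ = self.encoder(spec.abs().transpose(1, 2))  # per-channel (C, T, H)
        pooled = feats.mean(dim=0, keepdim=True)             # average over channels
        mask = torch.sigmoid(self.mask(pooled)).transpose(1, 2)  # (1, F, T)
        enhanced = spec[:1] * mask                           # mask the reference channel
        out = torch.istft(enhanced, self.n_fft, window=self.window,
                          length=wav.shape[-1])
        if sr != self.ref_sr:
            out = AF.resample(out, self.ref_sr, sr)          # restore the input rate
        return out                                           # (1, samples) at rate `sr`


# The same instance processes a 4-channel 48 kHz clip and a mono 8 kHz clip.
model = ChannelAgnosticEnhancer()
print(model(torch.randn(4, 48000), sr=48000).shape)          # torch.Size([1, 48000])
print(model(torch.randn(1, 8000), sr=8000).shape)            # torch.Size([1, 8000])
```

Averaging encoder features across channels keeps the parameter count independent of the microphone count, which is one common way to let a single model serve both single- and multi-channel inputs; the paper's actual design choices may differ.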