Lightweight Multichannel Speech Enhancement Based on Reparameterized Convolution
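Summary: Multichannel speech enhancement exploits the spatial sensing capability of microphone arrays to extract high-quality target speech from noisy signals, and is a key preprocessing stage for applications such as voice interaction, conference communication, and hearing aids. However, on edge devices constrained by real-time causality and by limited compute and memory budgets, existing methods struggle to balance performance and efficiency in low signal-to-noise ratio and complex noise conditions. To address this, we propose a deployment-oriented, lightweight causal multichannel network, the Multi-Branch Causal Network (MBCNet). The network jointly encodes auditory features, complex spectra, and spatial cues, and uses a reparameterizable multi-branch convolution as its core unit: during training, parallel branches model temporal, spectral, and coupled time-frequency information; at inference, linear-equivalence fusion collapses the branches into a single-kernel convolution, adding no inference overhead. We also design a frequency downsampling-upsampling module that, compared with the conventional "convolution plus transposed convolution" combination, doubles the frequency receptive field at nearly unchanged computational cost, thereby strengthening broadband noise suppression. Furthermore, based on a "branch importance" measure, branches are added, removed, and fine-tuned under constraints after the model converges, so that computational resources are adaptively reallocated to critical channels and scales, yielding quantifiable performance gains at fixed inference complexity. Ablation experiments show that the multi-branch convolution outperforms conventional convolution at equivalent complexity, that removing either spatial or complex-domain features degrades the metrics, and that the proposed frequency module is more robust while preserving causality. Comparative experiments further verify that MBCNet achieves noise-reduction performance better than or comparable to mainstream models with a smaller parameter count and lower computational cost, demonstrating good effectiveness and deployment potential on edge devices.
Abstract: Multichannel speech enhancement leverages the spatial perception of microphone arrays to extract high-quality target speech from noisy mixtures, thereby serving as a critical preprocessing stage for automatic speech recognition, teleconferencing, and assistive hearing. Although deep neural approaches currently dominate, ranging from hybrids that couple learning with classical spatial filtering to fully neural beamforming, their deployment on edge devices remains difficult. Models must simultaneously satisfy strict real-time causality, tight compute and memory budgets, and high accuracy under low signal-to-noise ratio (SNR) and nonstationary, spatially complex noise. Existing lightweight solutions often fall short of this triad, and methods that stay below a few hundred million multiply-accumulate operations per second (MMACs/s) while remaining competitive at low SNR are rare. To address these limitations, we propose the Multi-Branch Causal Network (MBCNet), a deployment-oriented, lightweight multichannel architecture built around convolutional reparameterization. MBCNet jointly encodes auditory features, complex spectral representations, and spatial cues.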
Its backbone comprises three parts: (i) a parallel feature encoder that aligns and fuses the three streams; (ii) a deep extractor with a symmetric encoder-decoder and multilevel frequency downsampling-upsampling blocks that expand the effective frequency receptive field; and (iii) a mask estimation head that predicts multichannel complex filters for reconstructing the enhanced signal. Self-attention components are integrated where beneficial to capture long-range dependencies without violating causality. The first key contribution is the reparameterizable multi-branch convolution (RepMBConv). During training, RepMBConv uses five coordinated branches (temporal, spectral, joint time-frequency, refinement, and identity) to enrich feature diversity and learn complementary inductive biases. At inference, the branches are analytically fused into a single convolutional kernel through linear equivalence, incurring zero extra computational overhead. Branch-importance analysis further reveals a hierarchical learning behavior: shallow stages emphasize local refinement, whereas deeper stages prioritize temporal and spectral abstractions. We exploit this property after convergence to add, prune, and fine-tune branches, reallocating capacity to critical channels and scales to yield measurable gains without increasing inference complexity. The second contribution is a frequency downsampling-upsampling module that replaces the conventional pairing of convolution and transposed convolution. Downsampling is realized by frequency-index splitting, channel stacking, and convolution; upsampling reverses this process via channel separation, frequency-index recombination, and convolution. This design doubles the frequency receptive field without increasing computational cost, improves broadband noise suppression, and avoids the artifacts associated with deconvolution, all while preserving streaming causality.
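The index rearrangement behind the frequency downsampling-upsampling module can be illustrated with a minimal sketch. It assumes an even/odd split of frequency bins and a nested-list feature map laid out as [channel][frame][frequency]; the abstract does not specify the split pattern, the tensor layout, or the convolutions applied after each rearrangement, so the function names `freq_downsample` and `freq_upsample` and all of these choices are hypothetical.

```python
def freq_downsample(x):
    """Split frequency bins by index parity and stack the halves as channels.

    x has shape [C][T][F] with F even; the result has shape [2C][T][F // 2].
    Assumption: even/odd interleaving is used as the split rule.
    """
    even = [[frame[0::2] for frame in channel] for channel in x]
    odd = [[frame[1::2] for frame in channel] for channel in x]
    return even + odd  # channel stacking: even-bin channels first, then odd


def freq_upsample(y):
    """Inverse rearrangement: separate channels and re-interleave the bins."""
    half = len(y) // 2
    restored = []
    for even_ch, odd_ch in zip(y[:half], y[half:]):
        channel = []
        for even, odd in zip(even_ch, odd_ch):
            frame = [None] * (len(even) + len(odd))
            frame[0::2] = even  # even-indexed bins return to their positions
            frame[1::2] = odd   # odd-indexed bins fill the gaps
            channel.append(frame)
        restored.append(channel)
    return restored


# Toy feature map: 1 channel, 2 frames, 4 frequency bins.
x = [[[0, 1, 2, 3], [4, 5, 6, 7]]]
down = freq_downsample(x)
restored = freq_upsample(down)
```

Under these assumptions, a convolution with k taps along the halved frequency axis spans 2k of the original bins, which is one way to read the receptive-field doubling claimed above. The time axis is never touched, so the rearrangement itself cannot introduce look-ahead.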
Ablation studies confirm RepMBConv's superiority over standard and dilated convolutions under matched complexity and show that removing spatial or complex-domain features degrades performance. In comparative experiments, MBCNet achieves superior or comparable denoising performance with fewer parameters and lower computational cost, validating its effectiveness and deployment potential on edge devices.
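The linear-equivalence fusion described for RepMBConv can be checked numerically on a toy case. The sketch below is not the authors' implementation: it assumes a single channel, no normalization layers, a 3x3 fused kernel, and branch kernel shapes of 3x1 (temporal), 1x3 (spectral), 3x3 (joint), 1x1 (refinement), and a fixed 1x1 identity. All of these shapes and helper names are hypothetical. Each branch kernel is zero-padded into the fused kernel, aligned to the most recent frames in time and centered in frequency.

```python
import random

KT, KF = 3, 3  # assumed fused kernel: 3 past-only frames x 3 frequency bins


def causal_conv2d(x, k):
    """Single-channel correlation of x [T][F] with kernel k [kt][kf].

    Time is zero-padded on the past side only (causal); frequency is
    zero-padded symmetrically. Kernel row kt-1 multiplies the current frame.
    """
    T, F = len(x), len(x[0])
    kt, kf = len(k), len(k[0])
    out = [[0.0] * F for _ in range(T)]
    for t in range(T):
        for f in range(F):
            acc = 0.0
            for i in range(kt):
                tt = t - (kt - 1) + i
                if tt < 0:
                    continue  # past-side zero padding
                for j in range(kf):
                    ff = f - kf // 2 + j
                    if 0 <= ff < F:
                        acc += k[i][j] * x[tt][ff]
            out[t][f] = acc
    return out


def fuse_branches(kernels):
    """Zero-pad every branch kernel to KT x KF and sum them into one kernel."""
    fused = [[0.0] * KF for _ in range(KT)]
    for k in kernels:
        kt, kf = len(k), len(k[0])
        r0 = KT - kt         # align to the most recent frames
        c0 = (KF - kf) // 2  # center along frequency
        for i in range(kt):
            for j in range(kf):
                fused[r0 + i][c0 + j] += k[i][j]
    return fused


def multi_branch_forward(x, kernels):
    """Training-time view: run every branch separately and sum the outputs."""
    total = [[0.0] * len(x[0]) for _ in x]
    for k in kernels:
        y = causal_conv2d(x, k)
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, y)]
    return total


def rand_kernel(kt, kf):
    return [[random.uniform(-1, 1) for _ in range(kf)] for _ in range(kt)]


random.seed(0)
branches = {
    "temporal": rand_kernel(KT, 1),
    "spectral": rand_kernel(1, KF),
    "joint": rand_kernel(KT, KF),
    "refinement": rand_kernel(1, 1),  # assumed to be a 1x1 convolution
    "identity": [[1.0]],              # pass-through expressed as a 1x1 kernel
}
x = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(6)]  # T=6, F=5

train_out = multi_branch_forward(x, list(branches.values()))
fused = fuse_branches(branches.values())
infer_out = causal_conv2d(x, fused)
max_err = max(abs(a - b) for ra, rb in zip(train_out, infer_out)
              for a, b in zip(ra, rb))
```

Because every branch is linear and shares the same input, summing the padded kernels gives the same output as summing the branch outputs up to floating-point rounding, so inference needs only the single fused convolution. A multi-channel version with batch normalization would additionally fold each branch's normalization statistics into its kernel and bias before the sum, which this sketch omits.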