Time Difference of Arrival Estimator Based on Generative Adversarial Networks

代浩阳; 呼德

doi:10.12466/xhcl.2025.09.011

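Abstract: Time difference of arrival (TDOA) is an important acoustic spatial feature that is widely used in multichannel acoustic signal processing tasks. Traditional TDOA estimators, such as the generalized cross-correlation with phase transform (GCC-PHAT) method, perform well in ideal acoustic environments, but their performance degrades markedly in complex scenarios such as low signal-to-noise ratio (SNR) and strong reverberation. Recently, with the rapid development of deep learning, a number of data-driven TDOA estimators have emerged; they achieve high estimation accuracy, but their robustness to strong noise and strong reverberation remains limited. To address this, this paper proposes a TDOA estimator based on a generative adversarial network (GAN), which uses an adversarial training mechanism to strengthen the model's robustness in low-SNR, highly reverberant environments. The main contributions are as follows. First, TDOA estimation is implemented within a GAN framework for the first time, and adversarial training between the generator and the discriminator significantly improves the model's generalization. Second, the generator first uses a gated recurrent unit (GRU) to expand the dimensionality of the raw audio and then extracts cross-correlation features based on the GCC-PHAT transform, which increases the model's sensitivity to time-delay information. Third, the discriminator is built on a convolutional neural network (CNN): it extracts high-dimensional features from the input signal through a multilayer convolutional structure and, combined with the input ground-truth or predicted TDOA, outputs a confidence score. Fourth, the generator jointly optimizes a cross-entropy loss and an adversarial loss, while the discriminator is trained to better distinguish true TDOAs from those predicted by the generator. This design draws on the idea of the Wasserstein GAN (WGAN), using the confidence score output by the discriminator as part of the generator's loss function; this not only improves training stability and overcomes problems such as mode collapse, but also raises the performance ceiling beyond that of a conventional single loss function and single training mode. To verify the effectiveness of the proposed method, we carried out comparative experiments on public datasets against the classical GCC-PHAT method and recent deep-learning TDOA estimators. The results show that the proposed method performs well in low-SNR, highly reverberant environments, with TDOA estimation accuracy significantly better than that of the compared methods.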
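For readers unfamiliar with the classical baseline named in the abstract, the following is a minimal NumPy sketch of GCC-PHAT delay estimation between two channels. It is a textbook formulation, not the authors' implementation; the function name `gcc_phat`, its parameters, and the small constant added to the denominator are assumptions made for illustration.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay (in seconds) of x2 relative to x1 with GCC-PHAT.

    Illustrative sketch: the name and signature are hypothetical, not taken
    from the paper. A positive result means x2 lags behind x1.
    """
    # Zero-pad so that the circular correlation behaves like a linear one.
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)

    # Cross-power spectrum, whitened by its magnitude (the PHAT weighting).
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12  # small constant avoids division by zero

    # Back to the lag domain.
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs
```

Calling `gcc_phat(x1, x2, fs=16000)` on two channels returns the estimated TDOA in seconds. Because the PHAT weighting discards magnitude information, the correlation peak is sharp in clean conditions but becomes unreliable when noise or reverberation dominates the phase, which is the weakness the abstract refers to.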

Abstract: The time difference of arrival (TDOA) is a crucial acoustic spatial characteristic that is widely employed in multichannel audio signal processing applications. Traditional TDOA estimators, such as the generalized cross-correlation with phase transform (GCC-PHAT) method, perform well under ideal acoustic conditions. However, their accuracy deteriorates significantly under low signal-to-noise ratio (SNR) and strong reverberation conditions. Recent advances in deep learning have spurred the development of data-driven TDOA estimators with high estimation accuracy but limited robustness under severe noise and strong reverberation. To address these limitations, this paper proposes a generative adversarial network (GAN)-based TDOA estimator that enhances model robustness in low-SNR and highly reverberant environments through adversarial training. This study is the first to propose a GAN-based TDOA estimation framework, which significantly improves model generalization via adversarial training between the generator and the discriminator. The generator employs gated recurrent units (GRUs) to expand the dimensionality of the raw audio signals and extracts GCC-PHAT-based cross-correlation features to enhance the model’s sensitivity to time-delay information. The convolutional neural network (CNN)-based discriminator uses multilayer convolutional structures to extract high-dimensional features from the input signal, which are then fused with either the ground-truth or the predicted TDOA values to produce confidence scores. The generator is optimized using a joint loss function that combines cross-entropy and adversarial losses, while the discriminator is trained to better distinguish ground-truth TDOAs from those predicted by the generator. This design incorporates principles from Wasserstein GANs (WGANs) by integrating the discriminator’s output confidence scores into the generator’s loss function. This approach not only stabilizes model training but also helps overcome problems such as mode collapse, and it raises the performance ceiling beyond that of conventional training schemes based on a single loss function. To validate the effectiveness of the proposed method, we conducted comparative experiments on public datasets against the classical GCC-PHAT method and state-of-the-art deep learning-based TDOA estimators. The experimental results demonstrate that the proposed method performs well in acoustic environments characterized by low SNRs and strong reverberation, and that its TDOA estimation accuracy significantly exceeds that of the baseline methods.
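One possible reading of the training objective described in the abstract can be written as follows. This is a sketch under stated assumptions, not the paper's exact formulation: the symbols $x$ (multichannel input signal), $y$ (ground-truth TDOA label), $G$ (generator), $D$ (discriminator producing a confidence score), and the weight $\lambda$ are introduced here for illustration, and the signs follow the standard WGAN convention.

```latex
\begin{aligned}
\mathcal{L}_G &= \mathcal{L}_{\mathrm{CE}}\bigl(y,\, G(x)\bigr) \;-\; \lambda\, D\bigl(x,\, G(x)\bigr), \\
\mathcal{L}_D &= D\bigl(x,\, G(x)\bigr) \;-\; D\bigl(x,\, y\bigr).
\end{aligned}
```

Under this reading, the generator is rewarded both for matching the label (the cross-entropy term) and for producing estimates that the discriminator scores highly, while the discriminator learns to score ground-truth TDOAs above predicted ones.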

Time Difference of Arrival Estimator Based on Generative Adversarial Networks