Generative Sequence-to-Sequence Framework for Target Speaker Voice Activity Detection and Diarization
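Summary: Neural speaker diarization systems are usually built on discriminative algorithms, in which a fixed input yields a fixed output. This can be problematic because diarization labels are interval annotations whose speaker boundaries often carry some error, and those errors can affect the training of a discriminative model. Generative algorithms have recently attracted considerable research attention. Their inference is typically iterative, which can yield more refined results, and because they model a distribution they are less affected by label errors. Neural diarization systems broadly fall into two classes: end-to-end systems and target speaker voice activity detection systems. In this paper we apply generative algorithms to a sequence-to-sequence target speaker voice activity detection system. On top of this system we implement two generative algorithms, diffusion and flow matching, to predict the distribution of the output. In our experiments, generative algorithms performed poorly when applied directly in the binary label space of speech activity. We therefore propose a label autoencoder that compresses the binary label sequence into a lower-dimensional, continuous latent space. In this latent space, the proposed flow-matching algorithm outperforms the baseline system. In addition, because a generative algorithm predicts a distribution, repeated sampling gives different results. Fusing the results of multiple flow-matching samples improves the system further, and the final system achieves a relative improvement of about 12% over the baseline.
Abstract: Speaker diarization has been developed over several decades and plays a pivotal role in speech processing, particularly in multi-speaker scenarios. A speaker diarization system segments multi-speaker audio into homogeneous regions, each with a consistent speaker identity. In the era of big data, extracting single-speaker segments from large volumes of unlabeled speech has become increasingly important, which makes speaker diarization systems essential tools. Prior research on neural network-based speaker diarization has focused largely on architectural design, and most implementations use binary cross-entropy as the optimization objective. This is intuitive, because speaker activity is typically represented as a binary label sequence: 0 for absence and 1 for presence of the target speaker. Most such systems use discriminative algorithms that produce a fixed output for a fixed input. However, this fixed mapping can limit performance: speaker diarization labels are interval-based and often have uncertain boundaries, which may adversely affect the training of discriminative algorithms. Recently, generative algorithms have gained attention for their iterative inference and distribution modeling, which enable finer-grained results and increase resilience to label uncertainty.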
Neural network-based diarization systems generally fall into two categories: end-to-end systems and target speaker voice activity detection (TSVAD) systems. In this paper, we investigate the application of generative algorithms to a sequence-to-sequence TSVAD system. Building upon an existing implementation, we integrate two generative models, diffusion and flow matching, to predict the distribution of the output. Experiments show that generative models underperform when applied directly in the binary label space. To mitigate this, we propose a label autoencoder that compresses binary label sequences into a low-dimensional continuous latent space. Within this space, our flow-matching algorithm outperforms the baseline. Moreover, because generative models are stochastic, repeated sampling yields different results. Fusing these results improves performance further, giving approximately 12% relative improvement over the baseline.
Additional findings: diffusion performs significantly worse and converges more slowly than flow matching, which follows deterministic sampling paths, owing to the inherent stochasticity of diffusion. Notably, while generative methods in other domains often require more than 10 inference steps, only two steps were sufficient for effective diarization. Although our generative model yields variable outputs, performance remains robust, and all samples consistently outperform the discriminative baseline. We also run ablation studies on the compression of binary label sequences into a continuous low-dimensional latent space to determine the optimal configuration. Finally, while this study focuses on TSVAD systems, extending generative methods to end-to-end diarization architectures remains a promising direction for future research.
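The sampling-and-fusion idea described above can be illustrated with a minimal sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration: `encode_labels` and `decode_latent` are hypothetical placeholders for the label autoencoder, the velocity field is a hand-written toy rather than a trained network conditioned on audio and speaker embeddings, and fusion is taken to be a simple average of decoded probabilities (the abstract does not specify the fusion rule). The sketch only shows the general shape of the procedure: integrate an ODE from Gaussian noise with two Euler steps, draw several samples, and fuse them.

```python
import numpy as np

def encode_labels(labels):
    # Hypothetical stand-in for the label autoencoder's encoder: 0/1 -> -1/+1.
    return 2.0 * labels.astype(float) - 1.0

def decode_latent(z):
    # Hypothetical stand-in for the decoder: latent -> activity probability in [0, 1].
    return np.clip((z + 1.0) / 2.0, 0.0, 1.0)

def euler_sample(velocity, shape, num_steps, rng):
    # Integrate dx/dt = velocity(x, t) from t=0 (Gaussian noise) to t=1
    # with fixed-size Euler steps.
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x

def fuse(prob_samples, threshold=0.5):
    # Assumed fusion rule: average per-frame probabilities over samples, then binarize.
    return (np.mean(prob_samples, axis=0) >= threshold).astype(int)

rng = np.random.default_rng(0)
labels = np.array([0] * 10 + [1] * 20 + [0] * 10)  # one speaker, 40 frames
target = encode_labels(labels)

# Toy velocity field that pulls x toward the target latent. A real system
# would use a trained network here; this is only a placeholder.
velocity = lambda x, t: target - x

# Two Euler steps per sample, eight independent samples.
latents = [euler_sample(velocity, target.shape, num_steps=2, rng=rng) for _ in range(8)]
probs = np.stack([decode_latent(z) for z in latents])
fused = fuse(probs)

print("samples differ:", not np.allclose(latents[0], latents[1]))
print("fused matches labels:", np.array_equal(fused, labels))
```

With this toy field, two Euler steps leave a quarter of the initial noise in each sample, so individual samples differ from one another; averaging the decoded probabilities before thresholding suppresses that residual noise.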