Generative Sequence-to-Sequence Framework for Target Speaker Voice Activity Detection and Diarization
-
Abstract
Speaker diarization has been developed over decades and plays a pivotal role in speech processing, particularly in multi-speaker scenarios. Speaker diarization systems segment multi-speaker audio into homogeneous regions with consistent speaker identities. In the era of big data, extracting single-speaker segments from large volumes of unlabeled speech has become increasingly important, making speaker diarization systems essential tools. Prior research on neural network-based speaker diarization has focused largely on architectural design, with most implementations relying on binary cross-entropy as the optimization objective. This is intuitive, as speaker activity is typically represented as a binary label sequence: 0 for absence and 1 for presence of the target speaker. Most such systems use discriminative algorithms that produce fixed outputs for fixed inputs. However, this fixed mapping can limit performance, since speaker diarization labels are interval-based and often have uncertain boundaries that may adversely affect the training of discriminative algorithms. Recently, generative algorithms have gained attention for their iterative inference and distribution-modeling capabilities, which enable finer-grained results and greater resilience to label uncertainty. Neural network-based diarization systems generally fall into two categories: end-to-end systems and target speaker voice activity detection (TSVAD) systems. In this paper, we investigate the application of generative algorithms to a sequence-to-sequence TSVAD system. Building upon an existing implementation, we integrate two generative models—diffusion and flow-matching algorithms—to predict outcome distributions. Experiments show that generative models underperform in the binary label space. To mitigate this, we propose a label autoencoder that compresses binary label sequences into a low-dimensional continuous latent space. Within this space, our flow-matching algorithm outperforms the baseline.
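To make the latent-space idea concrete, the sketch below shows the overall flow described above: binary per-frame labels are mapped into a low-dimensional continuous latent space, new latents are drawn by integrating a flow-matching velocity field for a small number of Euler steps, and the decoder turns the result back into binary activity decisions. Everything here is a hedged stand-in: the sizes, the tied linear autoencoder, and the `velocity` function are illustrative placeholders, not the paper's trained models (a real system would condition the velocity field on speaker embeddings and acoustic features).

```python
import numpy as np

rng = np.random.default_rng(0)

T, S, D = 200, 4, 16  # frames, speakers, latent dim (illustrative sizes only)

# Stand-in for a trained label autoencoder: a random tied linear encoder/decoder.
W_enc = rng.normal(size=(S, D)) / np.sqrt(S)
W_dec = W_enc.T

def encode(labels):            # (T, S) binary labels -> (T, D) continuous latent
    return labels @ W_enc

def decode(latent):            # (T, D) latent -> (T, S) activity probabilities
    return 1.0 / (1.0 + np.exp(-(latent @ W_dec)))

def velocity(x, t):
    # Placeholder for the learned flow-matching velocity field v_theta(x, t).
    return -x * (1.0 - t)

def sample_latent(shape, steps=2):
    # Euler integration of the flow from noise toward data; the abstract's
    # finding is that as few as two steps suffice for diarization.
    x = rng.normal(size=shape)
    for i in range(steps):
        t = i / steps
        x = x + velocity(x, t) / steps
    return x

toy_labels = rng.integers(0, 2, size=(T, S))
z = encode(toy_labels)                  # labels compressed to the latent space

latent = sample_latent((T, D), steps=2)
probs = decode(latent)                  # per-frame, per-speaker probabilities
labels = (probs > 0.5).astype(int)      # binary diarization decisions
```

The point of the sketch is the pipeline shape (encode, integrate, decode, threshold), not the particular functions, which here are random stand-ins.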
Moreover, the stochastic nature of generative models allows multiple sampling passes that yield varying results. Fusing these results leads to further performance gains, achieving approximately a 12% relative improvement over the baseline. Additional findings include the significantly lower performance and slower convergence of diffusion algorithms, owing to their inherent stochasticity, compared to flow-matching with its deterministic sampling paths. Notably, while generative methods in other domains often require more than 10 inference steps, only two steps were sufficient for effective diarization. Although our generative model yields variable outputs, performance remains robust, and all samples consistently outperform the discriminative baseline. We also explore the compression of binary label sequences into a continuous low-dimensional latent space through ablation studies to determine optimal configurations. Finally, while this study focuses on TSVAD systems, extending generative methods to end-to-end diarization architectures remains a promising direction for future research.
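The fusion result mentioned above relies only on the fact that each stochastic sampling pass produces a full set of per-frame posteriors, which can then be averaged before thresholding. The sketch below illustrates that mechanic with a purely synthetic stand-in for the generative decoder; the function name, sizes, and noise model are assumptions for illustration, not the paper's system.

```python
import numpy as np

T, S, K = 200, 4, 8  # frames, speakers, number of stochastic sampling passes

def run_generative_decoder(seed):
    # Placeholder for one stochastic pass of a generative TSVAD decoder;
    # in the real system each pass would start from a different noise draw
    # and run the (two-step) sampling procedure.
    local = np.random.default_rng(seed)
    base = local.random((T, S))
    return np.clip(base + local.normal(scale=0.05, size=(T, S)), 0.0, 1.0)

# Fuse by averaging the per-frame posteriors across passes, then threshold.
fused = np.mean([run_generative_decoder(k) for k in range(K)], axis=0)
decisions = (fused > 0.5).astype(int)
```

Averaging posteriors before thresholding is one simple fusion rule; the abstract does not specify the exact fusion scheme, so this should be read as one plausible instantiation.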