Latent Space Residual Denoising for Video Semantic Communication

许铭楷; 吴泳澎; 张文军

doi:10.12466/xhcl.2025.10.004

Abstract: With the proliferation of multimedia applications, visual data such as video has come to dominate network traffic, imposing increasingly stringent requirements on high-reliability, low-latency data transmission. Conventional separate source-channel coding schemes face performance bottlenecks in dynamically varying channel environments. As an emerging paradigm, semantic communication improves transmission efficiency and robustness by extracting and transmitting the semantic information of the source. However, existing latent-space denoising methods in visual semantic communication still suffer from high computational complexity and insufficient fidelity of the recovered semantics. To address these challenges, this paper proposes a video semantic communication framework based on latent-space residual denoising. The framework builds a joint source-channel codec on the Swin Transformer and designs an iterative semantic denoiser that combines residual learning with similarity learning. The residual mapping directly predicts and removes the channel noise, which significantly improves denoising efficiency. In addition, a similarity score driven by the channel signal-to-noise ratio (SNR) is introduced as a conditional input to dynamically adjust the denoising strength, and an adaptive denoising-step strategy balances performance against latency. Simulation results show that the proposed method effectively removes noise over additive white Gaussian noise (AWGN) channels, and that the reconstructed video outperforms both conventional separate coding and end-to-end semantic communication schemes on multiple distortion and perceptual metrics, with particularly strong robustness under high-noise conditions. Furthermore, Monte Carlo simulations verify that the proposed SNR-based similarity initialization formula closely approximates the empirical distribution.
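The Monte Carlo check mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not state the paper's initialization formula, so the closed form used here is an assumption: for a high-dimensional latent vector corrupted by independent AWGN, the cosine similarity between the clean and noisy vectors concentrates around sqrt(SNR / (1 + SNR)). The function names, the i.i.d. Gaussian latent model, and the dimension of 1024 are all hypothetical choices made for illustration.

```python
import math
import random


def awgn(x, snr_db, rng):
    """Add white Gaussian noise to x at the given SNR in dB.

    The noise variance is set from the measured average power of x.
    """
    power = sum(v * v for v in x) / len(x)
    sigma = math.sqrt(power / 10 ** (snr_db / 10))
    return [v + rng.gauss(0.0, sigma) for v in x]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))


def predicted_similarity(snr_db):
    """Assumed high-dimensional approximation: sqrt(snr / (1 + snr)).

    This is an illustrative closed form, not the formula from the paper.
    """
    snr = 10 ** (snr_db / 10)
    return math.sqrt(snr / (1.0 + snr))


def empirical_similarity(snr_db, dim=1024, trials=50, seed=0):
    """Monte Carlo estimate of the mean clean-vs-noisy cosine similarity."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Hypothetical latent model: i.i.d. standard Gaussian entries.
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += cosine(x, awgn(x, snr_db, rng))
    return total / trials


if __name__ == "__main__":
    for snr_db in (0, 5, 10, 20):
        pred = predicted_similarity(snr_db)
        emp = empirical_similarity(snr_db)
        print(f"SNR {snr_db:>2} dB  predicted {pred:.4f}  empirical {emp:.4f}")
```

Under these assumptions the empirical mean should track the closed form closely at every SNR, which is the kind of agreement the abstract reports for its own formula. A score of this type could then serve as the SNR-dependent condition passed to a denoiser, although how the paper actually uses it is not specified in the abstract.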

Latent Space Residual Denoising for Video Semantic Communication