Short- and Long-Term Aware Two-Stage Motion Compensation for Video Compression
Abstract: In hybrid video coding frameworks, inter-frame prediction is a key component for eliminating temporal redundancy and improving coding efficiency. Most existing methods rely solely on the immediately preceding frame as a reference: motion between the reference and target frames is estimated by a neural network, encoded, and transmitted, and then applied to the reference frame to produce an aligned prediction. Because they depend only on short-term references, these methods struggle in complex scenes involving occlusion and fast motion, and fail to exploit high-quality reference frames over longer temporal ranges. Although some recent approaches incorporate long-term reference information, they typically adopt simple stacking strategies or loss-driven implicit fusion, lacking a targeted mechanism for guiding reference-frame utilization. To address these issues, this paper proposes a Short- and Long-Term Aware Two-Stage Motion Compensation method for video compression. Specifically, motion information is first estimated from the short-term reference frame to generate initially aligned features, establishing a basic temporal correspondence. Prompt features are then extracted from long-term reference frames to guide detail enhancement of the initially aligned features, effectively alleviating occlusion and motion artifacts at low bitrate overhead. Furthermore, we design an Explicit-Implicit Temporal Reference Buffer that manages reference information across different temporal ranges: short-term reference frames are explicitly modeled to preserve fine spatial details, while long-term reference frames are implicitly modeled into a compact temporal representation, providing stable and complementary context for the second compensation stage. Experimental results show that the proposed method achieves superior rate-distortion performance, in terms of peak signal-to-noise ratio and multi-scale structural similarity, compared with the hybrid coding standard VTM-19.0, the end-to-end video coding method DCVC-RT, and the multi-reference-frame method DCVC-SDD. Ablation studies further verify the effectiveness of both the two-stage motion compensation strategy and the explicit-implicit temporal reference modeling mechanism.
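To make the two-stage data flow concrete, the following is a minimal PyTorch sketch of the compensation path described above. All module choices (the single-convolution stand-ins for the motion estimator, prompt extractor, and refinement network), channel sizes, and the backward-warping scheme are assumptions for illustration; the abstract specifies only the high-level flow, i.e. short-term alignment first, then long-term prompt-guided enhancement.

```python
# Hypothetical sketch of the two-stage motion compensation; every module
# below is a placeholder for the real (unspecified) network.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp features with a dense flow field of shape (B, 2, H, W)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    # Shift the sampling grid by the flow and normalize to [-1, 1].
    grid_x = (xs + flow[:, 0]) / max(w - 1, 1) * 2 - 1
    grid_y = (ys + flow[:, 1]) / max(h - 1, 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

class TwoStageCompensation(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.motion_net = nn.Conv2d(2 * ch, 2, 3, padding=1)   # stand-in for a flow estimator
        self.prompt_net = nn.Conv2d(ch, ch, 3, padding=1)      # prompt features from long-term refs
        self.refine_net = nn.Conv2d(2 * ch, ch, 3, padding=1)  # prompt-guided detail enhancement

    def forward(self, target_feat, short_ref_feat, long_ref_feat):
        # Stage 1: estimate motion against the short-term reference and
        # produce the initially aligned features.
        flow = self.motion_net(torch.cat([target_feat, short_ref_feat], dim=1))
        aligned = warp(short_ref_feat, flow)
        # Stage 2: prompt features from the long-term reference guide a
        # residual refinement of the initial alignment (e.g. in occluded regions).
        prompt = self.prompt_net(long_ref_feat)
        refined = aligned + self.refine_net(torch.cat([aligned, prompt], dim=1))
        return refined, flow
```

Note that only the flow in stage 1 would need to be entropy-coded and transmitted; the stage-2 refinement is driven by reference content already available at the decoder, which is consistent with the claimed low bitrate overhead.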
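The Explicit-Implicit Temporal Reference Buffer can be sketched in the same spirit. The gated recurrent update below is one hypothetical way to fold older reconstructions into a compact implicit state while keeping the latest reconstruction explicit; the abstract states only that short-term references are modeled explicitly and long-term references are compressed into a compact representation, not how the compression is performed.

```python
# Hypothetical sketch of the explicit-implicit reference buffer: the most
# recent reconstruction is kept explicitly at full detail, while older
# frames are absorbed into a compact recurrent summary.
import torch
import torch.nn as nn

class ExplicitImplicitBuffer(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # Gated update that blends each new reconstruction into the
        # implicit long-term state (assumed design, GRU-like).
        self.update_gate = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.candidate = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.explicit_ref = None    # short-term: kept as-is, full spatial detail
        self.implicit_state = None  # long-term: compact temporal summary

    def push(self, recon_feat):
        """Call once per decoded frame with its reconstructed features."""
        if self.implicit_state is None:
            self.implicit_state = torch.zeros_like(recon_feat)
        x = torch.cat([recon_feat, self.implicit_state], dim=1)
        z = torch.sigmoid(self.update_gate(x))
        h = torch.tanh(self.candidate(x))
        # Older content decays gradually instead of being discarded outright.
        self.implicit_state = z * h + (1 - z) * self.implicit_state
        self.explicit_ref = recon_feat

    def read(self):
        """Return the short-term explicit reference and long-term implicit context."""
        return self.explicit_ref, self.implicit_state
```

Under this reading, `read()` would supply `short_ref_feat` and `long_ref_feat` to the two-stage compensation module above, so the two sketches compose into one decoding loop.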