Short- and Long-Term Aware Two-Stage Motion Compensation for Video Compression
Abstract
In hybrid video coding frameworks, inter-frame prediction is a crucial component for eliminating temporal redundancy and improving coding efficiency. Most existing methods rely solely on the previous frame as a reference: motion information between the reference and target frames is extracted via neural networks, encoded and transmitted, and then applied to the reference frame to produce an aligned frame. Because these methods depend only on short-term reference frames, they handle complex scenes such as occlusion poorly and fail to fully exploit high-quality reference frames over longer temporal ranges. Although some recent approaches attempt to incorporate long-term reference information, they often adopt simple stacking strategies or loss-driven implicit fusion mechanisms that lack targeted guidance for reference-frame utilization. To address these issues, this study proposes a Short- and Long-Term Aware Two-Stage Motion Compensation method for video compression. Specifically, we first estimate motion information from short-term reference frames to generate initially aligned features, establishing a basic temporal correspondence. We then extract prompt features from long-term reference frames and use the reconstructed reference content to guide detail enhancement of the initially aligned features, effectively alleviating occlusions and motion artifacts at low bitrate overhead. Furthermore, we propose an Explicit-Implicit Temporal Reference Buffer, in which short-term reference frames are modeled explicitly to preserve high-fidelity spatial details, and long-term reference frames are modeled implicitly to form a compact temporal representation. This mechanism provides stable contextual support for the second motion-compensation stage. Experimental results show that the proposed method achieves superior rate-distortion performance, in terms of peak signal-to-noise ratio and multi-scale structural similarity index measure, compared with the hybrid coding framework VTM-19.0, the latest end-to-end video coding method DCVC-RT, and the recent representative multi-reference-frame video coding method DCVC-SDD. Ablation studies further verify the effectiveness of the proposed Short- and Long-Term Aware Two-Stage Motion Compensation module and the Explicit-Implicit Temporal Reference Buffer.
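To make the two-stage pipeline described above concrete, the following PyTorch-style sketch shows one plausible realization: stage one estimates a dense flow from a short-term reference and warps its features, and stage two refines the warped features under guidance from long-term prompt features. All module names (`MotionEstimator`-style heads, `prompt_net`, `refine_net`), channel counts, and the simple convolutional designs are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of short- and long-term aware two-stage motion compensation.
# Network shapes and module designs are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp(feat, flow):
    """Backward-warp features (B, C, H, W) with a dense flow field (B, 2, H, W)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # absolute pixel coords
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((grid_x, grid_y), dim=-1),
                         align_corners=True)


class TwoStageMotionCompensation(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # Stage 1: estimate motion between the short-term reference and target.
        self.motion_net = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2, 3, padding=1),
        )
        # Stage 2: prompt features from long-term references guide refinement.
        self.prompt_net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        self.refine_net = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, target_feat, short_ref_feat, long_term_feat):
        # Stage 1: initial alignment establishes basic temporal correspondence.
        flow = self.motion_net(torch.cat([target_feat, short_ref_feat], dim=1))
        aligned = warp(short_ref_feat, flow)
        # Stage 2: long-term prompt features drive detail enhancement, e.g.
        # filling occluded regions the short-term reference cannot explain.
        prompt = self.prompt_net(long_term_feat)
        refined = aligned + self.refine_net(torch.cat([aligned, prompt], dim=1))
        return refined, flow
```

In this reading, only `flow` would need to be encoded and transmitted; the prompt-guided refinement operates on already-reconstructed reference content, which is how the second stage can improve the prediction with little extra bitrate.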
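The Explicit-Implicit Temporal Reference Buffer can likewise be sketched as below: recent reconstructed features are stored verbatim (explicit), while older references are folded into a single recurrent state (implicit). The GRU-like gated update, the buffer depth, and the class name `ExplicitImplicitBuffer` are assumptions, not the paper's exact design.

```python
# Illustrative sketch of an explicit-implicit temporal reference buffer.
# The gated update rule below is an assumed stand-in for the paper's design.
from collections import deque

import torch
import torch.nn as nn


class ExplicitImplicitBuffer(nn.Module):
    def __init__(self, ch=64, num_explicit=1):
        super().__init__()
        # Explicit side: the most recent reconstructed features, kept intact
        # to preserve high-fidelity spatial details.
        self.explicit = deque(maxlen=num_explicit)
        # Implicit side: a compact recurrent state summarizing older frames.
        self.update_gate = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.Sigmoid(),
        )
        self.candidate = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.Tanh(),
        )
        self.state = None

    def push(self, recon_feat):
        """Fold a newly reconstructed frame's features into the buffer."""
        if self.state is None:
            self.state = torch.zeros_like(recon_feat)
        x = torch.cat([recon_feat, self.state], dim=1)
        z = self.update_gate(x)  # per-pixel blend between old state and update
        self.state = (1 - z) * self.state + z * self.candidate(x)
        self.explicit.append(recon_feat)

    def read(self):
        """Return (explicit short-term reference, implicit long-term state).

        Assumes push() has been called at least once.
        """
        return self.explicit[-1], self.state
```

In a coding loop, `push` would run after each frame is reconstructed, and `read` would supply the short-term reference and long-term state consumed by the two-stage compensation module above.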