Institutional Digital Repository
Shreenivas Deshpande Library, IIT (BHU), Varanasi

Spatio-Temporal Attentive Fusion Unit for Effective Video Prediction

Abstract

In recent years, Recurrent Neural Networks (RNNs) have been extensively used for video frame prediction. Attention in recurrent networks improves memory utilization, yet to date only a few video frame prediction models have used attention to capture long-term motion information. Moreover, these methods fail to preserve structural consistency, resulting in blurry outputs and the fading of smaller objects. We propose a new RNN-based spatio-temporal prediction unit with attention, termed the Spatio-Temporal Attentive Fusion Unit (STAFU), which combines temporal motion information and spatial appearance information through a temporal attention unit and a spatial attention unit, respectively, to preserve long-term sequence information at high resolution. The outputs of the two attention units are then aggregated by a hybrid aggregation unit with a wide receptive field over both spatial and temporal features, yielding high-quality video prediction. These units are embedded within a GAN framework trained end-to-end. Our approach has been evaluated on three public datasets, namely Moving-MNIST, KTH-Action, and ETHZ. A comparative study with other recent models shows that, on average, our model performs better and more consistently than the others across several metrics, namely MSE, MAE, SSIM, PSNR, and LPIPS. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
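To make the fusion idea concrete, here is a minimal NumPy sketch of an attentive fusion step, assuming X: a temporal attention over past hidden states, a spatial attention over appearance features, and a learned fusion of the two. This is an illustrative toy, not the authors' STAFU implementation; the function names, the dot-product scoring, and the mean-pooled spatial query are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(query, memory):
    # query: (d,) current hidden state; memory: (T, d) past hidden states.
    # Attend over time to capture long-term motion information.
    scores = memory @ query / np.sqrt(query.size)        # (T,)
    weights = softmax(scores)
    return weights @ memory                              # (d,)

def spatial_attention(features):
    # features: (H*W, d) flattened spatial appearance features.
    # Use a mean-pooled query (an assumption of this sketch) to
    # weight locations and pool appearance information.
    query = features.mean(axis=0)                        # (d,)
    weights = softmax(features @ query / np.sqrt(features.shape[1]))
    return weights @ features                            # (d,)

def attentive_fusion_step(query, memory, features, w_fuse):
    # Concatenate the two attended summaries and mix them with a
    # learned projection, standing in for the hybrid aggregation unit.
    t = temporal_attention(query, memory)                # (d,)
    s = spatial_attention(features)                      # (d,)
    return np.tanh(w_fuse @ np.concatenate([t, s]))      # (d,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, T, hw = 8, 5, 16
    out = attentive_fusion_step(
        rng.standard_normal(d),           # current hidden state
        rng.standard_normal((T, d)),      # temporal memory
        rng.standard_normal((hw, d)),     # spatial feature map
        rng.standard_normal((d, 2 * d)),  # fusion weights
    )
    print(out.shape)
```

In the paper's full model, the fusion output would feed the next recurrent step inside a GAN-trained predictor; the sketch only shows the shape of the attend-then-aggregate computation.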
