TY - JOUR
T1 - S-HR-VQVAE
T2 - Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction
AU - Adiban, Mohammad
AU - Stefanov, Kalin
AU - Siniscalchi, Sabato Marco
AU - Salvi, Giampiero
N1 - Publisher Copyright: © 1999-2012 IEEE.
PY - 2025/1/27
Y1 - 2025/1/27
AB - We address the video prediction task by putting forth a novel model that combines (i) a novel hierarchical residual learning vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel autoregressive spatiotemporal predictive model (AST-PM). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capabilities of HR-VQVAE at modeling still images with a parsimonious representation, combined with the AST-PM's ability to handle spatiotemporal information, S-HR-VQVAE can better deal with major challenges in video prediction. These include learning spatiotemporal information, handling high dimensional data, combating blurry prediction, and implicit modeling of physical characteristics. Extensive experimental results on four challenging tasks, namely KTH Human Action, TrafficBJ, Human3.6M, and Kitti, demonstrate that our model compares favorably against state-of-the-art video prediction techniques both in quantitative and qualitative evaluations despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and AST-PM parameters.
KW - Autoregressive Modeling
KW - Hierarchical Modeling
KW - Video Prediction
UR - https://www.scopus.com/pages/publications/85216902108
U2 - 10.1109/TMM.2025.3535370
DO - 10.1109/TMM.2025.3535370
M3 - Article
AN - SCOPUS:85216902108
SN - 1520-9210
VL - 27
SP - 4321
EP - 4332
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -