A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction


Bibliographic Details
Published in: IEEE Access, Vol. 12, pp. 39589-39602
Main Authors: Mathai, Mareeta; Liu, Ying; Ling, Nam
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024

Summary: Video prediction is an essential vision task due to its wide applications in real-world scenarios. However, it is challenging due to the inherent uncertainty and complex spatiotemporal dynamics of video content. Several state-of-the-art deep learning methods achieve superior video prediction accuracy at the expense of huge computational cost, making them unsuitable for devices with limited memory and computational resources. In light of Green Artificial Intelligence (AI), more environmentally friendly deep learning solutions are desired to tackle the problem of large models and high computational cost. In this work, we propose a novel video prediction network, 3DTransLSTM, which adopts a hybrid transformer-long short-term memory (LSTM) structure to inherit the merits of both self-attention and recurrence. Three-dimensional (3D) depthwise separable convolutions are used in this hybrid structure to extract spatiotemporal features while enhancing model efficiency. We conducted experimental studies on four popular video prediction datasets. Compared to existing methods, the proposed 3DTransLSTM achieves competitive frame prediction accuracy with significantly reduced model size, trainable parameters, and computational complexity. Moreover, we demonstrate the generalization ability of the proposed model by testing it on a dataset completely unseen during training.
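The summary attributes the model's efficiency partly to 3D depthwise separable convolutions. A minimal sketch of why this factorization shrinks the weight count, comparing a standard 3D convolution against a depthwise-plus-pointwise pair; the channel and kernel sizes here are illustrative assumptions, not values from the paper:

```python
def conv3d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard 3D convolution (bias omitted)."""
    return c_in * c_out * k ** 3

def separable_conv3d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise 3D conv followed by a 1x1x1 pointwise conv."""
    depthwise = c_in * k ** 3   # one k*k*k filter per input channel
    pointwise = c_in * c_out    # 1x1x1 conv that mixes channels
    return depthwise + pointwise

# Assumed sizes for illustration only.
c_in, c_out, k = 64, 64, 3
standard = conv3d_params(c_in, c_out, k)             # 110592 weights
separable = separable_conv3d_params(c_in, c_out, k)  # 5824 weights
print(f"standard={standard}, separable={separable}, "
      f"ratio={standard / separable:.1f}x")
```

For these assumed sizes the separable form uses roughly 19x fewer weights, which is the general mechanism behind the reduced model size the abstract reports.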
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3375365