Translating video to language using adaptive spatiotemporal convolution feature representation with dynamic abstraction


Bibliographic Details
Main Authors: Min, Renqiang; Pu, Yunchen
Format: Patent
Language: English
Published: 30.07.2019

Summary: A system is provided for video captioning. The system includes a processor. The processor is configured to apply a three-dimensional Convolutional Neural Network (C3D) to image frames of a video sequence to obtain, for the video sequence, (i) intermediate feature representations across L convolutional layers and (ii) top-layer features. The processor is further configured to produce a first word of an output caption for the video sequence by applying the top-layer features to a Long Short-Term Memory (LSTM) network. The processor is further configured to produce subsequent words of the output caption by (i) dynamically performing spatiotemporal attention and layer attention using the intermediate feature representations to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the output caption, and a hidden state of the LSTM. The system further includes a display device for displaying the output caption to a user.
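The two-stage attention described in the summary (spatiotemporal attention over positions within each convolutional layer, followed by layer attention over the L layer summaries) can be sketched as follows. This is a minimal illustration, not the patented implementation: the projection matrices `W_s` and `W_l` and the bilinear scoring form are hypothetical stand-ins for whatever learned scoring functions the actual system uses.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def context_vector(features, h, W_s, W_l):
    """Form a context vector by attending first over spatiotemporal
    positions within each layer, then over the L layer summaries.

    features : list of L arrays, each of shape (positions, d) --
               intermediate C3D feature maps flattened over space/time.
    h        : (d_h,) current LSTM hidden state.
    W_s, W_l : hypothetical learned projections, shape (d, d_h),
               used to score positions and layers against h.
    """
    layer_summaries = []
    for f in features:
        # Spatiotemporal attention: score each position against h,
        # then take the attention-weighted sum of positions.
        scores = f @ W_s @ h                # (positions,)
        alpha = softmax(scores)
        layer_summaries.append(alpha @ f)   # (d,)
    S = np.stack(layer_summaries)           # (L, d)
    # Layer attention: weight the per-layer summaries against h.
    beta = softmax(S @ W_l @ h)             # (L,)
    return beta @ S                         # (d,) context vector

# Example with random features from L=3 layers, 8 positions each:
rng = np.random.default_rng(0)
feats = [rng.normal(size=(8, 16)) for _ in range(3)]
h = rng.normal(size=(10,))
W_s = rng.normal(size=(16, 10))
W_l = rng.normal(size=(16, 10))
c = context_vector(feats, h, W_s, W_l)
```

The resulting context vector `c` would then be fed, along with the previous word embedding and hidden state, into the LSTM step that emits the next caption word.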
Bibliography: Application Number: US201715794758