Video retrieval system using adaptive spatiotemporal convolution feature representation with dynamic abstraction for video to language translation



Bibliographic Details
Main Authors: Min, Renqiang; Pu, Yunchen
Format: Patent
Language: English
Published: 03.09.2019

Summary: A video retrieval system is provided that includes a set of servers configured to retrieve a video sequence from a database and forward it to a requesting device in response to a match between an input text and a caption for the video sequence. The servers are further configured to translate the video sequence into the caption by (A) applying a 3D convolutional network (C3D) to image frames of the video sequence to obtain (i) intermediate feature representations across L convolutional layers and (ii) top-layer features; (B) producing the first word of the caption by applying the top-layer features to a long short-term memory (LSTM) network; and (C) producing each subsequent word of the caption by (i) dynamically performing spatiotemporal attention and layer attention over the intermediate feature representations to form a context vector, and (ii) applying the LSTM to the context vector, the previous word of the caption, and the LSTM's hidden state.
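The context-vector step in (C)(i), and the caption match that triggers retrieval, can be illustrated with a minimal plain-Python sketch. The summary does not specify the scoring functions, so this sketch assumes dot-product scores between each feature vector and the LSTM hidden state, and assumes every layer's features have already been projected to the hidden-state dimension. Within each layer, softmax-normalized spatiotemporal weights pool the features at all locations into one summary vector; a second softmax over the L layer summaries (layer attention) then produces the context vector. The function names `context_vector` and `retrieve_by_caption`, and the normalized exact-match rule used for retrieval, are hypothetical illustrations, not the patented method.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(vectors: Sequence[Sequence[float]], weights: Sequence[float]) -> Vector:
    """Return sum_i weights[i] * vectors[i]."""
    out = [0.0] * len(vectors[0])
    for v, w in zip(vectors, weights):
        for i, x in enumerate(v):
            out[i] += w * x
    return out


def context_vector(
    layer_features: Sequence[Sequence[Sequence[float]]], hidden: Sequence[float]
) -> Tuple[Vector, List[float]]:
    """Hypothetical two-level attention.

    layer_features[l][n] is the feature vector at spatiotemporal location n of
    convolutional layer l, ASSUMED already projected to len(hidden) dimensions.
    Scoring is ASSUMED to be a dot product with the hidden state.
    Returns (context vector, layer-attention weights).
    """
    summaries = []
    for feats in layer_features:
        # Spatiotemporal attention: weight every location within this layer.
        alpha = softmax([dot(f, hidden) for f in feats])
        summaries.append(weighted_sum(feats, alpha))
    # Layer attention: weight the per-layer summaries against each other.
    beta = softmax([dot(s, hidden) for s in summaries])
    return weighted_sum(summaries, beta), beta


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an ASSUMED matching rule)."""
    return " ".join(text.lower().split())


def retrieve_by_caption(query: str, captions: Dict[str, str]) -> List[str]:
    """Return IDs of videos whose generated caption matches the input text."""
    q = normalize(query)
    return [vid for vid, cap in captions.items() if normalize(cap) == q]
```

In the described system, the resulting context vector, together with the previous word and the hidden state, would then be fed to the LSTM to emit the next caption word; that recurrent step is omitted here for brevity.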
Bibliography: Application Number: US201715794802