Video retrieval system using adaptive spatiotemporal convolution feature representation with dynamic abstraction for video to language translation
Format | Patent |
---|---|
Language | English |
Published | 03.09.2019 |
Summary: | A video retrieval system is provided that includes a set of servers configured to retrieve a video sequence from a database and forward it to a requesting device in response to a match between an input text and a caption for the video sequence. The servers are further configured to translate the video sequence into the caption by (A) applying a three-dimensional convolutional neural network (C3D) to image frames of the video sequence to obtain (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, (B) producing a first word of the caption by applying the top-layer features to a Long Short-Term Memory (LSTM) network, and (C) producing subsequent words of the caption by (i) dynamically performing spatiotemporal attention and layer attention over the intermediate feature representations to form a context vector, and (ii) applying the LSTM to the context vector, the previous word of the caption, and the hidden state of the LSTM. |
---|---|
Bibliography: | Application Number: US201715794802 |
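
Step (C)(i) of the summary can be illustrated with a small sketch of how a context vector might be formed from multi-layer features. This is not the patented implementation. It assumes that every layer's features have already been projected to the same dimension as the LSTM hidden state, and it uses plain dot-product scoring, which is an illustrative choice; the record does not specify a scoring function. The names `context_vector`, `layer_features`, and `hidden` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] as a list."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors))
            for d in range(dim)]


def context_vector(layer_features, hidden):
    """Form a context vector from per-layer features and the decoder state.

    layer_features[l] holds one feature vector per spatiotemporal location
    in convolutional layer l (assumed already projected to a common size).
    hidden is the current LSTM hidden state (same size, by assumption).
    """
    layer_summaries = []
    for locations in layer_features:
        # Spatiotemporal attention: weight the locations within one layer.
        alpha = softmax([dot(feature, hidden) for feature in locations])
        layer_summaries.append(weighted_sum(alpha, locations))

    # Layer attention: weight the L per-layer summaries against each other.
    beta = softmax([dot(summary, hidden) for summary in layer_summaries])
    return weighted_sum(beta, layer_summaries), beta


# Toy input: L = 2 layers with 2 and 3 locations, feature dimension 2.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]],
]
hidden = [1.0, 0.0]
context, beta = context_vector(features, hidden)
```

In a full decoder, `context` would be passed to the LSTM together with the previous word and the hidden state, as step (C)(ii) describes; the new hidden state would then drive the attention for the next word, which is what makes the abstraction level dynamic per word.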