Translating video to language using adaptive spatiotemporal convolution feature representation with dynamic abstraction


Bibliographic Details
Main Authors: Min, Renqiang; Pu, Yunchen
Format: Patent
Language: English
Published: 30.07.2019

Summary: A system is provided for video captioning. The system includes a processor. The processor is configured to apply a three-dimensional Convolutional Neural Network (C3D) to image frames of a video sequence to obtain, for the video sequence, (i) intermediate feature representations across L convolutional layers and (ii) top-layer features. The processor is further configured to produce a first word of an output caption for the video sequence by applying the top-layer features to a Long Short-Term Memory (LSTM) network. The processor is further configured to produce subsequent words of the output caption by (i) dynamically performing spatiotemporal attention and layer attention using the intermediate feature representations to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the output caption, and a hidden state of the LSTM. The system further includes a display device for displaying the output caption to a user.
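The two-stage attention described in the summary (spatiotemporal attention over positions within each convolutional layer, followed by layer attention over the L layer summaries) can be sketched as follows. This is a minimal illustration, not the patented implementation: the projection matrices `W_s` and `W_l` and the bilinear scoring form are hypothetical stand-ins for whatever learned scoring functions the actual system uses.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def context_vector(features, h, W_s, W_l):
    """Form a context vector by attending first over spatiotemporal
    positions within each layer, then over the L layer summaries.

    features : list of L arrays, each of shape (positions, d) --
               intermediate C3D feature maps flattened over space/time.
    h        : (d_h,) current LSTM hidden state.
    W_s, W_l : hypothetical learned projections, shape (d, d_h),
               used to score positions and layers against h.
    """
    layer_summaries = []
    for f in features:
        # Spatiotemporal attention: score each position against h,
        # then take the attention-weighted sum of positions.
        scores = f @ W_s @ h                # (positions,)
        alpha = softmax(scores)
        layer_summaries.append(alpha @ f)   # (d,)
    S = np.stack(layer_summaries)           # (L, d)
    # Layer attention: weight the per-layer summaries against h.
    beta = softmax(S @ W_l @ h)             # (L,)
    return beta @ S                         # (d,) context vector

# Example with random features from L=3 layers, 8 positions each:
rng = np.random.default_rng(0)
feats = [rng.normal(size=(8, 16)) for _ in range(3)]
h = rng.normal(size=(10,))
W_s = rng.normal(size=(16, 10))
W_l = rng.normal(size=(16, 10))
c = context_vector(feats, h, W_s, W_l)
```

The resulting context vector `c` would then be fed, along with the previous word embedding and hidden state, into the LSTM step that emits the next caption word.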
Bibliography: Application Number: US201715794758