MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

Bibliographic Details
Published in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5288-5296
Main Authors: Jun Xu, Tao Mei, Ting Yao, Yong Rui
Format: Conference Proceeding
Language: English
Published: IEEE, June 2016
Subjects: Benchmark testing; Computer vision; Motion pictures; Recurrent neural networks; Visualization; Vocabulary
Discipline: Applied Sciences; Computer Science
Online Access: https://ieeexplore.ieee.org/document/7780940
ISSN: 1063-6919
EISBN: 9781467388511; 1467388513
DOI: 10.1109/CVPR.2016.571

Abstract: While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content. In this paper we present MSR-VTT (standing for "MSR Video to Text"), a new large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips totaling 41.2 hours and 200K clip-sentence pairs, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentences and vocabulary. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers. We present a detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches. We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on MSR-VTT.
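The dataset numbers above fit together: 257 queries × 118 videos per query yields roughly 30K candidate videos, from which the 10K clips were curated, and 10K clips × ~20 sentences each gives the 200K clip-sentence pairs. Below is a minimal sketch of verifying these statistics, assuming the commonly redistributed MSR-VTT annotation JSON with "videos" and "sentences" arrays; the file name and field names are assumptions, not confirmed by this record:

    import json
    from collections import Counter

    # Annotation file and schema are assumptions: the commonly
    # redistributed MSR-VTT JSON has "videos" (one entry per clip)
    # and "sentences" (one entry per human caption, keyed by video_id).
    with open("train_val_videodatainfo.json", encoding="utf-8") as f:
        data = json.load(f)

    videos = data["videos"]
    sentences = data["sentences"]

    # Expect ~10K clips and ~200K pairs, i.e. ~20 captions per clip.
    per_clip = Counter(s["video_id"] for s in sentences)
    print(f"clips: {len(videos)}")
    print(f"clip-sentence pairs: {len(sentences)}")
    print(f"avg captions/clip: {len(sentences) / len(videos):.1f}")
    print(f"min/max captions/clip: {min(per_clip.values())}/{max(per_clip.values())}")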
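The best-performing baseline in the paper's evaluation combines single-frame and motion representations with soft-attention pooling over time. The sketch below illustrates the general technique only; the dot-product scorer is a simplification for brevity, and the paper's exact attention mechanism and feature choices are not specified in this record:

    import numpy as np

    def soft_attention_pool(frame_feats: np.ndarray, query: np.ndarray) -> np.ndarray:
        # frame_feats: (T, D) per-frame (or per-segment) features.
        # query: (D,) decoder hidden state at the current generation step.
        scores = frame_feats @ query           # (T,) relevance score per frame
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()               # softmax over the T frames
        return weights @ frame_feats           # (D,) attention-pooled feature

    # Toy usage: 40 frames of 512-d features, one decoder state.
    rng = np.random.default_rng(0)
    pooled = soft_attention_pool(rng.normal(size=(40, 512)),
                                 rng.normal(size=512))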
Authors:
– Jun Xu (v-junfu@microsoft.com), Microsoft Research, Beijing, China
– Tao Mei (tmei@microsoft.com), Microsoft Research, Beijing, China
– Ting Yao (tiyao@microsoft.com), Microsoft Research, Beijing, China
– Yong Rui (yongrui@microsoft.com), Microsoft Research, Beijing, China