MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

Bibliographic Details
Published in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5288-5296
Main Authors: Jun Xu, Tao Mei, Ting Yao, Yong Rui
Format: Conference Proceeding
Language: English
Published: IEEE, June 2016
Subjects: Benchmark testing; Computer vision; Motion pictures; Recurrent neural networks; Visualization; Vocabulary
Discipline: Applied Sciences; Computer Science
Online Access: https://ieeexplore.ieee.org/document/7780940
ISSN: 1063-6919
EISBN: 9781467388511; 1467388513
DOI: 10.1109/CVPR.2016.571

Abstract: While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content. In this paper we present MSR-VTT (standing for "MSR Video to Text"), a new large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips totaling 41.2 hours and 200K clip-sentence pairs, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentences and vocabulary. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers. We present a detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches. We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on MSR-VTT.
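The dataset numbers above fit together: 257 queries × 118 videos per query yields roughly 30K candidate videos, from which the 10K clips were curated, and 10K clips × ~20 sentences each gives the 200K clip-sentence pairs. Below is a minimal sketch of verifying these statistics, assuming the commonly redistributed MSR-VTT annotation JSON with "videos" and "sentences" arrays; the file name and field names are assumptions, not confirmed by this record:

    import json
    from collections import Counter

    # Annotation file and schema are assumptions: the commonly
    # redistributed MSR-VTT JSON has "videos" (one entry per clip)
    # and "sentences" (one entry per human caption, keyed by video_id).
    with open("train_val_videodatainfo.json", encoding="utf-8") as f:
        data = json.load(f)

    videos = data["videos"]
    sentences = data["sentences"]

    # Expect ~10K clips and ~200K pairs, i.e. ~20 captions per clip.
    per_clip = Counter(s["video_id"] for s in sentences)
    print(f"clips: {len(videos)}")
    print(f"clip-sentence pairs: {len(sentences)}")
    print(f"avg captions/clip: {len(sentences) / len(videos):.1f}")
    print(f"min/max captions/clip: {min(per_clip.values())}/{max(per_clip.values())}")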
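The best-performing baseline in the paper's evaluation combines single-frame and motion representations with soft-attention pooling over time. The sketch below illustrates the general technique only; the dot-product scorer is a simplification for brevity, and the paper's exact attention mechanism and feature choices are not specified in this record:

    import numpy as np

    def soft_attention_pool(frame_feats: np.ndarray, query: np.ndarray) -> np.ndarray:
        # frame_feats: (T, D) per-frame (or per-segment) features.
        # query: (D,) decoder hidden state at the current generation step.
        scores = frame_feats @ query           # (T,) relevance score per frame
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()               # softmax over the T frames
        return weights @ frame_feats           # (D,) attention-pooled feature

    # Toy usage: 40 frames of 512-d features, one decoder state.
    rng = np.random.default_rng(0)
    pooled = soft_attention_pool(rng.normal(size=(40, 512)),
                                 rng.normal(size=512))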
Authors:
– Jun Xu (v-junfu@microsoft.com), Microsoft Research, Beijing, China
– Tao Mei (tmei@microsoft.com), Microsoft Research, Beijing, China
– Ting Yao (tiyao@microsoft.com), Microsoft Research, Beijing, China
– Yong Rui (yongrui@microsoft.com), Microsoft Research, Beijing, China