MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
Published in | 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5288–5296 |
Main Authors | Jun Xu, Tao Mei, Ting Yao, Yong Rui |
Format | Conference Proceeding |
Language | English |
Publisher | IEEE |
Published | 01.06.2016 |
ISSN | 1063-6919 |
DOI | 10.1109/CVPR.2016.571 |
Abstract | While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos, and of their associated language, that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content. In this paper we present MSR-VTT (standing for "MSR Video to Text"), a new large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips totaling 41.2 hours and 200K clip-sentence pairs, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentences and vocabulary. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers. We present a detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches. We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on MSR-VTT. |
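The abstract describes MSR-VTT as a one-to-many corpus of clip-sentence pairs: 10K clips, about 20 sentences per clip, 200K pairs in total. A minimal sketch of handling that structure, assuming a hypothetical flat list of (video_id, caption) pairs rather than the official annotation format:

```python
# Sketch of the clip-sentence pair structure described in the abstract.
# The sample data and layout below are hypothetical illustrations,
# not the official MSR-VTT annotation schema.
from collections import defaultdict

# Hypothetical sample of (video_id, caption) pairs.
pairs = [
    ("video0", "a man is singing on stage"),
    ("video0", "a person performs a song"),
    ("video1", "a chef chops vegetables"),
]

# Group captions by clip, mirroring the dataset's one-to-many annotation
# (in the full corpus each clip would carry about 20 such sentences).
captions = defaultdict(list)
for vid, sent in pairs:
    captions[vid].append(sent)

print(len(captions))            # distinct clips in the sample
print(len(captions["video0"]))  # captions attached to one clip
```

With the real annotations, the same grouping would yield 10K keys and an average list length of about 20, recovering the 200K-pair total quoted in the abstract.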
Author | Jun Xu (v-junfu@microsoft.com), Tao Mei (tmei@microsoft.com), Ting Yao (tiyao@microsoft.com), Yong Rui (yongrui@microsoft.com); all at Microsoft Research, Beijing, China |
CODEN | IEEPAD |
Discipline | Applied Sciences; Computer Science |
EISBN | 9781467388511, 1467388513 |
EISSN | 1063-6919 |
ExternalDocumentID | 7780940 |
Genre | orig-research |
SubjectTerms | Benchmark testing; Computer vision; Motion pictures; Recurrent neural networks; Visualization; Vocabulary |
URI | https://ieeexplore.ieee.org/document/7780940 |