Expanding Language-Image Pretrained Models for General Video Recognition
Published in | Computer Vision – ECCV 2022, pp. 1–18 |
---|---|
Main Authors | Ni, Bolin; Peng, Houwen; Chen, Minghao; Zhang, Songyang; Meng, Gaofeng; Fu, Jianlong; Xiang, Shiming; Ling, Haibin |
Format | Book Chapter |
Language | English |
Published | Cham: Springer Nature Switzerland, 28.10.2022 |
Series | Lecture Notes in Computer Science |
Subjects | Contrastive language-image pretraining; Video recognition |
Abstract | Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable “zero-shot” generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. This module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12× fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are publicly available. |
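The cross-frame attention described in the abstract can be sketched as a single self-attention step over per-frame embeddings, letting each frame mix information from all other frames along the temporal axis. This is an illustrative simplification under stated assumptions, not the paper's exact module: the function name, embedding sizes, and single-head scaled-dot-product formulation are chosen for the sketch only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(frame_tokens):
    """Exchange information across T per-frame embeddings of shape (T, d)
    via one scaled-dot-product self-attention step over the temporal axis."""
    T, d = frame_tokens.shape
    scores = frame_tokens @ frame_tokens.T / np.sqrt(d)  # (T, T) frame-to-frame affinities
    weights = softmax(scores, axis=-1)                   # rows sum to 1
    return weights @ frame_tokens                        # each frame is a mixture of all frames

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))   # 8 frames, 16-dim embeddings per frame
mixed = cross_frame_attention(frames)
print(mixed.shape)  # (8, 16)
```

Because the output has the same shape as the input, such a temporal-mixing step can be inserted between the frozen per-frame layers of a pretrained image encoder, which matches the abstract's claim that the module "can be plugged into pretrained language-image models seamlessly".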
---|---|
Author | Xiang, Shiming Ling, Haibin Zhang, Songyang Peng, Houwen Fu, Jianlong Ni, Bolin Chen, Minghao Meng, Gaofeng |
Author_xml | – sequence: 1 givenname: Bolin surname: Ni fullname: Ni, Bolin – sequence: 2 givenname: Houwen surname: Peng fullname: Peng, Houwen email: houwen.peng@microsoft.com – sequence: 3 givenname: Minghao surname: Chen fullname: Chen, Minghao – sequence: 4 givenname: Songyang surname: Zhang fullname: Zhang, Songyang – sequence: 5 givenname: Gaofeng surname: Meng fullname: Meng, Gaofeng email: gfmeng@nlpr.ia.ac.cn – sequence: 6 givenname: Jianlong surname: Fu fullname: Fu, Jianlong – sequence: 7 givenname: Shiming surname: Xiang fullname: Xiang, Shiming – sequence: 8 givenname: Haibin surname: Ling fullname: Ling, Haibin |
ContentType | Book Chapter |
Copyright | The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 |
Copyright_xml | – notice: The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 |
DOI | 10.1007/978-3-031-19772-7_1 |
Discipline | Applied Sciences Computer Science |
EISBN | 9783031197727 3031197720 |
EISSN | 1611-3349 |
Editor | Farinella, Giovanni Maria Avidan, Shai Cissé, Moustapha Brostow, Gabriel Hassner, Tal |
Editor_xml | – sequence: 1 givenname: Shai surname: Avidan fullname: Avidan, Shai email: avidan@eng.tau.ac.il – sequence: 2 givenname: Gabriel orcidid: 0000-0001-8472-3828 surname: Brostow fullname: Brostow, Gabriel email: g.brostow@cs.ucl.ac.uk – sequence: 3 givenname: Moustapha surname: Cissé fullname: Cissé, Moustapha email: moustaphacisse@google.com – sequence: 4 givenname: Giovanni Maria orcidid: 0000-0002-6034-0432 surname: Farinella fullname: Farinella, Giovanni Maria email: gfarinella@dmi.unict.it – sequence: 5 givenname: Tal orcidid: 0000-0003-2275-1406 surname: Hassner fullname: Hassner, Tal email: talhassner@gmail.com |
EndPage | 18 |
ISBN | 9783031197710 3031197712 |
ISSN | 0302-9743 |
IsPeerReviewed | true |
IsScholarly | true |
Language | English |
Notes | Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-19772-7_1. B. Ni and M. Chen: Work done during internship at Microsoft Research. |
PageCount | 18 |
PublicationCentury | 2000 |
PublicationDate | 20221028 |
PublicationDateYYYYMMDD | 2022-10-28 |
PublicationDate_xml | – month: 10 year: 2022 text: 20221028 day: 28 |
PublicationDecade | 2020 |
PublicationPlace | Cham |
PublicationPlace_xml | – name: Cham |
PublicationSeriesTitle | Lecture Notes in Computer Science |
PublicationSeriesTitleAlternate | Lect.Notes Computer |
PublicationSubtitle | 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV |
PublicationTitle | Computer Vision – ECCV 2022 |
PublicationYear | 2022 |
Publisher | Springer Nature Switzerland |
Publisher_xml | – name: Springer Nature Switzerland |
RelatedPersons | Hartmanis, Juris Gao, Wen Steffen, Bernhard Bertino, Elisa Goos, Gerhard Yung, Moti |
RelatedPersons_xml | – sequence: 1 givenname: Gerhard surname: Goos fullname: Goos, Gerhard – sequence: 2 givenname: Juris surname: Hartmanis fullname: Hartmanis, Juris – sequence: 3 givenname: Elisa surname: Bertino fullname: Bertino, Elisa – sequence: 4 givenname: Wen surname: Gao fullname: Gao, Wen – sequence: 5 givenname: Bernhard orcidid: 0000-0001-9619-1558 surname: Steffen fullname: Steffen, Bernhard – sequence: 6 givenname: Moti orcidid: 0000-0003-0848-0873 surname: Yung fullname: Yung, Moti |
SourceID | springer |
SourceType | Publisher |
StartPage | 1 |
SubjectTerms | Contrastive language-image pretraining Video recognition |
Title | Expanding Language-Image Pretrained Models for General Video Recognition |
URI | http://link.springer.com/10.1007/978-3-031-19772-7_1 |