Expanding Language-Image Pretrained Models for General Video Recognition

Bibliographic Details
Published in: Computer Vision – ECCV 2022, pp. 1–18
Main Authors: Ni, Bolin; Peng, Houwen; Chen, Minghao; Zhang, Songyang; Meng, Gaofeng; Fu, Jianlong; Xiang, Shiming; Ling, Haibin
Format: Book Chapter
Language: English
Published: Cham: Springer Nature Switzerland, 28.10.2022
Series: Lecture Notes in Computer Science

Abstract: Contrastive language-image pretraining has shown great success in learning visual-textual joint representations from web-scale data, demonstrating remarkable “zero-shot” generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12× fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are publicly available.
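The cross-frame attention mechanism described in the abstract can be pictured as ordinary self-attention applied across per-frame summary tokens rather than within a single image. The sketch below is a minimal NumPy illustration under that reading; the function name, the single-head formulation, and the random weights are assumptions made for illustration, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(frame_tokens, w_q, w_k, w_v):
    """Single-head attention across the temporal axis.

    frame_tokens: (T, D) array, one summary embedding per frame.
    Returns a (T, D) array in which each frame's embedding has been
    updated with information from every other frame (residual form).
    """
    q, k, v = frame_tokens @ w_q, frame_tokens @ w_k, frame_tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (T, T) frame-to-frame affinities
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return frame_tokens + attn @ v            # lightweight residual update

rng = np.random.default_rng(0)
T, D = 8, 64                                  # 8 frames, 64-dim embeddings (illustrative)
tokens = rng.standard_normal((T, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
out = cross_frame_attention(tokens, w_q, w_k, w_v)
print(out.shape)  # (8, 64)
```

The residual form mirrors the abstract's claim that the module is lightweight and can be plugged into a pretrained language-image model without disturbing its per-frame representations.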
Authors:
1. Ni, Bolin
2. Peng, Houwen (houwen.peng@microsoft.com)
3. Chen, Minghao
4. Zhang, Songyang
5. Meng, Gaofeng (gfmeng@nlpr.ia.ac.cn)
6. Fu, Jianlong
7. Xiang, Shiming
8. Ling, Haibin
Copyright: The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
DOI: 10.1007/978-3-031-19772-7_1
Discipline: Applied Sciences; Computer Science
eISBN: 9783031197727, 3031197720
eISSN: 1611-3349
Editors:
1. Avidan, Shai (avidan@eng.tau.ac.il)
2. Brostow, Gabriel (g.brostow@cs.ucl.ac.uk; ORCID 0000-0001-8472-3828)
3. Cissé, Moustapha (moustaphacisse@google.com)
4. Farinella, Giovanni Maria (gfarinella@dmi.unict.it; ORCID 0000-0002-6034-0432)
5. Hassner, Tal (talhassner@gmail.com; ORCID 0000-0003-2275-1406)
End Page: 18
ISBN: 9783031197710, 3031197712
ISSN: 0302-9743
Peer Reviewed: true
Scholarly: true
Notes: Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-19772-7_1.
B. Ni and M. Chen—Work done during internship at Microsoft Research.
Page Count: 18
Publication Date: 2022-10-28
Publication Place: Cham
Publication Series: Lecture Notes in Computer Science
Publication Subtitle: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV
Publication Title: Computer Vision – ECCV 2022
Publication Year: 2022
Publisher: Springer Nature Switzerland
Related Persons:
1. Goos, Gerhard
2. Hartmanis, Juris
3. Bertino, Elisa
4. Gao, Wen
5. Steffen, Bernhard (ORCID 0000-0001-9619-1558)
6. Yung, Moti (ORCID 0000-0003-0848-0873)
Start Page: 1
Subject Terms: Contrastive language-image pretraining; Video recognition
URI: http://link.springer.com/10.1007/978-3-031-19772-7_1