Expanding Language-Image Pretrained Models for General Video Recognition

Bibliographic Details
Published in: Computer Vision – ECCV 2022, pp. 1–18
Main Authors: Ni, Bolin; Peng, Houwen; Chen, Minghao; Zhang, Songyang; Meng, Gaofeng; Fu, Jianlong; Xiang, Shiming; Ling, Haibin
Format: Book Chapter
Language: English
Published: Cham: Springer Nature Switzerland, 28.10.2022
Series: Lecture Notes in Computer Science

Abstract: Contrastive language-image pretraining has shown great success in learning visual-textual joint representations from web-scale data, demonstrating remarkable “zero-shot” generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12× fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are publicly available.
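The cross-frame attention mechanism described in the abstract can be pictured as ordinary self-attention applied across per-frame summary tokens rather than within a single image. The sketch below is a minimal NumPy illustration under that reading; the function name, the single-head formulation, and the random weights are assumptions made for illustration, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(frame_tokens, w_q, w_k, w_v):
    """Single-head attention across the temporal axis.

    frame_tokens: (T, D) array, one summary embedding per frame.
    Returns a (T, D) array in which each frame's embedding has been
    updated with information from every other frame (residual form).
    """
    q, k, v = frame_tokens @ w_q, frame_tokens @ w_k, frame_tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (T, T) frame-to-frame affinities
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return frame_tokens + attn @ v            # lightweight residual update

rng = np.random.default_rng(0)
T, D = 8, 64                                  # 8 frames, 64-dim embeddings (illustrative)
tokens = rng.standard_normal((T, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
out = cross_frame_attention(tokens, w_q, w_k, w_v)
print(out.shape)  # (8, 64)
```

The residual form mirrors the abstract's claim that the module is lightweight and can be plugged into a pretrained language-image model without disturbing its per-frame representations.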
Authors:
1. Ni, Bolin
2. Peng, Houwen (houwen.peng@microsoft.com)
3. Chen, Minghao
4. Zhang, Songyang
5. Meng, Gaofeng (gfmeng@nlpr.ia.ac.cn)
6. Fu, Jianlong
7. Xiang, Shiming
8. Ling, Haibin
Copyright: The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
DOI: 10.1007/978-3-031-19772-7_1
Discipline: Applied Sciences; Computer Science
eISBN: 9783031197727, 3031197720
eISSN: 1611-3349
Editors:
1. Avidan, Shai (avidan@eng.tau.ac.il)
2. Brostow, Gabriel (g.brostow@cs.ucl.ac.uk; ORCID 0000-0001-8472-3828)
3. Cissé, Moustapha (moustaphacisse@google.com)
4. Farinella, Giovanni Maria (gfarinella@dmi.unict.it; ORCID 0000-0002-6034-0432)
5. Hassner, Tal (talhassner@gmail.com; ORCID 0000-0003-2275-1406)
End Page: 18
ISBN: 9783031197710, 3031197712
ISSN: 0302-9743
Peer Reviewed: true
Scholarly: true
Notes: Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-19772-7_1.
B. Ni and M. Chen—Work done during internship at Microsoft Research.
Page Count: 18
Publication Date: 2022-10-28
Publication Place: Cham
Publication Series: Lecture Notes in Computer Science
Publication Subtitle: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV
Publication Title: Computer Vision – ECCV 2022
Publication Year: 2022
Publisher: Springer Nature Switzerland
Related Persons:
1. Goos, Gerhard
2. Hartmanis, Juris
3. Bertino, Elisa
4. Gao, Wen
5. Steffen, Bernhard (ORCID 0000-0001-9619-1558)
6. Yung, Moti (ORCID 0000-0003-0848-0873)
Start Page: 1
Subject Terms: Contrastive language-image pretraining; Video recognition
URI: http://link.springer.com/10.1007/978-3-031-19772-7_1