Cross-Modal Contrastive Learning Network for Few-Shot Action Recognition

Bibliographic Details
Published in IEEE Transactions on Image Processing, Vol. 33, pp. 1257-1271
Main Authors Wang, Xiao; Yan, Yan; Hu, Hai-Miao; Li, Bo; Wang, Hanzi
Format Journal Article
Language English
Published United States: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2024
Abstract Few-shot action recognition aims to recognize new, unseen categories from only a few labeled samples per class. However, it still suffers from inadequate data, which easily leads to overfitting and poor generalization. Therefore, we propose a cross-modal contrastive learning network (CCLN), consisting of an adversarial branch and a contrastive branch, for effective few-shot action recognition. In the adversarial branch, we design a prototypical generative adversarial network (PGAN) that synthesizes samples to enlarge the training set, which mitigates the data scarcity problem and thereby alleviates overfitting. When training samples are limited, the obtained visual features are usually suboptimal for video understanding because they lack discriminative information. To address this issue, in the contrastive branch, we propose a cross-modal contrastive learning module (CCLM) that exploits semantic information to obtain discriminative feature representations, enabling the network to enhance its feature learning ability at the class level. Moreover, since videos contain crucial sequence and ordering information, we introduce a spatial-temporal enhancement module (SEM) to model the spatial context within video frames and the temporal context across video frames. Experimental results show that the proposed CCLN outperforms state-of-the-art few-shot action recognition methods on four challenging benchmarks: Kinetics, UCF101, HMDB51, and SSv2.
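The cross-modal contrastive learning module (CCLM) described above contrasts visual representations against class-level semantic information. The record gives no formulas, so the sketch below is a minimal, assumed illustration in PyTorch: the function name, the symmetric InfoNCE-style formulation, the temperature value, and the feature sizes are hypothetical and are not the authors' actual implementation.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(visual_feats, semantic_feats, temperature=0.07):
    # Hypothetical class-level cross-modal contrastive loss (not from the paper).
    # visual_feats:   (C, D) one visual prototype per class in the episode
    # semantic_feats: (C, D) semantic (e.g., label-text) embedding per class
    # The i-th visual prototype and the i-th semantic embedding form the positive
    # pair; every other pairing acts as a negative.
    v = F.normalize(visual_feats, dim=-1)
    s = F.normalize(semantic_feats, dim=-1)

    # Cosine-similarity logits between every visual/semantic pair.
    logits = v @ s.t() / temperature                    # shape (C, C)
    targets = torch.arange(v.size(0), device=v.device)  # diagonal entries are positives

    # Symmetric InfoNCE: visual-to-semantic and semantic-to-visual directions.
    loss_v2s = F.cross_entropy(logits, targets)
    loss_s2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2s + loss_s2v)

# Example: a 5-way episode with 512-dimensional features (assumed sizes).
loss = cross_modal_contrastive_loss(torch.randn(5, 512), torch.randn(5, 512))

Pulling each class's visual prototype toward its own semantic embedding while pushing it away from the other classes' embeddings is one common way to inject class-level semantic discrimination into limited visual data, which is the role the abstract attributes to the CCLM.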
Author Li, Bo
Yan, Yan
Wang, Hanzi
Wang, Xiao
Hu, Hai-Miao
Author_xml – sequence: 1
  givenname: Xiao
  surname: Wang
  fullname: Wang, Xiao
  email: xiaowang@stu.xmu.edu.cn
  organization: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China
– sequence: 2
  givenname: Yan
  orcidid: 0000-0002-3674-7160
  surname: Yan
  fullname: Yan, Yan
  email: yanyan@xmu.edu.cn
  organization: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China
– sequence: 3
  givenname: Hai-Miao
  orcidid: 0000-0001-6811-9209
  surname: Hu
  fullname: Hu, Hai-Miao
  email: frank0139@163.com
  organization: School of Computer Science and Engineering, Beihang University, Beijing, China
– sequence: 4
  givenname: Bo
  orcidid: 0000-0001-5980-4861
  surname: Li
  fullname: Li, Bo
  email: boli@buaa.edu.cn
  organization: School of Computer Science and Engineering, Beihang University, Beijing, China
– sequence: 5
  givenname: Hanzi
  orcidid: 0000-0002-6913-9786
  surname: Wang
  fullname: Wang, Hanzi
  email: hanzi.wang@xmu.edu.cn
  organization: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China
BackLink https://www.ncbi.nlm.nih.gov/pubmed/38252570 (View this record in MEDLINE/PubMed)
CODEN IIPRE4
CitedBy_id crossref_primary_10_1109_OJCS_2024_3406645
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024
DOI 10.1109/TIP.2024.3354104
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005-present
IEEE All-Society Periodicals Package (ASPP) Online
IEEE Electronic Library (IEL)
PubMed
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
MEDLINE - Academic
DatabaseTitle PubMed
CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
MEDLINE - Academic
DatabaseTitleList PubMed; Technology Research Database; MEDLINE - Academic
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: RIE
  name: IEEE Xplore
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
Engineering
EISSN 1941-0042
EndPage 1271
ExternalDocumentID 10_1109_TIP_2024_3354104
38252570
10411850
Genre orig-research
Journal Article
GrantInformation_xml – fundername: National Natural Science Foundation of China
  grantid: U21A20514; 62122011; 62372388; 62071404
  funderid: 10.13039/501100001809
– fundername: National Key Research and Development Program of China
  grantid: 2022ZD0160402
  funderid: 10.13039/501100012166
– fundername: Fuxiaquan National Independent Innovation Demonstration Zone Collaborative Innovation Platform Project
  grantid: 3502ZCQXT2022008
ISSN 1057-7149
IsPeerReviewed true
IsScholarly true
Language English
LinkModel DirectLink
ORCID 0000-0002-3674-7160
0000-0001-5980-4861
0000-0002-6913-9786
0000-0001-6811-9209
PMID 38252570
PQID 2926267309
PQPubID 85429
PageCount 15
ParticipantIDs proquest_miscellaneous_2917866507
proquest_journals_2926267309
crossref_primary_10_1109_TIP_2024_3354104
ieee_primary_10411850
pubmed_primary_38252570
PublicationCentury 2000
PublicationDate 2024
PublicationDateYYYYMMDD 2024-01-01
PublicationDate_xml – year: 2024
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
– name: New York
PublicationTitle IEEE transactions on image processing
PublicationTitleAbbrev TIP
PublicationTitleAlternate IEEE Trans Image Process
PublicationYear 2024
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
SSID ssj0014516
SourceID proquest
crossref
pubmed
ieee
SourceType Aggregation Database
Index Database
Publisher
StartPage 1257
SubjectTerms action recognition
Activity recognition
Context
contrastive learning
Feature extraction
Few-shot learning
Frames (data processing)
Generative adversarial networks
Image recognition
Machine learning
meta-learning
Modules
Self-supervised learning
Semantics
Task analysis
Three-dimensional displays
video understanding
Visual discrimination
Visualization
Title Cross-Modal Contrastive Learning Network for Few-Shot Action Recognition
URI https://ieeexplore.ieee.org/document/10411850
https://www.ncbi.nlm.nih.gov/pubmed/38252570
https://www.proquest.com/docview/2926267309/abstract/
https://search.proquest.com/docview/2917866507
Volume 33