Cross-Modal Contrastive Learning Network for Few-Shot Action Recognition

Bibliographic Details
Published in IEEE Transactions on Image Processing, Vol. 33, pp. 1257-1271
Main Authors Wang, Xiao; Yan, Yan; Hu, Hai-Miao; Li, Bo; Wang, Hanzi
Format Journal Article
Language English
Published United States: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2024
Abstract Few-shot action recognition aims to recognize new, unseen categories from only a few labeled samples per class. However, it still suffers from inadequate data, which easily leads to overfitting and poor generalization. Therefore, we propose a cross-modal contrastive learning network (CCLN), consisting of an adversarial branch and a contrastive branch, for effective few-shot action recognition. In the adversarial branch, we design a prototypical generative adversarial network (PGAN) that synthesizes samples to enlarge the training set, which mitigates the data scarcity problem and thereby alleviates overfitting. When training samples are limited, the obtained visual features are usually suboptimal for video understanding because they lack discriminative information. To address this issue, in the contrastive branch, we propose a cross-modal contrastive learning module (CCLM) that exploits semantic information to obtain discriminative feature representations, enabling the network to enhance its feature learning ability at the class level. Moreover, since videos contain crucial sequence and ordering information, we introduce a spatial-temporal enhancement module (SEM) to model the spatial context within video frames and the temporal context across video frames. Experimental results show that the proposed CCLN outperforms state-of-the-art few-shot action recognition methods on four challenging benchmarks: Kinetics, UCF101, HMDB51, and SSv2.
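The cross-modal contrastive learning module (CCLM) described above contrasts visual representations against class-level semantic information. The record gives no formulas, so the sketch below is a minimal, assumed illustration in PyTorch: the function name, the symmetric InfoNCE-style formulation, the temperature value, and the feature sizes are hypothetical and are not the authors' actual implementation.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(visual_feats, semantic_feats, temperature=0.07):
    # Hypothetical class-level cross-modal contrastive loss (not from the paper).
    # visual_feats:   (C, D) one visual prototype per class in the episode
    # semantic_feats: (C, D) semantic (e.g., label-text) embedding per class
    # The i-th visual prototype and the i-th semantic embedding form the positive
    # pair; every other pairing acts as a negative.
    v = F.normalize(visual_feats, dim=-1)
    s = F.normalize(semantic_feats, dim=-1)

    # Cosine-similarity logits between every visual/semantic pair.
    logits = v @ s.t() / temperature                    # shape (C, C)
    targets = torch.arange(v.size(0), device=v.device)  # diagonal entries are positives

    # Symmetric InfoNCE: visual-to-semantic and semantic-to-visual directions.
    loss_v2s = F.cross_entropy(logits, targets)
    loss_s2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2s + loss_s2v)

# Example: a 5-way episode with 512-dimensional features (assumed sizes).
loss = cross_modal_contrastive_loss(torch.randn(5, 512), torch.randn(5, 512))

Pulling each class's visual prototype toward its own semantic embedding while pushing it away from the other classes' embeddings is one common way to inject class-level semantic discrimination into limited visual data, which is the role the abstract attributes to the CCLM.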
Author Li, Bo
Yan, Yan
Wang, Hanzi
Wang, Xiao
Hu, Hai-Miao
Author_xml – sequence: 1
  givenname: Xiao
  surname: Wang
  fullname: Wang, Xiao
  email: xiaowang@stu.xmu.edu.cn
  organization: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China
– sequence: 2
  givenname: Yan
  orcidid: 0000-0002-3674-7160
  surname: Yan
  fullname: Yan, Yan
  email: yanyan@xmu.edu.cn
  organization: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China
– sequence: 3
  givenname: Hai-Miao
  orcidid: 0000-0001-6811-9209
  surname: Hu
  fullname: Hu, Hai-Miao
  email: frank0139@163.com
  organization: School of Computer Science and Engineering, Beihang University, Beijing, China
– sequence: 4
  givenname: Bo
  orcidid: 0000-0001-5980-4861
  surname: Li
  fullname: Li, Bo
  email: boli@buaa.edu.cn
  organization: School of Computer Science and Engineering, Beihang University, Beijing, China
– sequence: 5
  givenname: Hanzi
  orcidid: 0000-0002-6913-9786
  surname: Wang
  fullname: Wang, Hanzi
  email: hanzi.wang@xmu.edu.cn
  organization: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China
BackLink https://www.ncbi.nlm.nih.gov/pubmed/38252570 (View this record in MEDLINE/PubMed)
CODEN IIPRE4
CitedBy_id crossref_primary_10_1109_OJCS_2024_3406645
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024
DOI 10.1109/TIP.2024.3354104
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005-present
IEEE All-Society Periodicals Package (ASPP) Online
IEEE Electronic Library (IEL)
PubMed
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
MEDLINE - Academic
DatabaseTitle PubMed
CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
MEDLINE - Academic
DatabaseTitleList PubMed; Technology Research Database; MEDLINE - Academic
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: RIE
  name: IEEE Xplore
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
Engineering
EISSN 1941-0042
EndPage 1271
ExternalDocumentID 10_1109_TIP_2024_3354104
38252570
10411850
Genre orig-research
Journal Article
GrantInformation_xml – fundername: National Natural Science Foundation of China
  grantid: U21A20514; 62122011; 62372388; 62071404
  funderid: 10.13039/501100001809
– fundername: National Key Research and Development Program of China
  grantid: 2022ZD0160402
  funderid: 10.13039/501100012166
– fundername: Fuxiaquan National Independent Innovation Demonstration Zone Collaborative Innovation Platform Project
  grantid: 3502ZCQXT2022008
ISSN 1057-7149
IsPeerReviewed true
IsScholarly true
Language English
LinkModel DirectLink
ORCID 0000-0002-3674-7160
0000-0001-5980-4861
0000-0002-6913-9786
0000-0001-6811-9209
PMID 38252570
PQID 2926267309
PQPubID 85429
PageCount 15
ParticipantIDs proquest_miscellaneous_2917866507
proquest_journals_2926267309
crossref_primary_10_1109_TIP_2024_3354104
ieee_primary_10411850
pubmed_primary_38252570
PublicationCentury 2000
PublicationDate 2024
PublicationDateYYYYMMDD 2024-01-01
PublicationDate_xml – year: 2024
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
– name: New York
PublicationTitle IEEE transactions on image processing
PublicationTitleAbbrev TIP
PublicationTitleAlternate IEEE Trans Image Process
PublicationYear 2024
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
SSID ssj0014516
SourceID proquest
crossref
pubmed
ieee
SourceType Aggregation Database
Index Database
Publisher
StartPage 1257
SubjectTerms action recognition
Activity recognition
Context
contrastive learning
Feature extraction
Few-shot learning
Frames (data processing)
Generative adversarial networks
Image recognition
Machine learning
meta-learning
Modules
Self-supervised learning
Semantics
Task analysis
Three-dimensional displays
video understanding
Visual discrimination
Visualization
Title Cross-Modal Contrastive Learning Network for Few-Shot Action Recognition
URI https://ieeexplore.ieee.org/document/10411850
https://www.ncbi.nlm.nih.gov/pubmed/38252570
https://www.proquest.com/docview/2926267309/abstract/
https://search.proquest.com/docview/2917866507
Volume 33