EAN: Event Adaptive Network for Enhanced Action Recognition

Bibliographic Details
Published in International Journal of Computer Vision, Vol. 130, no. 10, pp. 2453–2471
Main Authors Tian, Yuan; Yan, Yichao; Zhai, Guangtao; Guo, Guodong; Gao, Zhiyong
Format Journal Article
Language English
Published New York: Springer US, 01.10.2022
Springer
Springer Nature B.V

Abstract Efficiently modeling spatial–temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ the convolution operator and dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions have fixed scales and therefore struggle with events of various scales. On the other hand, the dense interaction modeling paradigm achieves only sub-optimal performance because action-irrelevant parts introduce additional noise into the final prediction. In this paper, we propose a unified action recognition framework that accommodates the dynamic nature of video content through the following designs. First, when extracting local cues, we generate dynamic-scale spatial–temporal kernels to adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions among only a few selected foreground objects with a Transformer, which yields a sparse paradigm. We call the proposed framework the Event Adaptive Network (EAN) because both key designs are adaptive to the input video content. To exploit the short-term motions within local segments, we further propose a novel and efficient Latent Motion Code module, which improves the performance of the framework. Extensive experiments on several large-scale video datasets, e.g., Something-Something V1 & V2, Kinetics, and Diving48, verify that our models achieve state-of-the-art or competitive performance at low FLOPs. Code is available at: https://github.com/tianyuan168326/EAN-Pytorch .
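The abstract describes two content-adaptive designs: dynamic-scale spatial–temporal kernels for extracting local cues, and sparse Transformer aggregation over a few selected foreground objects. Below is a minimal, hypothetical PyTorch sketch of those two ideas only. The module names (DynamicScaleConv, SparseTransformerAggregation), the scale-mixing gate, and the top-k token selection are illustrative assumptions, not the authors' implementation, which is available at the repository linked above; the Latent Motion Code module is not sketched here.

# Minimal, hypothetical sketch (not the authors' code): mixing fixed-scale
# 3D convolutions with input-dependent weights approximates a dynamic-scale
# spatial-temporal kernel, and a Transformer over the top-k most salient
# tokens stands in for sparse interaction among selected foreground objects.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicScaleConv(nn.Module):
    """Input-conditioned mixture of fixed-scale 3D convolutions."""

    def __init__(self, channels, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=k, padding=k // 2)
            for k in scales
        )
        # Global context -> softmax weights over the candidate scales.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(channels, len(scales)),
        )

    def forward(self, x):                        # x: (B, C, T, H, W)
        w = F.softmax(self.gate(x), dim=-1)      # (B, num_scales)
        return sum(
            w[:, i].view(-1, 1, 1, 1, 1) * branch(x)
            for i, branch in enumerate(self.branches)
        )


class SparseTransformerAggregation(nn.Module):
    """Keeps only the k most salient spatial-temporal tokens (a proxy for
    foreground objects) and models interactions among them with a Transformer."""

    def __init__(self, channels, num_tokens=8, num_heads=4):
        super().__init__()
        self.score = nn.Linear(channels, 1)      # per-token saliency score
        self.num_tokens = num_tokens
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):                        # x: (B, C, T, H, W)
        c = x.shape[1]
        tokens = x.flatten(2).transpose(1, 2)    # (B, T*H*W, C)
        scores = self.score(tokens).squeeze(-1)  # (B, T*H*W)
        idx = scores.topk(self.num_tokens, dim=1).indices
        selected = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, c))
        return self.encoder(selected).mean(dim=1)  # (B, C) video descriptor


if __name__ == "__main__":
    feats = torch.randn(2, 64, 8, 14, 14)        # (B, C, T, H, W) backbone features
    local = DynamicScaleConv(64)(feats)
    video_repr = SparseTransformerAggregation(64)(local)
    print(video_repr.shape)                      # torch.Size([2, 64])

Gating over a small set of fixed-scale branches and selecting top-k salient tokens are only stand-ins for the paper's event-adaptive kernel generation and foreground-object selection, chosen here to keep the sketch self-contained and runnable.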
Audience Academic
Authors
– Tian, Yuan (Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University)
– Yan, Yichao (yanyichao@sjtu.edu.cn; ORCID 0000-0003-3209-8965; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; AI Institute, Shanghai Jiao Tong University)
– Zhai, Guangtao (zhaiguangtao@sjtu.edu.cn; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University)
– Guo, Guodong (Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University)
– Gao, Zhiyong (Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University)
ContentType Journal Article
Copyright The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022. Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
COPYRIGHT 2022 Springer
DOI 10.1007/s11263-022-01661-1
Discipline Applied Sciences
Computer Science
EISSN 1573-1405
EndPage 2471
ISSN 0920-5691
IsPeerReviewed true
IsScholarly true
Issue 10
Keywords Action recognition
Dynamic neural networks
Vision transformers
Motion representation
Language English
ORCID 0000-0003-3209-8965
PageCount 19
PublicationDate 2022-10-01
PublicationPlace New York
PublicationTitle International journal of computer vision
PublicationTitleAbbrev Int J Comput Vis
PublicationYear 2022
Publisher Springer US
Springer
Springer Nature B.V
StartPage 2453
SubjectTerms Activity recognition
Artificial Intelligence
Computer Imaging
Computer Science
Design
Image Processing and Computer Vision
Interaction models
Modelling
Modules
Pattern Recognition
Pattern Recognition and Graphics
Sensors
Video
Vision