EAN: Event Adaptive Network for Enhanced Action Recognition

Bibliographic Details
Published in International Journal of Computer Vision, Vol. 130, no. 10, pp. 2453–2471
Main Authors Tian, Yuan; Yan, Yichao; Zhai, Guangtao; Guo, Guodong; Gao, Zhiyong
Format Journal Article
Language English
Published New York: Springer US, 01.10.2022
Springer
Springer Nature B.V

Abstract Efficiently modeling spatial–temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ the convolution operator and dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions have fixed scales and therefore struggle with events of various scales. On the other hand, the dense interaction modeling paradigm achieves only sub-optimal performance because action-irrelevant parts introduce additional noise into the final prediction. In this paper, we propose a unified action recognition framework that accommodates the dynamic nature of video content through the following designs. First, when extracting local cues, we generate dynamic-scale spatial–temporal kernels to adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions among only a few selected foreground objects with a Transformer, which yields a sparse paradigm. We call the proposed framework the Event Adaptive Network (EAN) because both key designs are adaptive to the input video content. To exploit the short-term motions within local segments, we further propose a novel and efficient Latent Motion Code module, which improves the performance of the framework. Extensive experiments on several large-scale video datasets, e.g., Something-Something V1 & V2, Kinetics, and Diving48, verify that our models achieve state-of-the-art or competitive performance at low FLOPs. Code is available at: https://github.com/tianyuan168326/EAN-Pytorch .
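The abstract describes two content-adaptive designs: dynamic-scale spatial–temporal kernels for extracting local cues, and sparse Transformer aggregation over a few selected foreground objects. Below is a minimal, hypothetical PyTorch sketch of those two ideas only. The module names (DynamicScaleConv, SparseTransformerAggregation), the scale-mixing gate, and the top-k token selection are illustrative assumptions, not the authors' implementation, which is available at the repository linked above; the Latent Motion Code module is not sketched here.

# Minimal, hypothetical sketch (not the authors' code): mixing fixed-scale
# 3D convolutions with input-dependent weights approximates a dynamic-scale
# spatial-temporal kernel, and a Transformer over the top-k most salient
# tokens stands in for sparse interaction among selected foreground objects.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicScaleConv(nn.Module):
    """Input-conditioned mixture of fixed-scale 3D convolutions."""

    def __init__(self, channels, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=k, padding=k // 2)
            for k in scales
        )
        # Global context -> softmax weights over the candidate scales.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(channels, len(scales)),
        )

    def forward(self, x):                        # x: (B, C, T, H, W)
        w = F.softmax(self.gate(x), dim=-1)      # (B, num_scales)
        return sum(
            w[:, i].view(-1, 1, 1, 1, 1) * branch(x)
            for i, branch in enumerate(self.branches)
        )


class SparseTransformerAggregation(nn.Module):
    """Keeps only the k most salient spatial-temporal tokens (a proxy for
    foreground objects) and models interactions among them with a Transformer."""

    def __init__(self, channels, num_tokens=8, num_heads=4):
        super().__init__()
        self.score = nn.Linear(channels, 1)      # per-token saliency score
        self.num_tokens = num_tokens
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):                        # x: (B, C, T, H, W)
        c = x.shape[1]
        tokens = x.flatten(2).transpose(1, 2)    # (B, T*H*W, C)
        scores = self.score(tokens).squeeze(-1)  # (B, T*H*W)
        idx = scores.topk(self.num_tokens, dim=1).indices
        selected = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, c))
        return self.encoder(selected).mean(dim=1)  # (B, C) video descriptor


if __name__ == "__main__":
    feats = torch.randn(2, 64, 8, 14, 14)        # (B, C, T, H, W) backbone features
    local = DynamicScaleConv(64)(feats)
    video_repr = SparseTransformerAggregation(64)(local)
    print(video_repr.shape)                      # torch.Size([2, 64])

Gating over a small set of fixed-scale branches and selecting top-k salient tokens are only stand-ins for the paper's event-adaptive kernel generation and foreground-object selection, chosen here to keep the sketch self-contained and runnable.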
Audience Academic
Authors
– Tian, Yuan (Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University)
– Yan, Yichao (yanyichao@sjtu.edu.cn; ORCID 0000-0003-3209-8965; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; AI Institute, Shanghai Jiao Tong University)
– Zhai, Guangtao (zhaiguangtao@sjtu.edu.cn; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University)
– Guo, Guodong (Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University)
– Gao, Zhiyong (Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University)
ContentType Journal Article
Copyright The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022. Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
COPYRIGHT 2022 Springer
DOI 10.1007/s11263-022-01661-1
Discipline Applied Sciences
Computer Science
EISSN 1573-1405
EndPage 2471
ISSN 0920-5691
IsPeerReviewed true
IsScholarly true
Issue 10
Keywords Action recognition
Dynamic neural networks
Vision transformers
Motion representation
Language English
ORCID 0000-0003-3209-8965
PageCount 19
PublicationDate 2022-10-01
PublicationPlace New York
PublicationTitle International journal of computer vision
PublicationTitleAbbrev Int J Comput Vis
PublicationYear 2022
Publisher Springer US
Springer
Springer Nature B.V
StartPage 2453
SubjectTerms Activity recognition
Artificial Intelligence
Computer Imaging
Computer Science
Design
Image Processing and Computer Vision
Interaction models
Modelling
Modules
Pattern Recognition
Pattern Recognition and Graphics
Sensors
Video
Vision