P2T: Pyramid Pooling Transformer for Scene Understanding

Bibliographic Details
Published in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 11, pp. 12760-12771
Main Authors Wu, Yu-Huan, Liu, Yun, Zhan, Xin, Cheng, Ming-Ming
Format Journal Article
Language English
Published New York IEEE 01.11.2023
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects
Online Access Get full text

Abstract Recently, the vision transformer has achieved great success by pushing the state of the art in various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.
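To make the mechanism described in the abstract concrete, the sketch below (PyTorch) computes the keys and values of multi-head self-attention from a small pyramid of average-pooled feature maps, so attention is evaluated against a much shorter pooled sequence while each query still sees context summarized at several scales. This is a minimal illustration reconstructed from the abstract alone: the class name PyramidPoolingAttention, the pool_sizes parameter, and all internal details are assumptions, not the authors' implementation; the official code is at https://github.com/yuhuan-wu/P2T.

# Minimal sketch of pooling-based MHSA in the spirit of P2T (illustrative, not the official code).
import torch
import torch.nn as nn

class PyramidPoolingAttention(nn.Module):
    def __init__(self, dim, num_heads=8, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # One adaptive average pool per pyramid level; their outputs are concatenated.
        # (The paper uses pooling ratios; fixed output sizes are an assumption here.)
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(s) for s in pool_sizes)

    def forward(self, x, H, W):
        # x: (B, N, C) tokens of an H x W feature map, N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Build the shortened key/value sequence from the pooling pyramid.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = [p(feat).flatten(2) for p in self.pools]      # each (B, C, s*s)
        pooled = torch.cat(pooled, dim=2).transpose(1, 2)      # (B, M, C) with M << N

        kv = self.kv(pooled).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)                       # each (B, heads, M, C/heads)

        attn = (q @ k.transpose(-2, -1)) * self.scale          # (B, heads, N, M): linear in N
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

if __name__ == "__main__":
    tokens = torch.randn(2, 56 * 56, 64)                       # toy 56x56 feature map, C = 64
    ppa = PyramidPoolingAttention(dim=64, num_heads=2)
    print(ppa(tokens, 56, 56).shape)                           # torch.Size([2, 3136, 64])

The key point is that the attention matrix has shape (N, M), where M is the total number of pooled tokens (50 here) rather than N = H*W, which is what removes the quadratic cost in the token count.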
Author Zhan, Xin
Wu, Yu-Huan
Cheng, Ming-Ming
Liu, Yun
Author_xml – sequence: 1
  givenname: Yu-Huan
  orcidid: 0000-0001-8666-3435
  surname: Wu
  fullname: Wu, Yu-Huan
  email: wuyuhuan@mail.nankai.edu.cn
  organization: TMCC, College of Computer Science, Nankai University, Tianjin, China
– sequence: 2
  givenname: Yun
  orcidid: 0000-0001-6143-0264
  surname: Liu
  fullname: Liu, Yun
  email: vagrantlyun@gmail.com
  organization: Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
– sequence: 3
  givenname: Xin
  surname: Zhan
  fullname: Zhan, Xin
  email: zhanxin.zx@alibabainc.com
  organization: Alibaba DAMO Academy, Hangzhou, China
– sequence: 4
  givenname: Ming-Ming
  orcidid: 0000-0001-5550-8758
  surname: Cheng
  fullname: Cheng, Ming-Ming
  email: cmm@nankai.edu.cn
  organization: TMCC, College of Computer Science, Nankai University, Tianjin, China
CODEN ITPIDJ
CitedBy_id crossref_primary_10_1109_TAES_2024_3382622
crossref_primary_10_1109_TPAMI_2024_3476683
crossref_primary_10_1016_j_bspc_2024_107189
crossref_primary_10_1049_ipr2_12895
crossref_primary_10_1109_LGRS_2024_3365509
crossref_primary_10_1109_TGRS_2023_3306018
crossref_primary_10_1016_j_eswa_2025_126727
crossref_primary_10_1109_JIOT_2024_3378701
crossref_primary_10_1109_TSMC_2025_3526234
crossref_primary_10_1007_s11263_024_02247_9
crossref_primary_10_3390_rs16224126
crossref_primary_10_1007_s10489_024_05910_3
crossref_primary_10_1109_TIM_2024_3418104
crossref_primary_10_1029_2023GL103979
crossref_primary_10_1109_ACCESS_2024_3352428
crossref_primary_10_1016_j_engappai_2024_108309
crossref_primary_10_1142_S0129065725500157
crossref_primary_10_1038_s41598_023_40175_9
crossref_primary_10_3390_math11122665
crossref_primary_10_23919_ICN_2024_0023
crossref_primary_10_1016_j_compag_2024_109656
crossref_primary_10_1007_s11517_023_02852_9
crossref_primary_10_1109_TPAMI_2023_3309979
crossref_primary_10_3390_app132111657
crossref_primary_10_1016_j_ins_2024_121855
crossref_primary_10_3390_s25030828
crossref_primary_10_1109_LSP_2024_3365037
crossref_primary_10_1016_j_gloei_2024_11_016
crossref_primary_10_1109_TPAMI_2023_3330825
crossref_primary_10_1109_JSTARS_2024_3461152
crossref_primary_10_20965_jaciii_2023_p1096
crossref_primary_10_1109_TIP_2024_3359816
crossref_primary_10_1016_j_eswa_2024_125427
crossref_primary_10_1016_j_inffus_2024_102401
crossref_primary_10_1109_TIM_2023_3325520
crossref_primary_10_1007_s11263_023_01894_8
crossref_primary_10_1109_ACCESS_2023_3299597
crossref_primary_10_1109_TCSS_2024_3404611
crossref_primary_10_1109_TCSVT_2024_3417607
crossref_primary_10_1109_TIP_2024_3432328
crossref_primary_10_1109_JSTARS_2024_3365729
crossref_primary_10_3390_rs15194817
crossref_primary_10_1016_j_displa_2024_102802
crossref_primary_10_1016_j_imavis_2025_105487
crossref_primary_10_1038_s41598_025_92954_1
crossref_primary_10_1016_j_compbiomed_2023_107336
crossref_primary_10_1016_j_neucom_2024_129204
crossref_primary_10_1109_ACCESS_2024_3507272
crossref_primary_10_1109_TMM_2024_3396281
crossref_primary_10_1109_TMM_2024_3372835
crossref_primary_10_3390_app13169226
crossref_primary_10_3390_s23094206
crossref_primary_10_1109_ACCESS_2024_3513697
crossref_primary_10_1109_JSTARS_2025_3527213
crossref_primary_10_1109_TPAMI_2024_3432168
crossref_primary_10_1007_s41095_023_0364_2
crossref_primary_10_1109_TGRS_2023_3313800
crossref_primary_10_1109_LGRS_2023_3314435
crossref_primary_10_1109_TPAMI_2024_3485898
crossref_primary_10_1109_OJVT_2025_3541891
crossref_primary_10_3390_math10203752
crossref_primary_10_3390_rs17040707
crossref_primary_10_1007_s00521_024_10696_z
crossref_primary_10_1021_acssensors_4c01584
crossref_primary_10_1109_ACCESS_2025_3529812
crossref_primary_10_1049_cit2_12296
crossref_primary_10_1109_TMI_2024_3377248
crossref_primary_10_1016_j_asoc_2025_112950
crossref_primary_10_3389_fpls_2024_1425131
crossref_primary_10_1109_LGRS_2023_3336061
crossref_primary_10_1016_j_asoc_2024_112557
crossref_primary_10_1016_j_eswa_2025_126385
crossref_primary_10_1007_s11042_023_16898_2
crossref_primary_10_3390_electronics11234060
crossref_primary_10_1007_s11227_024_06205_7
crossref_primary_10_1109_TIM_2024_3375987
crossref_primary_10_1371_journal_pone_0262689
crossref_primary_10_1007_s40747_023_01296_w
crossref_primary_10_1109_TPAMI_2024_3408642
crossref_primary_10_3390_electronics12153322
crossref_primary_10_1007_s00371_024_03360_z
crossref_primary_10_1007_s10489_024_05369_2
crossref_primary_10_1109_TGRS_2024_3499363
crossref_primary_10_1109_TGRS_2024_3468876
crossref_primary_10_3390_s23104688
crossref_primary_10_1016_j_compeleceng_2024_109209
crossref_primary_10_1109_TGRS_2024_3400032
crossref_primary_10_1016_j_imavis_2024_105048
crossref_primary_10_1063_5_0153511
crossref_primary_10_1007_s10489_024_05743_0
crossref_primary_10_1117_1_JEI_33_1_013044
crossref_primary_10_1109_TMM_2023_3275308
crossref_primary_10_1016_j_eswa_2025_127004
crossref_primary_10_1109_TPAMI_2023_3248583
crossref_primary_10_1016_j_neunet_2024_106489
Cites_doi 10.1007/s11263-021-01465-9
10.1109/CVPRW.2018.00133
10.1109/ICCV48922.2021.00299
10.1109/tpami.2021.3140168
10.1109/ICCV48922.2021.00675
10.1109/CVPR.2019.00293
10.1109/tpami.2021.3134684
10.1007/978-3-319-10602-1_48
10.1109/TPAMI.2017.2699184
10.1109/ICCV.2017.31
10.1109/CVPR.2018.00474
10.1007/s41095-022-0274-8
10.1109/CVPR46437.2021.00542
10.1109/CVPR.2016.90
10.1109/CVPR.2016.350
10.1109/ICCV48922.2021.01172
10.1007/s11263-009-0275-4
10.3115/v1/W14-3302
10.1109/TPAMI.2019.2913372
10.1109/CVPR.2006.68
10.1007/s10462-020-09825-6
10.1109/CVPR.2019.00656
10.1109/TIP.2021.3065822
10.1109/ICCV.2005.239
10.1016/j.patcog.2020.107622
10.1109/ICCV.2017.324
10.1007/978-3-030-00934-2_3
10.1007/978-3-030-01228-1_15
10.1109/CVPR.2017.544
10.1109/ICCV.2019.00140
10.1016/j.ins.2020.02.067
10.1109/CVPR.2018.00337
10.1007/978-3-030-58452-8_13
10.1109/ICCV48922.2021.01204
10.1109/CVPR.2018.00716
10.1109/ICCV48922.2021.00060
10.1109/ICCV48922.2021.00062
10.1007/s11263-015-0816-y
10.1109/TPAMI.2015.2389824
10.1109/WACV48630.2021.00374
10.1109/CVPR.2015.7298594
10.1109/ICCV.2017.433
10.1109/tpami.2019.2918284
10.1109/TPAMI.2018.2844175
10.1109/CVPR.2018.00567
10.1109/ICCV48922.2021.00986
10.1201/9781420010749
10.1109/ICCV48922.2021.00009
10.1109/ICCV48922.2021.00061
10.1109/TCSVT.2019.2920407
10.1016/j.neucom.2016.12.038
10.1109/CVPRW.2015.7301274
10.1007/978-3-030-01264-9_8
10.1109/iccv48922.2021.00147
10.1109/TPAMI.2019.2938758
10.1109/CVPR.2017.660
10.1109/CVPR.2017.634
10.1109/ICCVW54120.2021.00210
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023
DBID 97E
RIA
RIE
AAYXX
CITATION
7SC
7SP
8FD
JQ2
L7M
L~C
L~D
7X8
DOI 10.1109/TPAMI.2022.3202765
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
MEDLINE - Academic
DatabaseTitle CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
MEDLINE - Academic
DatabaseTitleList MEDLINE - Academic

Technology Research Database
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
EISSN 2160-9292
1939-3539
EndPage 12771
ExternalDocumentID 10_1109_TPAMI_2022_3202765
9870559
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China; NSFC
  grantid: 61922046
  funderid: 10.13039/501100001809
– fundername: Alibaba Research Intern Program
– fundername: New Generation of AI
  grantid: 2018AAA0100400
– fundername: Alibaba Innovative Research
– fundername: Agency for Science, Technology and Research
  funderid: 10.13039/501100001348
– fundername: AME Programmatic Funds
  grantid: A1892b0026; A19E3b0099
GroupedDBID ---
-DZ
-~X
.DC
0R~
29I
4.4
53G
5GY
6IK
97E
AAJGR
AARMG
AASAJ
AAWTH
ABAZT
ABQJQ
ABVLG
ACGFO
ACGFS
ACIWK
ACNCT
AENEX
AGQYO
AHBIQ
AKJIK
AKQYR
ALMA_UNASSIGNED_HOLDINGS
ASUFR
ATWAV
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CS3
DU5
E.L
EBS
EJD
F5P
HZ~
IEDLZ
IFIPE
IPLJI
JAVBF
LAI
M43
MS~
O9-
OCL
P2P
PQQKQ
RIA
RIE
RNS
RXW
TAE
TN5
UHB
~02
AAYXX
CITATION
7SC
7SP
8FD
JQ2
L7M
L~C
L~D
7X8
ID FETCH-LOGICAL-c328t-e9aa4156ebab60ec00dfb554c313fc6e9349090fe1ffdb9b00db01fccb3a06963
IEDL.DBID RIE
ISSN 0162-8828
1939-3539
IngestDate Fri Jul 11 02:35:05 EDT 2025
Mon Jun 30 06:22:43 EDT 2025
Thu Apr 24 23:04:15 EDT 2025
Tue Jul 01 01:43:04 EDT 2025
Wed Aug 27 02:24:54 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 11
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c328t-e9aa4156ebab60ec00dfb554c313fc6e9349090fe1ffdb9b00db01fccb3a06963
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
content type line 23
ORCID 0000-0001-6143-0264
0000-0001-8666-3435
0000-0001-5550-8758
PMID 36040936
PQID 2872440427
PQPubID 85458
PageCount 12
ParticipantIDs proquest_miscellaneous_2708259074
proquest_journals_2872440427
ieee_primary_9870559
crossref_citationtrail_10_1109_TPAMI_2022_3202765
crossref_primary_10_1109_TPAMI_2022_3202765
ProviderPackageCode CITATION
AAYXX
PublicationCentury 2000
PublicationDate 2023-11-01
PublicationDateYYYYMMDD 2023-11-01
PublicationDate_xml – month: 11
  year: 2023
  text: 2023-11-01
  day: 01
PublicationDecade 2020
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on pattern analysis and machine intelligence
PublicationTitleAbbrev TPAMI
PublicationYear 2023
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
References ref13
ref57
ref12
ref56
ref59
ref14
ref58
Glorot (ref77)
ref53
Hu (ref47) 2021
ref11
ref55
ref10
ref54
ref16
ref18
Han (ref33) 2021
Chu (ref28)
Chu (ref50) 2021
Liu (ref24) 2021
ref46
ref45
ref42
ref41
ref44
Simonyan (ref2)
ref43
ref49
ref8
ref9
ref4
ref3
ref6
ref5
ref40
Tan (ref7)
ref35
ref34
ref37
ref36
ref31
ref75
ref30
Li (ref52) 2021
ref74
ref32
ref1
ref39
Dosovitskiy (ref19)
Contributors (ref78) 2020
ref71
Jiang (ref51) 2021
ref73
Zhu (ref17) 2020
ref68
Hendrycks (ref72) 2016
ref23
ref67
ref26
Dong (ref70) 2021
ref25
ref20
ref64
Touvron (ref48)
ref63
ref22
ref66
ref21
ref65
Loshchilov (ref76)
Howard (ref38) 2017
ref27
ref29
Ba (ref69) 2016
ref60
ref62
Vaswani (ref15)
ref61
Chen (ref79) 2019
References_xml – ident: ref56
  doi: 10.1007/s11263-021-01465-9
– ident: ref65
  doi: 10.1109/CVPRW.2018.00133
– ident: ref73
  doi: 10.1109/ICCV48922.2021.00299
– ident: ref63
  doi: 10.1109/tpami.2021.3140168
– start-page: 6000
  volume-title: Proc. Adv. Neural Inform. Process. Syst.
  ident: ref15
  article-title: Attention is all you need
– ident: ref23
  doi: 10.1109/ICCV48922.2021.00675
– ident: ref42
  doi: 10.1109/CVPR.2019.00293
– ident: ref8
  doi: 10.1109/tpami.2021.3134684
– ident: ref11
  doi: 10.1007/978-3-319-10602-1_48
– ident: ref54
  doi: 10.1109/TPAMI.2017.2699184
– ident: ref61
  doi: 10.1109/ICCV.2017.31
– ident: ref39
  doi: 10.1109/CVPR.2018.00474
– ident: ref29
  doi: 10.1007/s41095-022-0274-8
– ident: ref45
  doi: 10.1109/CVPR46437.2021.00542
– ident: ref4
  doi: 10.1109/CVPR.2016.90
– start-page: 249
  volume-title: Proc. Int. Conf. Artif. Intell. Statist.
  ident: ref77
  article-title: Understanding the difficulty of training deep feedforward neural networks
– ident: ref13
  doi: 10.1109/CVPR.2016.350
– ident: ref20
  doi: 10.1109/ICCV48922.2021.01172
– ident: ref12
  doi: 10.1007/s11263-009-0275-4
– ident: ref30
  doi: 10.3115/v1/W14-3302
– ident: ref6
  doi: 10.1109/TPAMI.2019.2913372
– ident: ref35
  doi: 10.1109/CVPR.2006.68
– volume-title: Proc. Int. Conf. Learn. Represent.
  ident: ref19
  article-title: An image is worth 16x16 words: Transformers for image recognition at scale
– ident: ref43
  doi: 10.1007/s10462-020-09825-6
– ident: ref26
  doi: 10.1109/CVPR.2019.00656
– ident: ref64
  doi: 10.1109/TIP.2021.3065822
– ident: ref34
  doi: 10.1109/ICCV.2005.239
– ident: ref57
  doi: 10.1016/j.patcog.2020.107622
– ident: ref74
  doi: 10.1109/ICCV.2017.324
– year: 2021
  ident: ref70
  article-title: Attention is not all you need: Pure attention loses rank doubly exponentially with depth
– ident: ref55
  doi: 10.1007/978-3-030-00934-2_3
– year: 2020
  ident: ref17
  article-title: Deformable DETR: Deformable transformers for end-to-end object detection
– ident: ref59
  doi: 10.1007/978-3-030-01228-1_15
– ident: ref14
  doi: 10.1109/CVPR.2017.544
– ident: ref71
  doi: 10.1109/ICCV.2019.00140
– ident: ref60
  doi: 10.1016/j.ins.2020.02.067
– ident: ref67
  doi: 10.1109/CVPR.2018.00337
– ident: ref16
  doi: 10.1007/978-3-030-58452-8_13
– volume-title: Proc. Int. Conf. Learn. Represent.
  ident: ref76
  article-title: Decoupled weight decay regularization
– ident: ref32
  doi: 10.1109/ICCV48922.2021.01204
– volume-title: Proc. Int. Conf. Learn. Represent.
  ident: ref2
  article-title: Very deep convolutional networks for large-scale image recognition
– ident: ref41
  doi: 10.1109/CVPR.2018.00716
– ident: ref49
  doi: 10.1109/ICCV48922.2021.00060
– ident: ref53
  doi: 10.1109/ICCV48922.2021.00062
– year: 2016
  ident: ref72
  article-title: Gaussian error linear units (GELUs)
– ident: ref10
  doi: 10.1007/s11263-015-0816-y
– ident: ref36
  doi: 10.1109/TPAMI.2015.2389824
– ident: ref46
  doi: 10.1109/WACV48630.2021.00374
– ident: ref3
  doi: 10.1109/CVPR.2015.7298594
– ident: ref68
  doi: 10.1109/ICCV.2017.433
– ident: ref5
  doi: 10.1109/tpami.2019.2918284
– year: 2021
  ident: ref52
  article-title: LocalViT: Bringing locality to vision transformers
– ident: ref75
  doi: 10.1109/TPAMI.2018.2844175
– year: 2020
  ident: ref78
  article-title: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark
– year: 2021
  ident: ref24
  article-title: Transformer in convolutional neural networks
– ident: ref62
  doi: 10.1109/CVPR.2018.00567
– start-page: 6105
  volume-title: Proc. Int. Conf. Mach. Learn.
  ident: ref7
  article-title: EfficientNet: Rethinking model scaling for convolutional neural networks
– ident: ref22
  doi: 10.1109/ICCV48922.2021.00986
– year: 2021
  ident: ref50
  article-title: Conditional positional encodings for vision transformers
– ident: ref1
  doi: 10.1201/9781420010749
– year: 2021
  ident: ref51
  article-title: Token labeling: Training a 85.5% top-1 accuracy vision transformer with 56M parameters on ImageNet
– ident: ref31
  doi: 10.1109/ICCV48922.2021.00009
– year: 2021
  ident: ref33
  article-title: Demystifying local vision transformer: Sparse connectivity, weight sharing, and dynamic weight
– start-page: 10347
  volume-title: Proc. Int. Conf. Mach. Learn.
  ident: ref48
  article-title: Training data-efficient image transformers & distillation through attention
– ident: ref21
  doi: 10.1109/ICCV48922.2021.00061
– ident: ref66
  doi: 10.1109/TCSVT.2019.2920407
– ident: ref44
  doi: 10.1016/j.neucom.2016.12.038
– ident: ref58
  doi: 10.1109/CVPRW.2015.7301274
– ident: ref40
  doi: 10.1007/978-3-030-01264-9_8
– ident: ref18
  doi: 10.1109/iccv48922.2021.00147
– ident: ref9
  doi: 10.1109/TPAMI.2019.2938758
– year: 2019
  ident: ref79
  article-title: MMDetection: Open MMLab detection toolbox and benchmark
– ident: ref37
  doi: 10.1109/CVPR.2017.660
– year: 2017
  ident: ref38
  article-title: MobileNets: Efficient convolutional neural networks for mobile vision applications
– start-page: 9355
  volume-title: Proc. Adv. Neural Inform. Process. Syst.
  ident: ref28
  article-title: Twins: Revisiting the design of spatial attention in vision transformers
– year: 2016
  ident: ref69
  article-title: Layer normalization
– ident: ref27
  doi: 10.1109/CVPR.2017.634
– year: 2021
  ident: ref47
  article-title: ISTR: End-to-end instance segmentation with transformers
– ident: ref25
  doi: 10.1109/ICCVW54120.2021.00210
SSID ssj0014503
Score 2.6993296
Snippet Recently, the vision transformer has achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in...
SourceID proquest
crossref
ieee
SourceType Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 12760
SubjectTerms backbone network
Computational modeling
Computer networks
Convolution
efficient self-attention
Feature extraction
Image classification
Image segmentation
Network design
Object recognition
pyramid pooling
Scene analysis
scene understanding
Semantic segmentation
Semantics
Task analysis
Transformer
Transformers
Title P2T: Pyramid Pooling Transformer for Scene Understanding
URI https://ieeexplore.ieee.org/document/9870559
https://www.proquest.com/docview/2872440427
https://www.proquest.com/docview/2708259074
Volume 45
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE