P2T: Pyramid Pooling Transformer for Scene Understanding

Bibliographic Details
Published in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 11, pp. 12760-12771
Main Authors Wu, Yu-Huan, Liu, Yun, Zhan, Xin, Cheng, Ming-Ming
Format Journal Article
Language English
Published New York IEEE 01.11.2023
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects
Online Access Get full text

Abstract Recently, the vision transformer has achieved great success by pushing the state of the art in various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.
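To make the mechanism described in the abstract concrete, the sketch below (PyTorch) computes the keys and values of multi-head self-attention from a small pyramid of average-pooled feature maps, so attention is evaluated against a much shorter pooled sequence while each query still sees context summarized at several scales. This is a minimal illustration reconstructed from the abstract alone: the class name PyramidPoolingAttention, the pool_sizes parameter, and all internal details are assumptions, not the authors' implementation; the official code is at https://github.com/yuhuan-wu/P2T.

# Minimal sketch of pooling-based MHSA in the spirit of P2T (illustrative, not the official code).
import torch
import torch.nn as nn

class PyramidPoolingAttention(nn.Module):
    def __init__(self, dim, num_heads=8, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # One adaptive average pool per pyramid level; their outputs are concatenated.
        # (The paper uses pooling ratios; fixed output sizes are an assumption here.)
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(s) for s in pool_sizes)

    def forward(self, x, H, W):
        # x: (B, N, C) tokens of an H x W feature map, N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Build the shortened key/value sequence from the pooling pyramid.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = [p(feat).flatten(2) for p in self.pools]      # each (B, C, s*s)
        pooled = torch.cat(pooled, dim=2).transpose(1, 2)      # (B, M, C) with M << N

        kv = self.kv(pooled).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)                       # each (B, heads, M, C/heads)

        attn = (q @ k.transpose(-2, -1)) * self.scale          # (B, heads, N, M): linear in N
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

if __name__ == "__main__":
    tokens = torch.randn(2, 56 * 56, 64)                       # toy 56x56 feature map, C = 64
    ppa = PyramidPoolingAttention(dim=64, num_heads=2)
    print(ppa(tokens, 56, 56).shape)                           # torch.Size([2, 3136, 64])

The key point is that the attention matrix has shape (N, M), where M is the total number of pooled tokens (50 here) rather than N = H*W, which is what removes the quadratic cost in the token count.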
Author Zhan, Xin
Wu, Yu-Huan
Cheng, Ming-Ming
Liu, Yun
Author_xml – sequence: 1
  givenname: Yu-Huan
  orcidid: 0000-0001-8666-3435
  surname: Wu
  fullname: Wu, Yu-Huan
  email: wuyuhuan@mail.nankai.edu.cn
  organization: TMCC, College of Computer Science, Nankai University, Tianjin, China
– sequence: 2
  givenname: Yun
  orcidid: 0000-0001-6143-0264
  surname: Liu
  fullname: Liu, Yun
  email: vagrantlyun@gmail.com
  organization: Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
– sequence: 3
  givenname: Xin
  surname: Zhan
  fullname: Zhan, Xin
  email: zhanxin.zx@alibabainc.com
  organization: Alibaba DAMO Academy, Hangzhou, China
– sequence: 4
  givenname: Ming-Ming
  orcidid: 0000-0001-5550-8758
  surname: Cheng
  fullname: Cheng, Ming-Ming
  email: cmm@nankai.edu.cn
  organization: TMCC, College of Computer Science, Nankai University, Tianjin, China
CODEN ITPIDJ
CitedBy_id crossref_primary_10_1109_TAES_2024_3382622
crossref_primary_10_1109_TPAMI_2024_3476683
crossref_primary_10_1016_j_bspc_2024_107189
crossref_primary_10_1049_ipr2_12895
crossref_primary_10_1109_LGRS_2024_3365509
crossref_primary_10_1109_TGRS_2023_3306018
crossref_primary_10_1016_j_eswa_2025_126727
crossref_primary_10_1109_JIOT_2024_3378701
crossref_primary_10_1109_TSMC_2025_3526234
crossref_primary_10_1007_s11263_024_02247_9
crossref_primary_10_3390_rs16224126
crossref_primary_10_1007_s10489_024_05910_3
crossref_primary_10_1109_TIM_2024_3418104
crossref_primary_10_1029_2023GL103979
crossref_primary_10_1109_ACCESS_2024_3352428
crossref_primary_10_1016_j_engappai_2024_108309
crossref_primary_10_1142_S0129065725500157
crossref_primary_10_1038_s41598_023_40175_9
crossref_primary_10_3390_math11122665
crossref_primary_10_23919_ICN_2024_0023
crossref_primary_10_1016_j_compag_2024_109656
crossref_primary_10_1007_s11517_023_02852_9
crossref_primary_10_1109_TPAMI_2023_3309979
crossref_primary_10_3390_app132111657
crossref_primary_10_1016_j_ins_2024_121855
crossref_primary_10_3390_s25030828
crossref_primary_10_1109_LSP_2024_3365037
crossref_primary_10_1016_j_gloei_2024_11_016
crossref_primary_10_1109_TPAMI_2023_3330825
crossref_primary_10_1109_JSTARS_2024_3461152
crossref_primary_10_20965_jaciii_2023_p1096
crossref_primary_10_1109_TIP_2024_3359816
crossref_primary_10_1016_j_eswa_2024_125427
crossref_primary_10_1016_j_inffus_2024_102401
crossref_primary_10_1109_TIM_2023_3325520
crossref_primary_10_1007_s11263_023_01894_8
crossref_primary_10_1109_ACCESS_2023_3299597
crossref_primary_10_1109_TCSS_2024_3404611
crossref_primary_10_1109_TCSVT_2024_3417607
crossref_primary_10_1109_TIP_2024_3432328
crossref_primary_10_1109_JSTARS_2024_3365729
crossref_primary_10_3390_rs15194817
crossref_primary_10_1016_j_displa_2024_102802
crossref_primary_10_1016_j_imavis_2025_105487
crossref_primary_10_1038_s41598_025_92954_1
crossref_primary_10_1016_j_compbiomed_2023_107336
crossref_primary_10_1016_j_neucom_2024_129204
crossref_primary_10_1109_ACCESS_2024_3507272
crossref_primary_10_1109_TMM_2024_3396281
crossref_primary_10_1109_TMM_2024_3372835
crossref_primary_10_3390_app13169226
crossref_primary_10_3390_s23094206
crossref_primary_10_1109_ACCESS_2024_3513697
crossref_primary_10_1109_JSTARS_2025_3527213
crossref_primary_10_1109_TPAMI_2024_3432168
crossref_primary_10_1007_s41095_023_0364_2
crossref_primary_10_1109_TGRS_2023_3313800
crossref_primary_10_1109_LGRS_2023_3314435
crossref_primary_10_1109_TPAMI_2024_3485898
crossref_primary_10_1109_OJVT_2025_3541891
crossref_primary_10_3390_math10203752
crossref_primary_10_3390_rs17040707
crossref_primary_10_1007_s00521_024_10696_z
crossref_primary_10_1021_acssensors_4c01584
crossref_primary_10_1109_ACCESS_2025_3529812
crossref_primary_10_1049_cit2_12296
crossref_primary_10_1109_TMI_2024_3377248
crossref_primary_10_1016_j_asoc_2025_112950
crossref_primary_10_3389_fpls_2024_1425131
crossref_primary_10_1109_LGRS_2023_3336061
crossref_primary_10_1016_j_asoc_2024_112557
crossref_primary_10_1016_j_eswa_2025_126385
crossref_primary_10_1007_s11042_023_16898_2
crossref_primary_10_3390_electronics11234060
crossref_primary_10_1007_s11227_024_06205_7
crossref_primary_10_1109_TIM_2024_3375987
crossref_primary_10_1371_journal_pone_0262689
crossref_primary_10_1007_s40747_023_01296_w
crossref_primary_10_1109_TPAMI_2024_3408642
crossref_primary_10_3390_electronics12153322
crossref_primary_10_1007_s00371_024_03360_z
crossref_primary_10_1007_s10489_024_05369_2
crossref_primary_10_1109_TGRS_2024_3499363
crossref_primary_10_1109_TGRS_2024_3468876
crossref_primary_10_3390_s23104688
crossref_primary_10_1016_j_compeleceng_2024_109209
crossref_primary_10_1109_TGRS_2024_3400032
crossref_primary_10_1016_j_imavis_2024_105048
crossref_primary_10_1063_5_0153511
crossref_primary_10_1007_s10489_024_05743_0
crossref_primary_10_1117_1_JEI_33_1_013044
crossref_primary_10_1109_TMM_2023_3275308
crossref_primary_10_1016_j_eswa_2025_127004
crossref_primary_10_1109_TPAMI_2023_3248583
crossref_primary_10_1016_j_neunet_2024_106489
Cites_doi 10.1007/s11263-021-01465-9
10.1109/CVPRW.2018.00133
10.1109/ICCV48922.2021.00299
10.1109/tpami.2021.3140168
10.1109/ICCV48922.2021.00675
10.1109/CVPR.2019.00293
10.1109/tpami.2021.3134684
10.1007/978-3-319-10602-1_48
10.1109/TPAMI.2017.2699184
10.1109/ICCV.2017.31
10.1109/CVPR.2018.00474
10.1007/s41095-022-0274-8
10.1109/CVPR46437.2021.00542
10.1109/CVPR.2016.90
10.1109/CVPR.2016.350
10.1109/ICCV48922.2021.01172
10.1007/s11263-009-0275-4
10.3115/v1/W14-3302
10.1109/TPAMI.2019.2913372
10.1109/CVPR.2006.68
10.1007/s10462-020-09825-6
10.1109/CVPR.2019.00656
10.1109/TIP.2021.3065822
10.1109/ICCV.2005.239
10.1016/j.patcog.2020.107622
10.1109/ICCV.2017.324
10.1007/978-3-030-00934-2_3
10.1007/978-3-030-01228-1_15
10.1109/CVPR.2017.544
10.1109/ICCV.2019.00140
10.1016/j.ins.2020.02.067
10.1109/CVPR.2018.00337
10.1007/978-3-030-58452-8_13
10.1109/ICCV48922.2021.01204
10.1109/CVPR.2018.00716
10.1109/ICCV48922.2021.00060
10.1109/ICCV48922.2021.00062
10.1007/s11263-015-0816-y
10.1109/TPAMI.2015.2389824
10.1109/WACV48630.2021.00374
10.1109/CVPR.2015.7298594
10.1109/ICCV.2017.433
10.1109/tpami.2019.2918284
10.1109/TPAMI.2018.2844175
10.1109/CVPR.2018.00567
10.1109/ICCV48922.2021.00986
10.1201/9781420010749
10.1109/ICCV48922.2021.00009
10.1109/ICCV48922.2021.00061
10.1109/TCSVT.2019.2920407
10.1016/j.neucom.2016.12.038
10.1109/CVPRW.2015.7301274
10.1007/978-3-030-01264-9_8
10.1109/iccv48922.2021.00147
10.1109/TPAMI.2019.2938758
10.1109/CVPR.2017.660
10.1109/CVPR.2017.634
10.1109/ICCVW54120.2021.00210
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023
DBID 97E
RIA
RIE
AAYXX
CITATION
7SC
7SP
8FD
JQ2
L7M
L~C
L~D
7X8
DOI 10.1109/TPAMI.2022.3202765
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
MEDLINE - Academic
DatabaseTitle CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
MEDLINE - Academic
DatabaseTitleList MEDLINE - Academic

Technology Research Database
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
EISSN 2160-9292
1939-3539
EndPage 12771
ExternalDocumentID 10_1109_TPAMI_2022_3202765
9870559
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China; NSFC
  grantid: 61922046
  funderid: 10.13039/501100001809
– fundername: Alibaba Research Intern Program
– fundername: New Generation of AI
  grantid: 2018AAA0100400
– fundername: Alibaba Innovative Research
– fundername: Agency for Science, Technology and Research
  funderid: 10.13039/501100001348
– fundername: AME Programmatic Funds
  grantid: A1892b0026; A19E3b0099
GroupedDBID ---
-DZ
-~X
.DC
0R~
29I
4.4
53G
5GY
6IK
97E
AAJGR
AARMG
AASAJ
AAWTH
ABAZT
ABQJQ
ABVLG
ACGFO
ACGFS
ACIWK
ACNCT
AENEX
AGQYO
AHBIQ
AKJIK
AKQYR
ALMA_UNASSIGNED_HOLDINGS
ASUFR
ATWAV
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CS3
DU5
E.L
EBS
EJD
F5P
HZ~
IEDLZ
IFIPE
IPLJI
JAVBF
LAI
M43
MS~
O9-
OCL
P2P
PQQKQ
RIA
RIE
RNS
RXW
TAE
TN5
UHB
~02
AAYXX
CITATION
7SC
7SP
8FD
JQ2
L7M
L~C
L~D
7X8
ID FETCH-LOGICAL-c328t-e9aa4156ebab60ec00dfb554c313fc6e9349090fe1ffdb9b00db01fccb3a06963
IEDL.DBID RIE
ISSN 0162-8828
1939-3539
IngestDate Fri Jul 11 02:35:05 EDT 2025
Mon Jun 30 06:22:43 EDT 2025
Thu Apr 24 23:04:15 EDT 2025
Tue Jul 01 01:43:04 EDT 2025
Wed Aug 27 02:24:54 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 11
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c328t-e9aa4156ebab60ec00dfb554c313fc6e9349090fe1ffdb9b00db01fccb3a06963
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
content type line 23
ORCID 0000-0001-6143-0264
0000-0001-8666-3435
0000-0001-5550-8758
PMID 36040936
PQID 2872440427
PQPubID 85458
PageCount 12
ParticipantIDs proquest_miscellaneous_2708259074
proquest_journals_2872440427
ieee_primary_9870559
crossref_citationtrail_10_1109_TPAMI_2022_3202765
crossref_primary_10_1109_TPAMI_2022_3202765
ProviderPackageCode CITATION
AAYXX
PublicationCentury 2000
PublicationDate 2023-11-01
PublicationDateYYYYMMDD 2023-11-01
PublicationDate_xml – month: 11
  year: 2023
  text: 2023-11-01
  day: 01
PublicationDecade 2020
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on pattern analysis and machine intelligence
PublicationTitleAbbrev TPAMI
PublicationYear 2023
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
References ref13
ref57
ref12
ref56
ref59
ref14
ref58
Glorot (ref77)
ref53
Hu (ref47) 2021
ref11
ref55
ref10
ref54
ref16
ref18
Han (ref33) 2021
Chu (ref28)
Chu (ref50) 2021
Liu (ref24) 2021
ref46
ref45
ref42
ref41
ref44
Simonyan (ref2)
ref43
ref49
ref8
ref9
ref4
ref3
ref6
ref5
ref40
Tan (ref7)
ref35
ref34
ref37
ref36
ref31
ref75
ref30
Li (ref52) 2021
ref74
ref32
ref1
ref39
Dosovitskiy (ref19)
Contributors (ref78) 2020
ref71
Jiang (ref51) 2021
ref73
Zhu (ref17) 2020
ref68
Hendrycks (ref72) 2016
ref23
ref67
ref26
Dong (ref70) 2021
ref25
ref20
ref64
Touvron (ref48)
ref63
ref22
ref66
ref21
ref65
Loshchilov (ref76)
Howard (ref38) 2017
ref27
ref29
Ba (ref69) 2016
ref60
ref62
Vaswani (ref15)
ref61
Chen (ref79) 2019
References_xml – ident: ref56
  doi: 10.1007/s11263-021-01465-9
– ident: ref65
  doi: 10.1109/CVPRW.2018.00133
– ident: ref73
  doi: 10.1109/ICCV48922.2021.00299
– ident: ref63
  doi: 10.1109/tpami.2021.3140168
– start-page: 6000
  volume-title: Proc. Adv. Neural Inform. Process. Syst.
  ident: ref15
  article-title: Attention is all you need
– ident: ref23
  doi: 10.1109/ICCV48922.2021.00675
– ident: ref42
  doi: 10.1109/CVPR.2019.00293
– ident: ref8
  doi: 10.1109/tpami.2021.3134684
– ident: ref11
  doi: 10.1007/978-3-319-10602-1_48
– ident: ref54
  doi: 10.1109/TPAMI.2017.2699184
– ident: ref61
  doi: 10.1109/ICCV.2017.31
– ident: ref39
  doi: 10.1109/CVPR.2018.00474
– ident: ref29
  doi: 10.1007/s41095-022-0274-8
– ident: ref45
  doi: 10.1109/CVPR46437.2021.00542
– ident: ref4
  doi: 10.1109/CVPR.2016.90
– start-page: 249
  volume-title: Proc. Int. Conf. Artif. Intell. Statist.
  ident: ref77
  article-title: Understanding the difficulty of training deep feedforward neural networks
– ident: ref13
  doi: 10.1109/CVPR.2016.350
– ident: ref20
  doi: 10.1109/ICCV48922.2021.01172
– ident: ref12
  doi: 10.1007/s11263-009-0275-4
– ident: ref30
  doi: 10.3115/v1/W14-3302
– ident: ref6
  doi: 10.1109/TPAMI.2019.2913372
– ident: ref35
  doi: 10.1109/CVPR.2006.68
– volume-title: Proc. Int. Conf. Learn. Represent.
  ident: ref19
  article-title: An image is worth 16x16 words: Transformers for image recognition at scale
– ident: ref43
  doi: 10.1007/s10462-020-09825-6
– ident: ref26
  doi: 10.1109/CVPR.2019.00656
– ident: ref64
  doi: 10.1109/TIP.2021.3065822
– ident: ref34
  doi: 10.1109/ICCV.2005.239
– ident: ref57
  doi: 10.1016/j.patcog.2020.107622
– ident: ref74
  doi: 10.1109/ICCV.2017.324
– year: 2021
  ident: ref70
  article-title: Attention is not all you need: Pure attention loses rank doubly exponentially with depth
– ident: ref55
  doi: 10.1007/978-3-030-00934-2_3
– year: 2020
  ident: ref17
  article-title: Deformable DETR: Deformable transformers for end-to-end object detection
– ident: ref59
  doi: 10.1007/978-3-030-01228-1_15
– ident: ref14
  doi: 10.1109/CVPR.2017.544
– ident: ref71
  doi: 10.1109/ICCV.2019.00140
– ident: ref60
  doi: 10.1016/j.ins.2020.02.067
– ident: ref67
  doi: 10.1109/CVPR.2018.00337
– ident: ref16
  doi: 10.1007/978-3-030-58452-8_13
– volume-title: Proc. Int. Conf. Learn. Represent.
  ident: ref76
  article-title: Decoupled weight decay regularization
– ident: ref32
  doi: 10.1109/ICCV48922.2021.01204
– volume-title: Proc. Int. Conf. Learn. Represent.
  ident: ref2
  article-title: Very deep convolutional networks for large-scale image recognition
– ident: ref41
  doi: 10.1109/CVPR.2018.00716
– ident: ref49
  doi: 10.1109/ICCV48922.2021.00060
– ident: ref53
  doi: 10.1109/ICCV48922.2021.00062
– year: 2016
  ident: ref72
  article-title: Gaussian error linear units (GELUs)
– ident: ref10
  doi: 10.1007/s11263-015-0816-y
– ident: ref36
  doi: 10.1109/TPAMI.2015.2389824
– ident: ref46
  doi: 10.1109/WACV48630.2021.00374
– ident: ref3
  doi: 10.1109/CVPR.2015.7298594
– ident: ref68
  doi: 10.1109/ICCV.2017.433
– ident: ref5
  doi: 10.1109/tpami.2019.2918284
– year: 2021
  ident: ref52
  article-title: LocalViT: Bringing locality to vision transformers
– ident: ref75
  doi: 10.1109/TPAMI.2018.2844175
– year: 2020
  ident: ref78
  article-title: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark
– year: 2021
  ident: ref24
  article-title: Transformer in convolutional neural networks
– ident: ref62
  doi: 10.1109/CVPR.2018.00567
– start-page: 6105
  volume-title: Proc. Int. Conf. Mach. Learn.
  ident: ref7
  article-title: EfficientNet: Rethinking model scaling for convolutional neural networks
– ident: ref22
  doi: 10.1109/ICCV48922.2021.00986
– year: 2021
  ident: ref50
  article-title: Conditional positional encodings for vision transformers
– ident: ref1
  doi: 10.1201/9781420010749
– year: 2021
  ident: ref51
  article-title: Token labeling: Training a 85.5% top-1 accuracy vision transformer with 56M parameters on ImageNet
– ident: ref31
  doi: 10.1109/ICCV48922.2021.00009
– year: 2021
  ident: ref33
  article-title: Demystifying local vision transformer: Sparse connectivity, weight sharing, and dynamic weight
– start-page: 10347
  volume-title: Proc. Int. Conf. Mach. Learn.
  ident: ref48
  article-title: Training data-efficient image transformers & distillation through attention
– ident: ref21
  doi: 10.1109/ICCV48922.2021.00061
– ident: ref66
  doi: 10.1109/TCSVT.2019.2920407
– ident: ref44
  doi: 10.1016/j.neucom.2016.12.038
– ident: ref58
  doi: 10.1109/CVPRW.2015.7301274
– ident: ref40
  doi: 10.1007/978-3-030-01264-9_8
– ident: ref18
  doi: 10.1109/iccv48922.2021.00147
– ident: ref9
  doi: 10.1109/TPAMI.2019.2938758
– year: 2019
  ident: ref79
  article-title: MMDetection: Open MMLab detection toolbox and benchmark
– ident: ref37
  doi: 10.1109/CVPR.2017.660
– year: 2017
  ident: ref38
  article-title: MobileNets: Efficient convolutional neural networks for mobile vision applications
– start-page: 9355
  volume-title: Proc. Adv. Neural Inform. Process. Syst.
  ident: ref28
  article-title: Twins: Revisiting the design of spatial attention in vision transformers
– year: 2016
  ident: ref69
  article-title: Layer normalization
– ident: ref27
  doi: 10.1109/CVPR.2017.634
– year: 2021
  ident: ref47
  article-title: ISTR: End-to-end instance segmentation with transformers
– ident: ref25
  doi: 10.1109/ICCVW54120.2021.00210
SSID ssj0014503
Score 2.6993296
Snippet Recently, the vision transformer has achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in...
SourceID proquest
crossref
ieee
SourceType Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 12760
SubjectTerms backbone network
Computational modeling
Computer networks
Convolution
efficient self-attention
Feature extraction
Image classification
Image segmentation
Network design
Object recognition
pyramid pooling
Scene analysis
scene understanding
Semantic segmentation
Semantics
Task analysis
Transformer
Transformers
Title P2T: Pyramid Pooling Transformer for Scene Understanding
URI https://ieeexplore.ieee.org/document/9870559
https://www.proquest.com/docview/2872440427
https://www.proquest.com/docview/2708259074
Volume 45
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE