Delving deep into spatial pooling for squeeze-and-excitation networks

Bibliographic Details
Published in Pattern Recognition, Vol. 121; p. 108159
Main Authors Jin, Xin; Xie, Yanping; Wei, Xiu-Shen; Zhao, Bo-Rui; Chen, Zhao-Min; Tan, Xiaoyang
Format Journal Article
Language English
Published Elsevier Ltd 01.01.2022
Summary:
• We revisit the squeeze operation in SENets and shed light on why and how to embed rich (both global and local) spatial information into the excitation module to improve accuracy.
• We propose an integrated two-stage spatial pooling method with two efficient implementation approaches for rich descriptor extraction.
• We conduct extensive experiments to verify convincing improvements over SENets and their extensions on various fundamental computer vision tasks.

Squeeze-and-Excitation (SE) blocks have demonstrated significant accuracy gains for state-of-the-art deep architectures by re-weighting channel-wise feature responses. The SE block is an architectural unit that integrates two operations: a squeeze operation that employs global average pooling to aggregate spatial convolutional features into a channel feature, and an excitation operation that learns instance-specific channel weights from the squeezed feature to re-weight each channel. In this paper, we revisit the squeeze operation in SE blocks and shed light on why and how to embed rich (both global and local) information into the excitation module at minimal extra cost. In particular, we introduce a simple but effective two-stage spatial pooling process: rich descriptor extraction and information fusion. The rich descriptor extraction step aims to obtain a set of diverse (i.e., global and especially local) deep descriptors that contain more informative cues than global average pooling provides. The fusion step then absorbs the information carried by these descriptors, which helps the excitation operation return more accurate re-weighting scores in a data-driven manner. We validate the effectiveness of our method through extensive experiments on ImageNet for image classification and on MS-COCO for object detection and instance segmentation. In these experiments, our method achieves consistent improvements over SENets on all tasks, in some cases by a large margin.
ISSN:0031-3203
1873-5142
DOI:10.1016/j.patcog.2021.108159
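
The squeeze and excitation operations described in the summary, and the idea of replacing the single global descriptor with several descriptors that are then fused, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the record does not give the paper's two implementation approaches, so the grid-based local average pooling in `rich_descriptors`, the weighted-sum `fuse` step, the bottleneck ReLU/sigmoid form of `excite`, and all parameter names (`grid`, `fusion_w`, `w1`, `w2`) are assumptions chosen only to make the two stages concrete.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def squeeze_gap(x):
    """Standard SE squeeze: global average pooling, (C, H, W) -> (C,)."""
    return x.mean(axis=(1, 2))


def rich_descriptors(x, grid=2):
    """Stage 1 (assumed form): one global descriptor plus grid*grid local ones.

    x has shape (C, H, W) with H and W divisible by `grid`.
    Returns an array of shape (C, 1 + grid * grid).
    """
    c, h, w = x.shape
    assert h % grid == 0 and w % grid == 0, "H and W must be divisible by grid"
    # Split H and W into grid cells and average inside each cell.
    cells = x.reshape(c, grid, h // grid, grid, w // grid).mean(axis=(2, 4))
    return np.concatenate([squeeze_gap(x)[:, None], cells.reshape(c, -1)], axis=1)


def fuse(desc, fusion_w):
    """Stage 2 (assumed form): weighted sum over descriptors, (C, K) -> (C,)."""
    return desc @ fusion_w


def excite(s, w1, w2):
    """Excitation: bottleneck layer with ReLU, then sigmoid gates in (0, 1)."""
    hidden = np.maximum(w1 @ s, 0.0)
    return sigmoid(w2 @ hidden)


def se_block(x, w1, w2, fusion_w=None, grid=2):
    """Re-weight channels of x; uses plain GAP when fusion_w is None."""
    if fusion_w is None:
        s = squeeze_gap(x)
    else:
        s = fuse(rich_descriptors(x, grid), fusion_w)
    return x * excite(s, w1, w2)[:, None, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 4, 4))      # C=8 channels, 4x4 feature map
    w1 = rng.standard_normal((2, 8))        # reduce 8 -> 2 (hypothetical ratio)
    w2 = rng.standard_normal((8, 2))        # expand 2 -> 8
    fusion_w = rng.standard_normal(5)       # 1 global + 4 local descriptors
    print(se_block(x, w1, w2).shape)
    print(se_block(x, w1, w2, fusion_w).shape)
```

In this sketch, setting `fusion_w` to put all weight on the first (global) descriptor reduces the block to the plain global-average-pooling squeeze, which makes the local descriptors easy to read as additional inputs to the excitation step. In a trained network the fusion and excitation weights would be learned rather than sampled at random.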