Context and Spatial Feature Calibration for Real-Time Semantic Segmentation

Bibliographic Details
Published in	IEEE Transactions on Image Processing, Vol. 32, pp. 5465-5477
Main Authors	Li, Kaige; Geng, Qichuan; Wan, Maoxian; Cao, Xiaochun; Zhou, Zhong
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023
Subjects	Aggregates; Calibration; Context; context feature calibration; Context modeling; Misalignment; Modules; Pixels; Real time; Real-time semantic segmentation; Real-time systems; Sampling; Semantic segmentation; Semantics; spatial feature calibration; Transformers

More Information
Summary:	Context modeling and multi-level feature fusion methods have proven effective in improving semantic segmentation performance. However, they are not designed to handle pixel-context mismatch and spatial feature misalignment, and their high computational complexity hinders widespread use in real-time scenarios. In this work, we propose a lightweight Context and Spatial Feature Calibration Network (CSFCN) that addresses these issues with pooling-based and sampling-based attention mechanisms. CSFCN contains two core modules: the Context Feature Calibration (CFC) module and the Spatial Feature Calibration (SFC) module. CFC adopts a cascaded pyramid pooling module to efficiently capture nested contexts, and then aggregates private contexts for each pixel based on pixel-context similarity to realize context feature calibration. SFC splits features into multiple groups of sub-features along the channel dimension and propagates sub-features within each group through learnable sampling to achieve spatial feature calibration. Extensive experiments on the Cityscapes and CamVid datasets show that our method achieves a state-of-the-art trade-off between speed and accuracy. Concretely, our method achieves 78.7% mIoU at 70.0 FPS on the Cityscapes test set and 77.8% mIoU at 179.2 FPS on the CamVid test set. The code is available at https://nave.vr3i.com/ and https://github.com/kaigelee/CSFCN.
ISSN:	1057-7149 1941-0042
DOI:	10.1109/TIP.2023.3318967
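
The context calibration idea described in the summary (pool nested contexts, then give each pixel its own similarity-weighted mix of them) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the grid sizes (4, 2, 1), the plain dot-product similarity, the absence of learned projections, and all function names are assumptions chosen only for illustration. The input is assumed to be a square grid of feature vectors whose side is divisible by the first grid size.

```python
import math

def avg_pool(grid, out_size):
    """Average-pool a square grid of C-dim vectors down to out_size x out_size."""
    n = len(grid)
    channels = len(grid[0][0])
    step = n // out_size  # assumption: n is divisible by out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            cell = [grid[y][x]
                    for y in range(i * step, (i + 1) * step)
                    for x in range(j * step, (j + 1) * step)]
            row.append([sum(v[c] for v in cell) / len(cell)
                        for c in range(channels)])
        pooled.append(row)
    return pooled

def cascaded_contexts(features, sizes=(4, 2, 1)):
    """Pool level by level (each level is pooled from the previous one)
    and return all context vectors from every level as one flat list."""
    contexts, level = [], features
    for size in sizes:
        level = avg_pool(level, size)
        contexts.extend(vec for row in level for vec in row)
    return contexts

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def calibrate_context(features, sizes=(4, 2, 1)):
    """For each pixel, aggregate the pooled contexts with weights given by
    the softmax of pixel-context dot-product similarity (an assumed choice)."""
    contexts = cascaded_contexts(features, sizes)
    channels = len(contexts[0])
    calibrated = []
    for row in features:
        out_row = []
        for pixel in row:
            scores = [sum(p * c for p, c in zip(pixel, ctx)) for ctx in contexts]
            weights = softmax(scores)
            out_row.append([sum(w * ctx[c] for w, ctx in zip(weights, contexts))
                            for c in range(channels)])
        calibrated.append(out_row)
    return calibrated

# Toy 8x8 map with 2 channels: left half is "class A", right half is "class B".
features = [[[1.0, 0.0] if x < 4 else [0.0, 1.0] for x in range(8)]
            for y in range(8)]
calibrated = calibrate_context(features)
```

With grid sizes (4, 2, 1) the sketch produces 16 + 4 + 1 = 21 context vectors. Each pixel in the toy map draws more weight from contexts pooled over its own half, which is the "private context" behaviour the summary describes; a real network would learn the similarity and operate on tensors rather than nested lists.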