Lightweight Real-Time Semantic Segmentation Network With Efficient Transformer and CNN

Bibliographic Details
Published in	IEEE transactions on intelligent transportation systems Vol. 24; no. 12; pp. 15897 - 15906
Main Authors	Xu, Guoan, Li, Juncheng, Gao, Guangwei, Lu, Huimin, Yang, Jian, Yue, Dong
Format	Journal Article
Language	English
Published	New York: IEEE, 01.12.2023; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Artificial neural networks; Computational modeling; Computer vision; Convolution; convolutional neural network; Convolutional neural networks; Datasets; Image processing; Lightweight; lightweight network; Modules; Real time; Real-time semantic segmentation; Semantic segmentation; Semantics; Source code; Task analysis; transformer; Transformers
Summary:	Over the past decade, convolutional neural networks (CNNs) have been prominent in semantic segmentation. Although CNN models perform impressively, their ability to capture global representations remains limited, which leads to suboptimal results. Transformers have achieved great success in NLP tasks, demonstrating their advantage in modeling long-range dependencies. They have also attracted considerable attention from computer vision researchers, who reformulate image processing tasks as sequence-to-sequence prediction, but at the cost of degraded local feature details. In this work, we propose a lightweight real-time semantic segmentation network called LETNet. LETNet combines a U-shaped CNN with a Transformer in a capsule embedding style so that each compensates for the other's deficiencies. In addition, the carefully designed Lightweight Dilated Bottleneck (LDB) module and Feature Enhancement (FE) module both benefit training from scratch. Extensive experiments on challenging datasets demonstrate that LETNet achieves a superior balance between accuracy and efficiency. Specifically, it contains only 0.95M parameters and 13.6G FLOPs, yet yields 72.8% mIoU at 120 FPS on the Cityscapes test set and 70.5% mIoU at 250 FPS on the CamVid test set using a single RTX 3090 GPU. Source code will be available at https://github.com/IVIPLab/LETNet.
ISSN:	1524-9050, 1558-0016
DOI:	10.1109/TITS.2023.3248089
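
Note on the mIoU figures in the summary: mean intersection over union is conventionally computed per class as TP / (TP + FP + FN) and then averaged over classes. The sketch below illustrates that standard definition on a made-up six-pixel example. It is not taken from the LETNet code; the function names and toy labels are hypothetical, and benchmark details such as ignored labels are left out.

```python
from typing import List, Sequence


def confusion_matrix(truth: Sequence[int], pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count (true class, predicted class) pairs over flattened pixel labels."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix: List[List[int]]) -> float:
    """Average per-class IoU = TP / (TP + FP + FN), skipping absent classes."""
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                       # true class c, predicted otherwise
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:  # class appears in neither truth nor prediction: skip it
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes (made-up labels, not from the paper).
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(mean_iou(confusion_matrix(truth, pred, 3)))
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5. A reported value such as 72.8% is the same average taken over all pixels of a benchmark's test set.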