SeMask-Mask2Former: A Semantic Segmentation Model for High Resolution Remote Sensing Images

Bibliographic Details
Published in	2023 IEEE Aerospace Conference, pp. 1-6
Main Authors	Qiao, Yicheng; Liu, Wei; Liang, Bin; Wang, Pengyun; Zhang, Haopeng; Yang, Junli
Format	Conference Proceeding
Language	English
Published	IEEE 04.03.2023
Subjects	Context modeling; Convolutional neural networks; Decoding; Feature extraction; Image resolution; Optimization; Predictive models; Remote sensing; Semantic segmentation; Transformers
Summary:	With the development of remote sensing, semantic segmentation of high-resolution remote sensing images (RSIs) is increasingly essential. At the same time, the characteristics of objects in RSIs, such as large size, variation in object scales, and complex details, make it necessary to capture both long-range context and local information. Methods such as Fully Convolutional Networks (FCN) and the Pyramid Scene Parsing Network (PSPNet) lack the ability to capture long-range dependencies because of the limited receptive field of Convolutional Neural Networks (CNNs). In contrast, the self-attention mechanism in Transformer models, which captures the correlation between pixels, is remarkably capable of capturing long-range context. One of the most outstanding Transformer models is the Masked-attention Mask Transformer (Mask2Former), which adopts the mask classification method. We propose SeMask-Mask2Former, a model with boundary loss in which Semantically Masked (SeMask) is the backbone and Mask2Former is the decoder. Concretely, mask classification, which generates one or more masks per category to perform fine-grained segmentation, is especially suitable for handling the large intra-class and small inter-class variance characteristic of RSIs. Extensive experimental results show that SeMask-Mask2Former obtains better results in semantic segmentation of high-resolution RSIs on the ISPRS Potsdam dataset than CNN-based methods and other state-of-the-art Transformer-based methods. Extensive ablation studies conducted on the Potsdam dataset verify the contribution of each component and optimization strategy in SeMask-Mask2Former.
DOI:	10.1109/AERO55745.2023.10115761
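
Note on mask classification: the summary describes a decoder that predicts one or more masks per category instead of a single label per pixel. The sketch below illustrates how such per-query outputs are commonly merged into a semantic label map in mask-classification models. It is not taken from the paper: the function name semantic_inference, the array shapes, and the toy inputs are assumptions made for illustration, and NumPy stands in for whatever framework the authors used.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def semantic_inference(class_logits, mask_logits):
    """Merge per-query predictions into one label per pixel (hypothetical helper).

    class_logits: (Q, C + 1) array; the last column is assumed to be "no object".
    mask_logits:  (Q, H, W) array; one binary mask logit map per query.
    Returns an (H, W) integer array with labels in [0, C).
    """
    # Class probabilities per query, dropping the "no object" column.
    class_probs = softmax(class_logits, axis=-1)[:, :-1]  # (Q, C)
    # Per-pixel mask membership probability for each query.
    mask_probs = sigmoid(mask_logits)  # (Q, H, W)
    # Sum over queries: several queries may vote for the same category.
    scores = np.einsum("qc,qhw->chw", class_probs, mask_probs)  # (C, H, W)
    return scores.argmax(axis=0)


# Toy input: queries 0 and 1 both predict class 0 but cover different columns,
# query 2 predicts class 1 and covers the two middle columns.
class_logits = np.array([[10.0, 0.0, 0.0],
                         [10.0, 0.0, 0.0],
                         [0.0, 10.0, 0.0]])
mask_logits = np.full((3, 2, 4), -10.0)
mask_logits[0, :, 0] = 10.0
mask_logits[1, :, 3] = 10.0
mask_logits[2, :, 1:3] = 10.0

labels = semantic_inference(class_logits, mask_logits)
print(labels)
```

In the toy input, two separate queries contribute masks for the same class, which is the "one or even more masks for specific categories" behaviour the summary refers to. The boundary loss and the SeMask backbone are not modelled here.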