An effective CNN and Transformer complementary network for medical image segmentation
Published in | Pattern Recognition, Vol. 136, p. 109228
---|---
Main Authors | , ,
Format | Journal Article
Language | English
Published | Elsevier Ltd, 01.04.2023
Summary:

• We design dual encoding paths of CNN and Transformer encoders for producing complementary features.
• We propose an effective feature complementary module that cross-wisely fuses features from the CNN and Transformer domains.
• We propose to compute the cross-domain correlation between CNN and Transformer features, and channel attention on the self-attention features of Transformers, to extract dual attention.
• We design a Swin Transformer decoder with multi-level skip connections between the features of the Transformer decoder and the complementary features for jointly extracting contextual and long-range dependencies.
The Transformer network was originally proposed for natural language processing. Due to its powerful ability to represent long-range dependencies, it has been extended to vision tasks in recent years. To fully exploit the advantages of Transformers and Convolutional Neural Networks (CNNs), we propose a CNN and Transformer Complementary Network (CTCNet) for medical image segmentation. We first design two encoders, built on Swin Transformers and residual CNNs, to produce complementary features in the Transformer and CNN domains, respectively. Then we cross-wisely concatenate these complementary features and propose a Cross-domain Fusion Block (CFB) to blend them effectively. In addition, we compute the correlation between features from the CNN and Transformer domains, and apply channel attention to the Transformer self-attention features, to capture dual attention information. We combine cross-domain fusion, feature correlation and dual attention in a Feature Complementary Module (FCM) that improves the representation ability of features. Finally, we design a Swin Transformer decoder to further strengthen the modeling of long-range dependencies, and use skip connections between the Transformer decoded features and the complementary features to extract spatial details, contextual semantics and long-range information. Skip connections are applied at different levels to enhance multi-scale invariance. Experimental results show that our CTCNet significantly surpasses state-of-the-art segmentation models based on CNNs, on Transformers, and even on combined Transformer-CNN designs for medical image segmentation. It achieves superior performance on different medical applications, including multi-organ segmentation and cardiac segmentation.
ISSN | 0031-3203, 1873-5142
DOI | 10.1016/j.patcog.2022.109228
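The record does not include code, but the fusion scheme the abstract describes (cross-wise concatenation of CNN and Transformer features, cross-domain correlation, and channel attention on the Transformer features) can be illustrated with a minimal PyTorch sketch. The module and variable names below are assumptions, and the exact forms of the correlation and attention are guesses at the general idea, not the paper's actual CFB/FCM implementation.

```python
# Illustrative sketch (not the authors' code) of cross-domain fusion with
# dual attention, as described in the CTCNet abstract. All names and the
# exact correlation/attention forms are assumptions.
import torch
import torch.nn as nn

class CrossDomainFusionSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Blend the concatenated CNN/Transformer feature maps back to
        # the original channel count.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Squeeze-and-excitation-style channel attention applied to the
        # Transformer self-attention features.
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_cnn: torch.Tensor, f_trans: torch.Tensor) -> torch.Tensor:
        # Concatenate features from both domains and blend them.
        fused = self.fuse(torch.cat([f_cnn, f_trans], dim=1))
        # Cross-domain correlation as an element-wise similarity map
        # between the two branches (a simplification).
        corr = torch.sigmoid(f_cnn * f_trans)
        # Dual attention: reweight the fused features by the correlation
        # map and by channel attention on the Transformer branch.
        return fused * corr + fused * self.channel_attn(f_trans)

if __name__ == "__main__":
    m = CrossDomainFusionSketch(channels=64)
    f_cnn = torch.randn(1, 64, 56, 56)    # CNN-branch feature map
    f_trans = torch.randn(1, 64, 56, 56)  # Transformer-branch feature map
    print(m(f_cnn, f_trans).shape)        # torch.Size([1, 64, 56, 56])
```

In the full architecture, one such module would sit at each encoder level, with its output fed to the Swin Transformer decoder through the multi-level skip connections mentioned in the abstract.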