An effective CNN and Transformer complementary network for medical image segmentation

Bibliographic Details
Published in: Pattern Recognition, Vol. 136, p. 109228
Main Authors: Yuan, Feiniu; Zhang, Zhengxiao; Fang, Zhijun
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.04.2023
Summary:
•We design dual encoding paths with CNN and Transformer encoders to produce complementary features.
•We propose an effective feature complementary module that cross-wisely fuses features from the CNN and Transformer domains.
•We compute the cross-domain correlation between CNN and Transformer features, and apply channel attention to the self-attention features of Transformers to extract dual attention.
•We design a Swin Transformer decoder with multi-level skip connections between the features of the Transformer decoder and the complementary features to jointly extract contextual information and long-range dependencies.

The Transformer network was originally proposed for natural language processing. Due to its powerful ability to represent long-range dependencies, it has been extended to vision tasks in recent years. To fully exploit the advantages of both Transformers and Convolutional Neural Networks (CNNs), we propose a CNN and Transformer Complementary Network (CTCNet) for medical image segmentation. We first design two encoders, built from Swin Transformers and residual CNNs, to produce complementary features in the Transformer and CNN domains, respectively. We then cross-wisely concatenate these complementary features in a Cross-domain Fusion Block (CFB) that blends them effectively. In addition, we compute the correlation between features from the CNN and Transformer domains, and apply channel attention to the Transformer self-attention features to capture dual attention information. Cross-domain fusion, feature correlation, and dual attention are combined into a Feature Complementary Module (FCM) that improves the representation ability of the features. Finally, we design a Swin Transformer decoder to further strengthen the modeling of long-range dependencies, and use skip connections between the Transformer-decoded features and the complementary features to extract spatial details, contextual semantics, and long-range information. Skip connections are applied at multiple levels to enhance multi-scale invariance. Experimental results show that CTCNet significantly surpasses state-of-the-art segmentation models based on CNNs, Transformers, and even combined Transformer-CNN models designed for medical image segmentation. It achieves superior performance on different medical applications, including multi-organ segmentation and cardiac segmentation.
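The core mechanism described in the abstract, cross-wise fusion of CNN and Transformer features followed by dual (correlation and channel) attention, can be illustrated with a short PyTorch sketch. This is a minimal reading of the abstract, not the authors' implementation: the class names (CrossDomainFusionBlock, FeatureComplementaryModule), the SE-style channel attention, the use of cosine similarity as the correlation term, and all layer sizes are assumptions.

```python
# Hedged sketch of an FCM-style fusion of CNN and Transformer features.
# Everything here is inferred from the abstract; names and layer choices
# are hypothetical, not taken from the CTCNet code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossDomainFusionBlock(nn.Module):
    """Blend same-resolution CNN and Transformer features cross-wisely."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convs applied after cross-wise concatenation (assumed design)
        self.fuse_cnn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.fuse_trans = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, f_cnn, f_trans):
        # Cross-wise concatenation: each branch sees the other domain first.
        c = self.fuse_cnn(torch.cat([f_trans, f_cnn], dim=1))
        t = self.fuse_trans(torch.cat([f_cnn, f_trans], dim=1))
        return c, t

class FeatureComplementaryModule(nn.Module):
    """Cross-domain fusion + correlation + channel attention (assumed)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.cfb = CrossDomainFusionBlock(channels)
        # Squeeze-and-excitation style channel attention on the
        # Transformer branch (one plausible form of "channel attention
        # on the self-attention features").
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())
        self.out = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_cnn, f_trans):
        c, t = self.cfb(f_cnn, f_trans)
        # Cross-domain correlation: channel-wise cosine similarity map,
        # used here as a spatial gate (an assumed reading of "correlation").
        corr = torch.sigmoid(
            F.cosine_similarity(c, t, dim=1, eps=1e-6)).unsqueeze(1)
        t = t * self.channel_att(t)   # dual attention: channel part
        c = c * corr                  # dual attention: correlation part
        return self.out(torch.cat([c, t], dim=1))

# Usage: fuse same-resolution stage outputs of the two encoders.
fcm = FeatureComplementaryModule(channels=96)
f_cnn = torch.randn(1, 96, 56, 56)    # residual-CNN stage output
f_trans = torch.randn(1, 96, 56, 56)  # Swin stage output, reshaped to NCHW
fused = fcm(f_cnn, f_trans)           # -> (1, 96, 56, 56)
```

In the paper's design, the fused output of each such module would feed the matching level of the Swin Transformer decoder through a skip connection, so each decoder stage sees both decoded long-range context and the complementary encoder features.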
ISSN: 0031-3203
eISSN: 1873-5142
DOI: 10.1016/j.patcog.2022.109228