BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis

Bibliographic Details
Published in	2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 3327-3336
Main Authors	Monajatipoor, Masoud; Rouhsedaghat, Mozhdeh; Li, Liunian Harold; Chien, Aichi; Kuo, C.-C. Jay; Scalzo, Fabien; Chang, Kai-Wei
Format	Conference Proceeding
Language	English
Published	IEEE 01.10.2021
Subjects	Computer vision; Conferences; Data models; Knowledge discovery; Medical diagnosis; Transformers; Visualization
Summary:	Vision-and-language (V&L) models take image and text as input and learn to capture the associations between them. Prior studies show that pre-trained V&L models can significantly improve performance on downstream tasks such as Visual Question Answering (VQA). However, V&L models are less effective when applied in the medical domain (e.g., on X-ray images and clinical notes) due to the domain gap. In this paper, we investigate the challenges of applying pre-trained V&L models in medical applications. In particular, we identify that the visual representation in general V&L models is not suitable for processing medical data. To overcome this limitation, we propose BERTHop, a transformer-based model built on PixelHop++ and VisualBERT, for better capturing the associations between the two modalities. Experiments on the OpenI dataset, a commonly used thoracic disease diagnosis benchmark, show that BERTHop achieves an average Area Under the Curve (AUC) of 98.12%, which is 1.62% higher than the state of the art (SOTA), while being trained on a 9x smaller dataset.
ISSN:	2473-9944
DOI:	10.1109/ICCVW54120.2021.00372