Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
Published in | IEEE Journal of Biomedical and Health Informatics, Vol. 26, No. 12, pp. 6070-6080 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.12.2022 |
Summary | Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a BERT-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (diagnosis classification, medical image-report retrieval, medical visual question answering) and a vision-language generation task (radiology report generation). By statistically and rigorously evaluating the proposed model on four downstream tasks with three radiographic image-report datasets (MIMIC-CXR, Open-I, and VQA-RAD), we empirically demonstrate the superior downstream-task performance of MedViLL against various baselines, including task-specific architectures. |
ISSN | 2168-2194; 2168-2208 |
DOI | 10.1109/JBHI.2022.3207502 |
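
The summary above centers on MedViLL's multi-modal attention masking scheme, which lets a single BERT-style model serve both bidirectional understanding tasks and autoregressive report generation. Below is a minimal, hypothetical sketch of such a mask in PyTorch, in the spirit of UNiLM-style masking over a concatenated image-text sequence; the function name, sequence layout, and mode names are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a dual-purpose multi-modal attention mask.
# Assumed sequence layout: [visual tokens | text tokens].
import torch

def build_attention_mask(num_visual: int, num_text: int, mode: str) -> torch.Tensor:
    """Return a boolean (L, L) mask, L = num_visual + num_text; True means 'may attend'."""
    L = num_visual + num_text
    if mode == "bidirectional":
        # Understanding tasks (classification, retrieval, VQA):
        # every token attends to every other token.
        return torch.ones(L, L, dtype=torch.bool)
    if mode == "seq2seq":
        # Generation task (report generation): visual tokens attend only to
        # visual tokens; each text token attends to all visual tokens and to
        # text tokens up to and including itself.
        mask = torch.zeros(L, L, dtype=torch.bool)
        mask[:num_visual, :num_visual] = True                 # image -> image
        mask[num_visual:, :num_visual] = True                 # text  -> image
        mask[num_visual:, num_visual:] = torch.tril(          # text  -> earlier text
            torch.ones(num_text, num_text, dtype=torch.bool)
        )
        return mask
    raise ValueError(f"unknown mode: {mode!r}")

# Example: 3 visual patches and 4 report tokens.
print(build_attention_mask(3, 4, "seq2seq").int())
```

In this reading, switching the mask (rather than the architecture) is what allows the same pre-trained weights to be fine-tuned for both understanding and generation tasks.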