Does your dermatology classifier know what it doesn’t know? Detecting the long-tail of unseen conditions

Bibliographic Details
Published in: Medical Image Analysis, Vol. 75, p. 102274
Main Authors: Guha Roy, Abhijit, Ren, Jie, Azizi, Shekoofeh, Loh, Aaron, Natarajan, Vivek, Mustafa, Basil, Pawlowski, Nick, Freyberg, Jan, Liu, Yuan, Beaver, Zach, Vo, Nam, Bui, Peggy, Winter, Samantha, MacWilliams, Patricia, Corrado, Greg S., Telang, Umesh, Liu, Yun, Cemgil, Taylan, Karthikesalingam, Alan, Lakshminarayanan, Balaji, Winkens, Jim
Format: Journal Article
Language: English
Published: Netherlands: Elsevier B.V., 01.01.2022

Summary:
• We propose a novel hierarchical outlier detection (HOD) loss and show that it outperforms existing outlier-exposure-based techniques for detecting OOD inputs.
• We introduce a near-OOD benchmarking framework and the key design choices needed for proper validation of OOD detection algorithms.
• We demonstrate the added utility of the novel HOD loss in the context of multiple state-of-the-art representation learning methods (self-supervised contrastive pre-training based SimCLR and MICLe). We also show OOD detection performance gains on large-scale standard benchmarks (ImageNet and a BiT model pre-trained on the large-scale JFT dataset).
• We propose to use a diverse ensemble with different representation learning methods and objectives for improved OOD detection performance. We demonstrate its superiority over vanilla ensembles and perform an analysis investigating how diversity aids OOD detection.
• We propose a cost-weighted evaluation metric for model trust analysis that incorporates downstream clinical implications to aid assessment of real-world impact.

Supervised deep learning models have proven highly effective in the classification of dermatological conditions. These models rely on the availability of abundant labeled training examples. However, in the real world, many dermatological conditions are individually too infrequent for per-condition classification with supervised learning. Although individually infrequent, these conditions may collectively be common and are therefore clinically significant in aggregate. To prevent models from generating erroneous outputs on such examples, there remains a considerable unmet need for deep learning systems that can better detect these infrequent 'outlier' conditions, which are seen very rarely (or not at all) during training. In this paper, we frame this task as an out-of-distribution (OOD) detection problem. We set up a benchmark ensuring that outlier conditions are disjoint between the model training, validation, and test sets. Unlike traditional OOD detection benchmarks, where the task is to detect dataset distribution shift, we aim at the more challenging task of detecting subtle differences resulting from a different pathology or condition. We propose a novel hierarchical outlier detection (HOD) loss, which assigns multiple abstention classes, one per training outlier class, and jointly performs a coarse classification of inliers vs. outliers along with a fine-grained classification of the individual classes. We demonstrate that the proposed HOD-loss-based approach outperforms leading methods that leverage outlier data during training. Performance is further boosted by using recent representation learning methods (BiT, SimCLR, MICLe). We also explore ensembling strategies for OOD detection and propose a diverse ensemble selection process for the best results. We perform a subgroup analysis over conditions of varying risk levels and different skin types to investigate how OOD performance changes over each subgroup, and demonstrate the gains of our framework in comparison to the baseline. Furthermore, we go beyond traditional performance metrics and introduce a cost matrix for model trust analysis to approximate downstream clinical impact. We use this cost matrix to compare the proposed method against the baseline, making a stronger case for its effectiveness in real-world scenarios.
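The HOD loss described in the abstract lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch illustration of the idea: the classifier head emits logits for the inlier classes plus one abstention class per training outlier class; a fine-grained cross-entropy over all classes is combined with a coarse inlier-vs-outlier term obtained by summing probability mass, and at test time the OOD score is the total probability assigned to the abstention classes. The function names, the coarse_weight hyperparameter, and the exact weighting are assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F


def hod_loss(logits: torch.Tensor,
             labels: torch.Tensor,
             num_inlier: int,
             coarse_weight: float = 0.1) -> torch.Tensor:
    """Hierarchical outlier detection (HOD) loss sketch.

    `logits` has shape (batch, num_inlier + num_outlier): one column per
    inlier class plus one abstention column per training outlier class.
    `labels` are indices into that combined class space.
    """
    # Fine-grained term: standard cross-entropy over all inlier and
    # abstention classes jointly.
    fine_loss = F.cross_entropy(logits, labels)

    # Coarse term: collapse probability mass into inlier vs. outlier and
    # apply cross-entropy on the two-way split.
    probs = logits.softmax(dim=-1)
    p_inlier = probs[:, :num_inlier].sum(dim=-1)
    p_outlier = probs[:, num_inlier:].sum(dim=-1)
    log_coarse = torch.log(torch.stack([p_inlier, p_outlier], dim=-1) + 1e-12)
    coarse_labels = (labels >= num_inlier).long()  # 0 = inlier, 1 = outlier
    coarse_loss = F.nll_loss(log_coarse, coarse_labels)

    return fine_loss + coarse_weight * coarse_loss


def ood_score(logits: torch.Tensor, num_inlier: int) -> torch.Tensor:
    """OOD score: total probability mass on the abstention classes."""
    return logits.softmax(dim=-1)[:, num_inlier:].sum(dim=-1)
```

At inference time, thresholding ood_score flags likely unseen conditions, while the inlier columns still provide the fine-grained diagnosis for samples deemed in-distribution.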
ISSN: 1361-8415, 1361-8423
DOI: 10.1016/j.media.2021.102274