Comparison of metrics for the evaluation of medical segmentations using prostate MRI dataset

Bibliographic Details
Published in	Computers in biology and medicine Vol. 134; p. 104497
Main Authors	Nai, Ying-Hwey; Teo, Bernice W.; Tan, Nadya L.; O'Doherty, Sophie; Stephenson, Mary C.; Thian, Yee Liang; Chiong, Edmund; Reilhac, Anthonin
Format	Journal Article
Language	English
Published	Oxford: Elsevier Ltd, 01.07.2021
Subjects	Automation; Correlation; Deep learning; Evaluation metrics; Image processing; Image segmentation; Internal Medicine; Magnetic resonance imaging; Medical image segmentation; Normalizing; Other; Performance evaluation; Prostate; Prostate cancer; Rank evaluation; Ranking; Visual discrimination
ISSN	0010-4825, 1879-0534
DOI	10.1016/j.compbiomed.2021.104497

More Information
Summary:	Nine previously proposed segmentation evaluation metrics, which target medical relevance, account for holes and added regions, or differentiate over- and under-segmentation, were compared with 24 traditional metrics to identify those that better capture the requirements of clinical segmentation evaluation. Evaluation was first performed on 2D synthetic shapes with known ground truths (GTs) and machine segmentations (MSs) to highlight the features and pitfalls of the metrics. Clinical evaluation was then performed on publicly available prostate images of 20 subjects, with MSs generated by 3 deep learning networks (DenseVNet, HighRes3DNet, and ScaleNet) and GTs drawn by 2 readers. The same readers also performed 2D visual assessment of the MSs using a dual negative-positive grading of −5 to 5 to reflect over- and under-estimation. Nine metrics that correlated well with visual assessment were selected for further evaluation using 3 network ranking methods: ranking based on a single metric, normalizing the metric using 2 GTs, and ranking the networks by a metric and then averaging, including leave-one-out evaluation. These metrics yielded a consistent ranking under all ranking methods, with HighRes3DNet first, followed by DenseVNet and ScaleNet. Relative volume difference yielded the best positivity-agreement and correlation with the dual visual assessment, and is thus better suited to indicating over- and under-estimation. Interclass Correlation yielded the strongest correlation with the absolute visual assessment (0–5). Symmetric-boundary Dice consistently yielded good discrimination of the networks for all three ranking methods, with relatively small variation within each network. Good rank discrimination may be an additional metric feature required for better network performance evaluation.
• Interclass correlation correlated best with visual assessment among the 33 metrics.
• Relative volume difference has better plus-minus-sign agreement with visual grading.
• Newly-proposed metrics did not outperform traditional metrics on the whole.
• Metrics with good rank discrimination are better for network performance evaluation.
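
The summary contrasts a signed metric (relative volume difference) with overlap metrics such as Dice. The sketch below illustrates that contrast on toy binary masks. It uses the commonly cited definitions, RVD = (|MS| − |GT|) / |GT| and Dice = 2|MS ∩ GT| / (|MS| + |GT|); whether the paper uses exactly this sign convention and normalization is an assumption, and the function names and toy shapes are hypothetical.

```python
import numpy as np


def relative_volume_difference(ms, gt):
    """Signed volume difference of a machine segmentation (MS) relative to
    the ground truth (GT). Positive means over-segmentation, negative means
    under-segmentation (assumed sign convention)."""
    ms = np.asarray(ms, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    gt_volume = int(gt.sum())
    if gt_volume == 0:
        raise ValueError("ground truth mask is empty")
    return (int(ms.sum()) - gt_volume) / gt_volume


def dice(ms, gt):
    """Dice overlap between two binary masks. The value is unsigned, so it
    cannot tell over-segmentation from under-segmentation."""
    ms = np.asarray(ms, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    total = int(ms.sum()) + int(gt.sum())
    if total == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * int(np.logical_and(ms, gt).sum()) / total


# Toy 2D shapes (hypothetical, not data from the paper).
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True          # 4 x 4 square, 16 pixels

over = np.zeros((8, 8), dtype=bool)
over[1:7, 1:7] = True        # 6 x 6 square enclosing the GT, 36 pixels

under = np.zeros((8, 8), dtype=bool)
under[3:5, 3:5] = True       # 2 x 2 square inside the GT, 4 pixels

for name, ms in [("over", over), ("under", under)]:
    print(name, relative_volume_difference(ms, gt), round(dice(ms, gt), 3))
```

In this toy case the sign of the relative volume difference separates the two error types (1.25 for the enlarged mask, −0.75 for the shrunken one), while Dice only reports that both overlap imperfectly.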