Evaluating the state of the art in missing data imputation for clinical data

Bibliographic Details
Published in	Briefings in bioinformatics Vol. 23; no. 1
Main Author	Luo, Yuan
Format	Journal Article
Language	English
Published	Oxford University Press (England), 17.01.2022; Oxford Publishing Limited (England)
Subjects	Algorithms; Animal models; Case Study; Cross-Sectional Studies; Cross-sections; Data Collection; Humans; Laboratories; Laboratory tests; Learning algorithms; Machine Learning; Mathematical models; Missing data; Models, Statistical; Patients; Performance evaluation; State-of-the-art reviews; Statistical analysis; Statistical models; Time series; clinical laboratory test; missing data imputation

Summary:	Clinical data are increasingly being mined to derive new medical knowledge with the goal of enabling greater diagnostic precision, better-personalized therapeutic regimens, improved clinical outcomes and more efficient utilization of health-care resources. However, clinical data are often only available at irregular intervals that vary between patients and types of data, with entries often being unmeasured or unknown. As a result, missing data are often one of the major impediments to optimal knowledge derivation from clinical data. The Data Analytics Challenge on Missing data Imputation (DACMI) presented a shared clinical dataset with ground truth for evaluating and advancing the state of the art in imputing missing data for clinical time series. We extracted 13 commonly measured blood laboratory tests. To evaluate imputation performance, we randomly removed one recorded result per laboratory test per patient admission and used the removed results as the ground truth. To the best of our knowledge, DACMI is the first shared-task challenge on clinical time series imputation. The challenge attracted 12 international teams spanning three continents across multiple industries and academia. The evaluation outcome suggests that competitive machine learning and statistical models (e.g. LightGBM, MICE and XGBoost) coupled with carefully engineered temporal and cross-sectional features can achieve strong imputation performance. However, care needs to be taken to prevent overblown model complexity. The participating systems collectively experimented with a wide range of machine learning and probabilistic algorithms to combine temporal imputation and cross-sectional imputation, and their design principles will inform future efforts to better model clinical missing data.
ISSN:	1467-5463, 1477-4054
DOI:	10.1093/bib/bbab489
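
The evaluation setup described in the summary (randomly removing one recorded result per laboratory test per patient admission and holding it out as ground truth) can be illustrated with a minimal sketch. This is not the challenge organizers' code: the record layout (`admission_id`, `test`, `time`, `value`), the function name `mask_one_per_test` and the sample values are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def mask_one_per_test(records, seed=0):
    """Hide one recorded value per (admission, test) pair.

    `records` is a list of dicts with the hypothetical keys
    admission_id, test, time and value (None = not recorded).
    Returns a masked copy of the records and a dict mapping the
    index of each hidden record to its true value.
    """
    rng = random.Random(seed)

    # Collect the indices of values that were actually recorded.
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        if rec["value"] is not None:
            groups[(rec["admission_id"], rec["test"])].append(idx)

    masked = [dict(rec) for rec in records]  # leave the input untouched
    truth = {}
    for key in sorted(groups):  # sorted so a given seed is reproducible
        idx = rng.choice(groups[key])
        truth[idx] = masked[idx]["value"]
        masked[idx]["value"] = None
    return masked, truth


# Illustrative data only; these are not values from the DACMI dataset.
records = [
    {"admission_id": 1, "test": "sodium", "time": 0, "value": 140.0},
    {"admission_id": 1, "test": "sodium", "time": 6, "value": 138.0},
    {"admission_id": 1, "test": "glucose", "time": 0, "value": 95.0},
    {"admission_id": 2, "test": "sodium", "time": 0, "value": 142.0},
    {"admission_id": 2, "test": "sodium", "time": 8, "value": None},
]
masked, truth = mask_one_per_test(records, seed=42)
```

An imputation method would then be run on `masked`, and its predictions at the indices in `truth` compared with the held-out values. Note that a pair with a single recorded result (such as glucose in admission 1 here) loses its only value under this sketch; how such cases should be handled is left open.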