Machine learning algorithm validation with a limited sample size

Bibliographic Details
Published in	PLoS ONE, Vol. 14, no. 11, p. e0224365
Main Authors	Vabalas, Andrius; Gowen, Emma; Poliakoff, Ellen; Casson, Alexander J.
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 07.11.2019
Subjects	Accuracy; Algorithms; Artificial intelligence; Autism; Bias; Bioinformatics; Biological markers; Biology and Life Sciences; Biomarkers; Biomedical Research - statistics & numerical data; Brain research; Classification; Computer and Information Sciences; Computer simulation; Data analysis; Data collection; Data Interpretation, Statistical; Data mining; Datasets; Diagnostic imaging; Estimates; Humans; Internet of Things; Learning algorithms; Machine Learning; Medical imaging; Medicine and Health Sciences; Methods; Neuroimaging; Neurology; Noise; Normal distribution; Parameters; Pattern recognition; Physical Sciences; Research and Analysis Methods; Sample Size; Social Sciences; Spectrum analysis; Studies; Technology; Test procedures; Tracking; United Kingdom > UK England

Summary:	Advances in neuroimaging, genomics, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies that applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by the use of validation methods that do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes to bias considerably more than parameter tuning. In addition, the contribution to bias of data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared using discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
Bibliography:	Competing Interests: The authors have declared that no competing interests exist.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0224365
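
The Summary's point that feature selection on pooled training and testing data inflates accuracy can be illustrated with a small toy simulation. The sketch below is not the authors' simulation code: the sample size, feature count, nearest-centroid classifier, mean-difference feature ranking and all function names are assumptions chosen only to keep the example short and dependent on nothing beyond the Python standard library. The data are pure noise, so an unbiased estimate should sit near chance (0.5).

```python
import random
import statistics


def select_features(X, y, k):
    """Return indices of the k features with the largest class-mean difference."""
    scores = []
    for j in range(len(X[0])):
        class0 = [row[j] for row, label in zip(X, y) if label == 0]
        class1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(statistics.fmean(class0) - statistics.fmean(class1)))
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]


def centroid(X, y, label, features):
    """Mean of the chosen features over all rows belonging to one class."""
    rows = [row for row, lab in zip(X, y) if lab == label]
    return [statistics.fmean(row[j] for row in rows) for j in features]


def cv_accuracy(X, y, k_features, n_folds, leaky):
    """K-fold CV accuracy of a nearest-centroid classifier.

    leaky=True  selects features once on the pooled data (train + test).
    leaky=False selects features inside each fold, on training rows only.
    """
    n = len(X)
    pooled = select_features(X, y, k_features) if leaky else None
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        features = pooled if leaky else select_features(X_train, y_train, k_features)
        centroids = {lab: centroid(X_train, y_train, lab, features) for lab in (0, 1)}
        for i in test_idx:
            point = [X[i][j] for j in features]
            dists = {
                lab: sum((a - b) ** 2 for a, b in zip(point, c))
                for lab, c in centroids.items()
            }
            correct += min(dists, key=dists.get) == y[i]
    return correct / n


# Assumed toy setting: 40 samples, 1000 pure-noise features, balanced labels.
rng = random.Random(0)
n_samples, n_features = 40, 1000
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # labels carry no information about X

leaky_acc = cv_accuracy(X, y, k_features=10, n_folds=5, leaky=True)
clean_acc = cv_accuracy(X, y, k_features=10, n_folds=5, leaky=False)
print(f"feature selection on pooled data:  {leaky_acc:.2f}")
print(f"feature selection inside each fold: {clean_acc:.2f}")
```

Because the features are unrelated to the labels, any accuracy well above 0.5 in the pooled variant reflects leakage from the test rows into feature selection rather than real signal. Keeping every data-dependent step inside the training portion of each fold is the same principle that nested CV extends to parameter tuning.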