Label scarcity in biomedicine: Data-rich latent factor discovery enhances phenotype prediction

Bibliographic Details
Main Authors	Schulz, Marc-Andre; Thirion, Bertrand; Gramfort, Alexandre; Varoquaux, Gaël; Bzdok, Danilo
Format	Journal Article
Language	English
Published	12.10.2021
Subjects	Computer Science - Learning; Quantitative Biology - Neurons and Cognition
DOI	10.48550/arxiv.2110.06135

Summary:	High-quality data accumulation is now becoming ubiquitous in the health domain. There is increasing opportunity to exploit rich data from normal subjects to improve supervised estimators in specific diseases with notorious data scarcity. We demonstrate that low-dimensional embedding spaces can be derived from the UK Biobank population dataset and used to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics. Phenotype predictions facilitated by Variational Autoencoder manifolds typically scaled better with increasing unlabeled data than dimensionality reduction by PCA or Isomap. Performance gains from semi-supervised approaches will probably become an important ingredient for various medical data science applications.
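The workflow described in the summary (learn an embedding from plentiful unlabeled data, then fit a supervised model on a small labeled sample) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: synthetic data stands in for UK Biobank, PCA (one of the baselines named in the summary) stands in for the Variational Autoencoder, and the sample sizes, noise levels, ridge penalty, and all helper names (`sample`, `fit_pca`, `fit_ridge`, `embed`, `r2`) are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: synthetic stand-in data in which 3 latent factors
# generate 50 observed features and also determine the phenotype y.
n_latent, n_features = 3, 50
mixing = rng.normal(size=(n_latent, n_features))
beta = np.array([1.0, -2.0, 0.5])


def sample(n):
    """Draw n subjects: observed features X and phenotype y."""
    z = rng.normal(size=(n, n_latent))
    X = z @ mixing + 0.5 * rng.normal(size=(n, n_features))
    y = z @ beta + 0.1 * rng.normal(size=n)
    return X, y


X_unlabeled, _ = sample(2000)   # large pool, labels discarded
X_small, y_small = sample(20)   # scarce labeled sample
X_test, y_test = sample(500)    # held-out evaluation set


def fit_pca(X, k):
    """Return the feature mean and the top-k principal directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k].T


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def r2(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Step 1: learn the embedding from unlabeled data only.
mean, components = fit_pca(X_unlabeled, n_latent)


def embed(X):
    """Project features into the learned low-dimensional space."""
    return (X - mean) @ components


# Step 2: fit the supervised model on the few labeled subjects,
# once in the embedding space and once on the raw features.
w_emb, b_emb = fit_ridge(embed(X_small), y_small)
r2_embedding = r2(y_test, embed(X_test) @ w_emb + b_emb)

w_raw, b_raw = fit_ridge(X_small, y_small)
r2_raw = r2(y_test, X_test @ w_raw + b_raw)

print(f"held-out R^2, embedding + ridge: {r2_embedding:.3f}")
print(f"held-out R^2, raw features + ridge: {r2_raw:.3f}")
```

Varying the size of `X_unlabeled` while holding the labeled sample fixed gives a toy version of the scaling comparison the summary refers to. Swapping `fit_pca` for a nonlinear encoder would be the analogous place to plug in an autoencoder.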