Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

Bibliographic Details
Published in Nature genetics Vol. 56; no. 8; pp. 1604-1613
Main Authors Yun, Taedong, Cosentino, Justin, Behsaz, Babak, McCaw, Zachary R., Hill, Davin, Luben, Robert, Lai, Dongbing, Bates, John, Yang, Howard, Schwantes-An, Tae-Hwi, Zhou, Yuchen, Khawaja, Anthony P., Carroll, Andrew, Hobbs, Brian D., Cho, Michael H., McLean, Cory Y., Hormozdiari, Farhad
Format Journal Article
Language English
Published New York Nature Publishing Group US 01.08.2024
Nature Publishing Group
Summary: Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very little labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD: spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE embeddings are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE embeddings contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.

Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE) uses machine learning to generate low-dimensional representations of healthcare data. Applied to lung spirograms and blood volume photoplethysmograms, REGLE factors capture additional information beyond expert-defined features, suggesting the utility of this approach.
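
The second stage described in the summary, where each embedding coordinate is treated as a phenotype in a GWAS, can be illustrated with a minimal sketch. This is not the authors' pipeline: the encoder is not shown (the embeddings are simulated rather than produced by a variational autoencoder), the function names `association_test`, `gwas_on_embeddings`, and `simulate_cohort` are hypothetical, no covariates or relatedness adjustments are included, and the p-value uses a large-sample normal approximation instead of the exact t distribution.

```python
import math
import random


def association_test(dosages, phenotype):
    """Simple linear regression of one phenotype on one variant.

    dosages: allele counts (0, 1, 2) per individual.
    Returns (beta, standard error, two-sided p-value).
    The p-value uses a normal approximation (assumes a large sample).
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0.0:
        raise ValueError("monomorphic variant: no genotype variation")
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotype))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(dosages, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value


def gwas_on_embeddings(genotype_matrix, embeddings):
    """Test every variant against every embedding coordinate.

    genotype_matrix: list of variants, each a list of per-individual dosages.
    embeddings: list of individuals, each a list of embedding coordinates.
    Returns {(variant_index, coordinate_index): (beta, se, p_value)}.
    """
    results = {}
    for k in range(len(embeddings[0])):
        phenotype = [row[k] for row in embeddings]
        for j, dosages in enumerate(genotype_matrix):
            results[(j, k)] = association_test(dosages, phenotype)
    return results


def simulate_cohort(n=2000, maf=0.3, effect=0.5, seed=0):
    """Toy data: variant 0 shifts embedding coordinate 0; everything else is noise."""
    rng = random.Random(seed)

    def draw_dosages():
        # Sum of two Bernoulli(maf) draws gives a dosage in {0, 1, 2}.
        return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]

    genotype_matrix = [draw_dosages(), draw_dosages()]
    embeddings = [
        [effect * genotype_matrix[0][i] + rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        for i in range(n)
    ]
    return genotype_matrix, embeddings


if __name__ == "__main__":
    genotypes, embeddings = simulate_cohort()
    for key, (beta, se, p) in sorted(gwas_on_embeddings(genotypes, embeddings).items()):
        print(f"variant {key[0]}, coordinate {key[1]}: beta={beta:.3f} se={se:.3f} p={p:.2e}")
```

In practice a dedicated GWAS tool with covariate adjustment and population-structure correction would replace the per-variant regression above; the sketch only shows how low-dimensional embeddings slot in where a single expert-defined phenotype would normally go.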
ISSN: 1061-4036, 1546-1718
DOI: 10.1038/s41588-024-01831-6