Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets

Bibliographic Details
Published in	American journal of human genetics Vol. 106; no. 5; pp. 679 - 693
Main Authors	Yang, Sheng, Zhou, Xiang
Format	Journal Article
Language	English
Published	United States: Elsevier Inc, 07.05.2020
Subjects	Bayes Theorem; complex traits; Databases, Factual - standards; Datasets as Topic - standards; deterministic Bayesian sparse linear mixed model; Female; Humans; Linear Models; Male; Multifactorial Inheritance; polygenic risk score; polygenic score; Polymorphism, Single Nucleotide; Reproducibility of Results; Sample Size; UK Biobank; United Kingdom; White People - genetics

More Information
Summary:	Accurate construction of polygenic scores (PGS) can enable early diagnosis of diseases and facilitate the development of personalized medicine. Accurate PGS construction requires prediction models that are both adaptive to different genetic architectures and scalable to biobank scale datasets with millions of individuals and tens of millions of genetic variants. Here, we develop such a method called Deterministic Bayesian Sparse Linear Mixed Model (DBSLMM). DBSLMM relies on a flexible modeling assumption on the effect size distribution to achieve robust and accurate prediction performance across a range of genetic architectures. DBSLMM also relies on a simple deterministic search algorithm to yield an approximate analytic estimation solution using summary statistics only. The deterministic search algorithm, when paired with further algebraic innovations, results in substantial computational savings. With simulations, we show that DBSLMM achieves scalable and accurate prediction performance across a range of realistic genetic architectures. We then apply DBSLMM to analyze 25 traits in UK Biobank. For these traits, compared to existing approaches, DBSLMM achieves an average of 2.03%–101.09% accuracy gain in internal cross-validations. In external validations on two separate datasets, including one from BioBank Japan, DBSLMM achieves an average of 14.74%–522.74% accuracy gain. In these real data applications, DBSLMM is 1.03–28.11 times faster and uses only 7.4%–24.8% of physical memory as compared to other multiple regression-based PGS methods. Overall, DBSLMM represents an accurate and scalable method for constructing PGS in biobank scale datasets.
ISSN:	0002-9297, 1537-6605
DOI:	10.1016/j.ajhg.2020.03.013