Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to electronic health records


Bibliographic Details
Published in Frontiers in genetics Vol. 5; p. 352
Main Authors Crosslin, David R., Tromp, Gerard, Burt, Amber, Kim, Daniel S., Verma, Shefali S., Lucas, Anastasia M., Bradford, Yuki, Crawford, Dana C., Armasu, Sebastian M., Heit, John A., Hayes, M. Geoffrey, Kuivaniemi, Helena, Ritchie, Marylyn D., Jarvik, Gail P., de Andrade, Mariza
Format Journal Article
Language English
Published Switzerland Frontiers Media S.A 04.11.2014
Summary: Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of multi-ethnic cohorts with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach of generating principal components controlled for site and platform bias, in addition to ancestry differences, has the advantage of fewer covariates and degrees of freedom.
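The reference-sample idea in the summary can be illustrated with a small sketch. This is not the authors' implementation: it is one plausible reading in which variant loadings for the first component are estimated from a reference sample only, and study subjects are then projected onto those loadings. The function names (`fit_scaling`, `leading_loading`, `project`) and the toy genotype matrices are hypothetical, and a simple power iteration stands in for a full eigendecomposition so that the example needs only the Python standard library.

```python
import math

def fit_scaling(ref):
    """Per-variant mean and standard deviation, estimated from the reference only."""
    n, m = len(ref), len(ref[0])
    means = [sum(row[j] for row in ref) / n for j in range(m)]
    sds = []
    for j in range(m):
        var = sum((row[j] - means[j]) ** 2 for row in ref) / (n - 1)
        sds.append(math.sqrt(var) if var > 0 else 1.0)  # guard monomorphic variants
    return means, sds

def standardize(genos, means, sds):
    """Center and scale genotypes using the reference-derived parameters."""
    return [[(g - mu) / sd for g, mu, sd in zip(row, means, sds)] for row in genos]

def leading_loading(z, iters=100):
    """Power iteration for the leading eigenvector of Z^T Z (PC1 variant loadings)."""
    m = len(z[0])
    v = [1.0 / (j + 1) for j in range(m)]  # arbitrary deterministic start (assumption)
    for _ in range(iters):
        s = [sum(a * b for a, b in zip(row, v)) for row in z]               # Z v
        w = [sum(s[i] * z[i][j] for i in range(len(z))) for j in range(m)]  # Z^T (Z v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    if v[0] < 0:  # fix the arbitrary sign so results are reproducible
        v = [-x for x in v]
    return v

def project(genos, means, sds, loading):
    """PC1 scores for any subjects, using loadings learned from the reference."""
    z = standardize(genos, means, sds)
    return [sum(a * b for a, b in zip(row, loading)) for row in z]

# Hypothetical toy data: rows are subjects, columns are variants coded 0/1/2.
ref = [
    [2, 2, 0, 0], [2, 1, 0, 1], [1, 2, 1, 0],   # reference group A
    [0, 0, 2, 2], [0, 1, 2, 1], [1, 0, 1, 2],   # reference group B
]
study = [[2, 2, 0, 0], [0, 0, 2, 2], [1, 1, 1, 1]]  # e.g. subjects from another site

means, sds = fit_scaling(ref)
loading = leading_loading(standardize(ref, means, sds))
scores = project(study, means, sds, loading)
print(scores)
```

In this toy setup the first two study subjects land at opposite ends of the component and the third sits at zero. Because the loadings never see the study genotypes, study-specific structure such as site or platform effects cannot shape the component axes, which is the motivation the summary gives for the alternative approach.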
Bibliography:
This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics.
Edited by: Karen T. Cuenco, Genentech, USA
Reviewed by: Alexis C. Frazier-Wood, University of Alabama at Birmingham, USA; Tesfaye B. Mersha, Cincinnati Children's Hospital Medical Center, USA
These authors have contributed equally to this work.
ISSN: 1664-8021
DOI: 10.3389/fgene.2014.00352