Biases and Errors on Allele Frequency Estimation and Disease Association Tests of Next-Generation Sequencing of Pooled Samples

Bibliographic Details
Published in	Genetic epidemiology Vol. 36; no. 6; pp. 549 - 560
Main Authors	Chen, Xiaowei; Listman, Jennifer B.; Slack, Frank J.; Gelernter, Joel; Zhao, Hongyu
Format	Journal Article
Language	English
Published	United States; Blackwell Publishing Ltd; 01.09.2012; Wiley Subscription Services, Inc
Subjects	allele frequency estimation; Alleles; Bias; Bioinformatics; Computer Simulation; Data processing; Disease; disease association tests; Electronic Data Processing - methods; Gene Frequency; Gene polymorphism; Genetic Testing - methods; Humans; Models, Genetic; next-generation sequencing; Polymorphism, Single Nucleotide; pooled sequencing; Population genetics; Research Design; Sequence Analysis, DNA - methods; Single-nucleotide polymorphism; Software
Summary:	Next‐generation sequencing is widely used to study complex diseases because of its ability to identify both common and rare variants without prior single nucleotide polymorphism (SNP) information. Pooled sequencing of implicated target regions can lower costs and allow more samples to be analyzed, thus improving statistical power for disease‐associated variant detection. Several methods for disease association tests of pooled data and for optimal pooling designs have been developed under certain assumptions of the pooling process, for example, equal/unequal contributions to the pool, sequencing depth variation, and error rate. However, these simplified assumptions may not portray the many factors affecting pooled sequencing data quality, such as PCR amplification during target capture and sequencing, reference allele preferential bias, and others. As a result, the properties of the observed data may differ substantially from those expected under the simplified assumptions. Here, we use real datasets from targeted sequencing of pooled samples, together with microarray SNP genotypes of the same subjects, to identify and quantify factors (biases and errors) affecting the observed sequencing data. Through simulations, we find that these factors have a significant impact on the accuracy of allele frequency estimation and the power of association tests. Furthermore, we develop a workflow protocol to incorporate these factors in data analysis to reduce the potential biases and errors in pooled sequencing data and to gain better estimation of allele frequencies. The workflow, Psafe, is available at http://bioinformatics.med.yale.edu/group/.
Bibliography:	National Institutes of Health - No. N01-HG-65403; NIH - No. RR19895; istex:7C5156200E06952BA496668A03A1FCCEA06FD2B4; ark:/67375/WNG-MZLNB4TD-L; ArticleID:GEPI21648
ISSN:	0741-0395; 1098-2272
DOI:	10.1002/gepi.21648