Biases and Errors on Allele Frequency Estimation and Disease Association Tests of Next-Generation Sequencing of Pooled Samples

Bibliographic Details
Published in Genetic Epidemiology Vol. 36; no. 6; pp. 549-560
Main Authors Chen, Xiaowei; Listman, Jennifer B.; Slack, Frank J.; Gelernter, Joel; Zhao, Hongyu
Format Journal Article
Language English
Published United States Blackwell Publishing Ltd 01.09.2012
Wiley Subscription Services, Inc
Summary: Next‐generation sequencing is widely used to study complex diseases because of its ability to identify both common and rare variants without prior single nucleotide polymorphism (SNP) information. Pooled sequencing of implicated target regions can lower costs and allow more samples to be analyzed, thus improving statistical power for disease‐associated variant detection. Several methods for disease association tests of pooled data and for optimal pooling designs have been developed under certain assumptions of the pooling process, for example, equal/unequal contributions to the pool, sequencing depth variation, and error rate. However, these simplified assumptions may not portray the many factors affecting pooled sequencing data quality, such as PCR amplification during target capture and sequencing, reference allele preferential bias, and others. As a result, the properties of the observed data may differ substantially from those expected under the simplified assumptions. Here, we use real datasets from targeted sequencing of pooled samples, together with microarray SNP genotypes of the same subjects, to identify and quantify factors (biases and errors) affecting the observed sequencing data. Through simulations, we find that these factors have a significant impact on the accuracy of allele frequency estimation and the power of association tests. Furthermore, we develop a workflow protocol to incorporate these factors in data analysis to reduce the potential biases and errors in pooled sequencing data and to gain better estimation of allele frequencies. The workflow, Psafe, is available at http://bioinformatics.med.yale.edu/group/.
Bibliography: National Institutes of Health - No. N01-HG-65403
istex:7C5156200E06952BA496668A03A1FCCEA06FD2B4
ark:/67375/WNG-MZLNB4TD-L
NIH - No. RR19895
ArticleID:GEPI21648
ISSN: 0741-0395
1098-2272
DOI: 10.1002/gepi.21648