Methods to impute missing genotypes for population data

Bibliographic Details
Published in	Human Genetics Vol. 122; no. 5; pp. 495-504
Main Authors	Yu, Zhaoxia, Schaid, Daniel J.
Format	Journal Article
Language	English
Published	Heidelberg: Springer, 01.12.2007; Berlin: Springer Nature B.V; New York, NY
Subjects	Algorithms; Analysis; Biological and medical sciences; Classical genetics, quantitative genetics, hybrids; Fundamental and applied biological sciences. Psychology; Generalized linear models; Genetic aspects; Genetic Markers; Genetics of eukaryotes. Biological and molecular evolution; Genetics, Population - statistics & numerical data; Genotype; Haplotypes; Human; Humans; Linear Models; Linkage Disequilibrium; Maximum likelihood method; Methods; Models, Genetic; Models, Statistical; Polymorphism, Single Nucleotide; Regression analysis; Single nucleotide polymorphisms; Statistical methods; Statistics, Nonparametric; Genetics; Method

More Information
Summary:	For large-scale genotyping studies, it is common for most subjects to have some missing genetic markers, even if the missing rate per marker is low. This compromises association analyses, because the number of subjects contributing varies from one single-marker or multi-marker analysis to the next. In this paper, we consider eight methods to infer missing genotypes: two haplotype reconstruction methods (local expectation maximization, EM, and fastPHASE), two k-nearest neighbor methods (the original k-nearest neighbor, KNN, and a weighted k-nearest neighbor, wtKNN), three linear regression methods (backward variable selection, LM.back; least angle regression, LM.lars; and singular value decomposition, LM.svd), and a regression tree (Rtree). We evaluate their accuracy using single nucleotide polymorphism (SNP) data from the HapMap project under a variety of conditions and parameters. We find that fastPHASE has the lowest error rates across different analysis panels and marker densities. LM.lars gives slightly less accurate estimates of missing genotypes than fastPHASE but performs better than the other methods.
ISSN:	0340-6717 1432-1203
DOI:	10.1007/s00439-007-0427-y
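
Illustrative sketch: the summary names k-nearest neighbor (KNN) imputation and compares methods by error rate, but this record does not give the paper's distance measure, choice of k, or evaluation protocol. The Python below is therefore only a minimal sketch under assumptions, not the authors' implementation: genotypes are coded as minor-allele counts (0, 1, 2), `None` marks a missing call, distance is the proportion of mismatching genotypes at markers observed in both subjects, and the names `knn_impute`, `masked_error_rate`, and `panel` are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing genotype call


def knn_impute(genotypes, subject, marker, k=3):
    """Impute genotypes[subject][marker] by majority vote among the k
    subjects most similar to `subject` at the other markers."""
    target = genotypes[subject]
    scored = []
    for i, other in enumerate(genotypes):
        # A neighbor must have an observed genotype at the marker to impute.
        if i == subject or other[marker] is MISSING:
            continue
        # Compare only markers observed in both subjects.
        shared = [(a, b)
                  for j, (a, b) in enumerate(zip(target, other))
                  if j != marker and a is not MISSING and b is not MISSING]
        if not shared:
            continue
        # Assumed distance: proportion of mismatching genotypes.
        dist = sum(a != b for a, b in shared) / len(shared)
        scored.append((dist, i))
    scored.sort()  # nearest first; ties broken by subject index
    votes = Counter(genotypes[i][marker] for _, i in scored[:k])
    if not votes:
        return MISSING
    top = max(votes.values())
    # Break vote ties deterministically by the smallest genotype code.
    return min(g for g, count in votes.items() if count == top)


def masked_error_rate(genotypes, masked_cells, k=3):
    """Hide each known genotype in turn, impute it, and return the
    fraction of hidden genotypes that were imputed incorrectly."""
    errors = 0
    for s, m in masked_cells:
        truth = genotypes[s][m]
        work = [row[:] for row in genotypes]  # copy so the input is untouched
        work[s][m] = MISSING
        if knn_impute(work, s, m, k) != truth:
            errors += 1
    return errors / len(masked_cells)


# Toy panel (made-up values, not HapMap data): rows are subjects,
# columns are SNP markers.
panel = [
    [0, 0, 1, 2],
    [0, 0, 1, 2],
    [2, 2, 1, 0],
    [2, 2, 1, 0],
    [0, 0, 1, None],
]

imputed = knn_impute(panel, subject=4, marker=3, k=2)
error = masked_error_rate(panel, [(0, 3), (2, 3)], k=1)
```

Masking genotypes that are actually known and counting how often the imputed value disagrees is one common way to obtain an error rate of the kind the summary reports. A weighted variant in the spirit of wtKNN could weight each neighbor's vote by its similarity; the weighting the paper uses is not stated in this record.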