GBScleanR: Robust genotyping error correction using hidden Markov model with error pattern recognition

Bibliographic Details
Published in	bioRxiv
Main Authors	Furuta, Tomoyuki; Yamamoto, Toshio; Ashikari, Motoyuki
Format	Paper
Language	English
Published	Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 22.03.2022
Subjects	Alleles; Error correction & detection; Genotype & phenotype; Genotypes; Genotyping; Markov chains; Pattern recognition

Summary:	Advances in sequencing technology have enabled researchers to acquire genotype data from large populations with dense markers. Recently developed methods based on reduced representation sequencing (RRS), such as genotyping by sequencing (GBS), provide cost-effective and time-saving genotyping platforms; however, drawbacks associated with these technologies, such as missing calls and false homozygous calls at heterozygous sites, significantly reduce genotyping accuracy. Several error correction methods that incorporate allele read counts into a hidden Markov model (HMM) have been developed to overcome these issues. Those methods assume that all markers share a uniform error rate with no bias in the allele read ratio, that is, a 50% chance of obtaining a read for either allele at a heterozygous site. However, bias does occur because of uneven amplification of genomic fragments and read mismapping. In this paper, we introduce a novel error correction tool, GBScleanR, which enables robust and precise error correction for noisy RRS-based genotype data by incorporating marker-specific error rates into the HMM. The results indicate that GBScleanR improves accuracy by 10-40 percentage points compared with existing tools on simulated datasets and achieves the most reliable genotype estimation on real data, even with error-prone markers.
Competing Interest Statement:	The authors have declared no competing interest.
Footnotes:	* Added a few lines to the Results section on the algorithm evaluation using real data.
DOI:	10.1101/2022.03.18.484886
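
Illustrative note: the summary contrasts a fixed 50% allele read ratio at heterozygous sites with a marker-specific ratio. The Python sketch below shows why that matters, using a plain binomial read-count likelihood of the kind an HMM could use as its emission probability. It is a simplified illustration and not GBScleanR's actual model: the function names, the parameters `error_rate` and `ref_bias`, the uniform genotype prior, and the omission of HMM transitions between neighbouring markers are all assumptions made for this example.

```python
from math import comb

# Genotype codes used in this sketch (hypothetical, not from the paper):
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternative.


def read_likelihood(ref_reads, alt_reads, genotype, error_rate=0.01, ref_bias=0.5):
    """Binomial probability of the observed read counts given a genotype.

    error_rate: assumed chance that a read from a homozygous site shows
                the wrong allele.
    ref_bias:   assumed chance that a read from a heterozygous site shows
                the reference allele (0.5 means no bias).
    """
    # Probability that a single read carries the reference allele.
    p_ref = {0: 1.0 - error_rate, 1: ref_bias, 2: error_rate}[genotype]
    total_reads = ref_reads + alt_reads
    return (
        comb(total_reads, ref_reads)
        * p_ref ** ref_reads
        * (1.0 - p_ref) ** alt_reads
    )


def genotype_posteriors(ref_reads, alt_reads, error_rate=0.01, ref_bias=0.5):
    """Posterior over the three genotypes, assuming a uniform prior."""
    likelihoods = [
        read_likelihood(ref_reads, alt_reads, g, error_rate, ref_bias)
        for g in (0, 1, 2)
    ]
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


# Four reference reads and no alternative reads at one marker.
post_unbiased = genotype_posteriors(4, 0)              # assumes a 50:50 ratio
post_biased = genotype_posteriors(4, 0, ref_bias=0.9)  # assumes a reference-biased marker
print("P(het), no bias:  ", post_unbiased[1])
print("P(het), with bias:", post_biased[1])
```

With the same read counts, the heterozygous genotype receives more posterior weight once the marker is assumed to favour reference reads. In this toy setting, that is the kind of false homozygous call a fixed 50% assumption makes more likely.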